
Human Motion Analysis, Lecture 6: Bayesian Filtering

Raquel Urtasun

TTI Chicago

March 29, 2010


Materials used for this lecture

This lecture is based on Zhe Chen's paper "Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond".

I would like to thank David Fleet for his slides on the subject.

To learn more about sampling, look at David MacKay's book "Information Theory, Inference, and Learning Algorithms", Cambridge University Press (2003).

Contents of today's lecture

We will look into:

Stochastic Filtering Theory: Kalman filtering (1940s, by Wiener and Kolmogorov).

Bayesian Theory and Bayesian Filtering (Bayes 1763, rediscovered by Laplace).

Monte Carlo methods and Monte Carlo Filtering (Buffon 1777; modern versions in the 1940s in physics and the 1950s in statistics).

Monte Carlo approaches

Monte Carlo techniques are stochastic sampling approaches that aim to tackle complex systems which are analytically intractable.

Sequential Monte Carlo allows on-line estimation by combining Monte Carlo sampling methods with Bayesian inference.

The particle filter is a sequential Monte Carlo method used for parameter estimation and state estimation.

A particle filter uses a number of independent random variables called particles, sampled directly from the state space, to represent the posterior probability, and updates the posterior by incorporating the new observations; the particle system is properly located, weighted, and propagated recursively according to Bayes' rule.

Particle filters are not the only way to tackle Bayesian filtering; alternatives include differential geometry, variational methods, and conjugate methods.

A few words on Particle Filters

Kalman filtering is a special case of Bayesian filtering with linear, quadratic, and Gaussian assumptions (LQG).

We will look into the more general case of non-linear, non-Gaussian, and non-stationary distributions.

Generally, for non-linear filtering no exact solution can be computed, hence we rely on numerical approximation methods.

We will focus on sequential Monte Carlo (i.e., the particle filter).

Notation

y: the observations
x: the state
N: number of samples
y_{n:0}: observations up to time n
x_{n:0}: states up to time n
x_n^{(i)}: i-th sample at time n

Concept of sampling

The true distribution P(x) can be approximated by an empirical distribution

\hat{P}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)})

where \int_X dP(x) = \int_X p(x)\,dx = 1.

Figure: Sample approximation to the density of a probability distribution (Chen 03).

Some useful definitions

Definition
Filtering is an operation that involves the extraction of information about a quantity of interest at time t by using data measured up to and including t.

Definition
Prediction derives information about what the quantity of interest will be at time t + τ in the future (τ > 0) by using data measured up to and including time t.

Definition
Smoothing derives information about the quantity of interest at time t' < t by using data measured up to and including time t (i.e., in the interval [0, t]).


Stochastic filtering problem

The generic stochastic filtering problem:

x_t = f(t, x_t, u_t, w_t)   (state equation)
y_t = g(t, x_t, u_t, v_t)   (measurement equation)

where u_t is the system input vector, x_t the state vector, y_t the observations, w_t and v_t are the process noise and the measurement noise, and f and g are functions which are potentially time varying.

Figure: A graphical model of the state space model (Chen 03).

Simplified model: discrete case

The generic stochastic filtering problem:

x_t = f(t, x_t, u_t, w_t)   (state equation)
y_t = g(t, x_t, u_t, v_t)   (measurement equation)

In practice we are interested in the discrete, simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

Note (figure): careful, today there is a change of notation; z is now x, and x is now y.

These equations are characterized by the state transition probability p(x_{n+1}|x_n) and the likelihood p(y_n|x_n).
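As a concrete illustration of this discrete state-space model, here is a minimal simulation sketch in Python, assuming a 1-D nonlinear system with additive Gaussian noise (the particular f, g, and noise variances are illustrative choices, not part of the lecture):

import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # hypothetical nonlinear state transition x_{n+1} = f(x_n, w_n)
    return 0.5 * x + 25 * x / (1 + x**2) + w

def g(x, v):
    # hypothetical measurement y_n = g(x_n, v_n)
    return x**2 / 20 + v

T = 50
x = np.zeros(T)
y = np.zeros(T)
x[0] = rng.normal(0, 1)
y[0] = g(x[0], rng.normal(0, 1))
for n in range(T - 1):
    x[n + 1] = f(x[n], rng.normal(0, np.sqrt(10)))  # process noise w_n
    y[n + 1] = g(x[n + 1], rng.normal(0, 1))        # measurement noise v_n

The filtering problem is then: given y_{n:0} generated this way, recover an estimate of x_n.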

Stochastic filtering is an inverse problem

Given y_{n:0}, provided f and g are known, one needs to find the best estimate x_n.

This is an inverse problem: find the inputs sequentially with a mapping function which yields the output data.

This is an ill-posed problem, since the inverse learning problem is one-to-many: the mapping from output to input is generally non-unique.

Definition
A problem is well-posed if it satisfies existence, uniqueness and stability.

Intractable Bayesian problems

Normalization: given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained by dividing by the normalization factor p(y):

p(x|y) = \frac{p(y|x)p(x)}{\int_X p(y|x)p(x)\,dx}

Marginalization: given the joint posterior, the marginal posterior is

p(x|y) = \int_Z p(x, z|y)\,dz

Expectation:

E_{p(x|y)}[f(x)] = \int_X f(x)\,p(x|y)\,dx

Recursive Bayesian estimation I

Let p(x_n|y_{n:0}) be the conditional pdf of x_n:

p(x_n|y_{n:0}) = \frac{p(y_{n:0}|x_n)p(x_n)}{p(y_{n:0})}
= \frac{p(y_n, y_{n-1:0}|x_n)p(x_n)}{p(y_n, y_{n-1:0})}
= \frac{p(y_n|y_{n-1:0}, x_n)p(y_{n-1:0}|x_n)p(x_n)}{p(y_n|y_{n-1:0})p(y_{n-1:0})}
= \frac{p(y_n|y_{n-1:0}, x_n)p(x_n|y_{n-1:0})p(y_{n-1:0})p(x_n)}{p(y_n|y_{n-1:0})p(y_{n-1:0})p(x_n)}
= \frac{p(y_n|x_n)p(x_n|y_{n-1:0})}{p(y_n|y_{n-1:0})}

(The last step uses the fact that, given x_n, the new observation y_n is conditionally independent of the past observations.)


Recursive Bayesian estimation II

The posterior density is described by three terms:

p(x_n|y_{n:0}) = \frac{p(y_n|x_n)p(x_n|y_{n-1:0})}{p(y_n|y_{n-1:0})}

Prior: defines the knowledge of the model,

p(x_n|y_{n-1:0}) = \int p(x_n|x_{n-1})p(x_{n-1}|y_{n-1:0})\,dx_{n-1}

Likelihood: p(y_n|x_n) determines the measurement noise model.

Evidence: which involves

p(y_n|y_{n-1:0}) = \int p(y_n|x_n)p(x_n|y_{n-1:0})\,dx_n

We need to define a criterion for optimal filtering.

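The prediction (prior) and update steps above can be implemented directly when the state is discretized on a grid. Below is a minimal sketch, assuming the 1-D model from the earlier simulation sketch and a uniform grid; the grid limits and resolution are arbitrary illustrative choices:

import numpy as np
from scipy.stats import norm

# uniform discretization of the state (illustrative limits/resolution)
grid = np.linspace(-30, 30, 601)
dx = grid[1] - grid[0]
belief = norm.pdf(grid, 0, 1)          # p(x_0)
belief /= belief.sum() * dx

def predict(belief):
    # prior: p(x_n | y_{n-1:0}) = \int p(x_n | x_{n-1}) p(x_{n-1} | y_{n-1:0}) dx_{n-1}
    mean_next = 0.5 * grid + 25 * grid / (1 + grid**2)          # f without noise
    trans = norm.pdf(grid[:, None], mean_next[None, :], np.sqrt(10))
    prior = trans @ belief * dx
    return prior / (prior.sum() * dx)

def update(prior, y):
    # posterior: p(x_n | y_{n:0}) is proportional to p(y_n | x_n) p(x_n | y_{n-1:0})
    lik = norm.pdf(y, grid**2 / 20, 1)
    post = lik * prior
    return post / (post.sum() * dx)

# belief = update(predict(belief), y_n) is one full recursion step

Dividing by the sum in update plays the role of the evidence p(y_n | y_{n-1:0}).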

Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error

E[\|x_n - \hat{x}_n\|_2^2 \,|\, y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2 \, p(x_n|y_{n:0})\,dx_n

which aims to find the conditional mean

\hat{x}_n = E[x_n|y_{n:0}] = \int x_n \, p(x_n|y_{n:0})\,dx_n

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n|y_{n:0}).


Criteria for optimal filtering II

MMSE finds the mean.

MAP finds the mode.

Minimax finds the median.

Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for a multimodal distribution (Chen 03).

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)}\,dx\,dy

Minimum conditional KL divergence:

KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y)p(x)}\,dx\,dy

where the KL divergence is a non-negative measure of discrepancy between distributions, KL(p\|\hat{p}) \ge 0, and it is 0 only when the distributions are the same.


Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: it is a lower bound of the maximum log-likelihood, which is aimed to minimize

F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)]
= E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]
= KL(Q\|P) - H(Q)

This minimization can be done using the expectation-maximization (EM) algorithm:

Q(x_{n+1}) \leftarrow \arg\max_Q F(Q, P)
x_{n+1} \leftarrow \arg\max_x F(Q, P)

Which criterion to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented, and they are very related.

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

E[\|x_n - \hat{x}_n\|_2^2 \,|\, y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2 \, p(x_n|y_{n:0})\,dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.

Kalman filter revisited

In practice we are interested in the discrete, simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.
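As a reminder of what "propagating sufficient statistics" looks like in code, here is a minimal Kalman filter step, assuming known matrices F and G and noise covariances Q and R (names chosen to match the notation above, not any particular library API):

import numpy as np

def kalman_step(mu, V, y, F, G, Q, R):
    # mu, V: mean and covariance of p(x_{n-1} | y_{n-1:0}); y: new observation y_n
    mu_pred = F @ mu                      # predicted mean
    P = F @ V @ F.T + Q                   # predicted covariance
    S = G @ P @ G.T + R                   # innovation covariance
    K = P @ G.T @ np.linalg.inv(S)        # Kalman gain
    mu_new = mu_pred + K @ (y - G @ mu_pred)
    V_new = (np.eye(len(mu)) - K @ G) @ P
    return mu_new, V_new

Iterating this step over the observations propagates the mean and covariance exactly, which is what makes the linear Gaussian case analytically tractable.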

Kalman filter: Forward equations I

We start by defining the messages

\alpha(z_n) = N(z_n | \mu_n, V_n)

Using the HMM recursion formulas for continuous variables we have

c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1}) p(z_n|z_{n-1})\,dz_{n-1}

Substituting the conditionals we have

c_n N(z_n|\mu_n, V_n) = N(x_n|C z_n, \Sigma) \int N(z_{n-1}|\mu_{n-1}, V_{n-1}) N(z_n|A z_{n-1}, \Gamma)\,dz_{n-1}
= N(x_n|C z_n, \Sigma)\, N(z_n|A\mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

P_{n-1} = A V_{n-1} A^T + \Gamma

Kalman filter: Forward equations II

Given the values of \mu_{n-1} and V_{n-1}, and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\mu_n = A\mu_{n-1} + K_n(x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = N(x_n | C A \mu_{n-1}, C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

\mu_1 = \mu_0 + K_1(x_1 - C\mu_0),   V_1 = (I - K_1 C) V_0
c_1 = N(x_1 | C\mu_0, C V_0 C^T + \Sigma),   K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

The interpretation is that we make a prediction and then correct it with the gain K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^{N} c_n.

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF), or numerical approximations (e.g. particle filter) are usually used.

Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \left.\frac{d f(x)}{d x}\right|_{x = \hat{x}_n},
G_{n+1,n} = \left.\frac{d g(x)}{d x}\right|_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
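A minimal EKF step might look like the sketch below, with the Jacobians evaluated numerically by finite differences (the finite-difference helper and all names are illustrative assumptions, not part of the lecture):

import numpy as np

def jacobian(fun, x, eps=1e-6):
    # numerical Jacobian of fun at x (illustrative finite differences)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    f0 = np.atleast_1d(fun(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dxj = np.zeros_like(x)
        dxj[j] = eps
        J[:, j] = (np.atleast_1d(fun(x + dxj)) - f0) / eps
    return J

def ekf_step(mu, V, y, f, g, Q, R):
    F = jacobian(f, mu)                       # linearize the dynamics at the estimate
    mu_pred = np.atleast_1d(f(mu))
    P = F @ V @ F.T + Q
    G = jacobian(g, mu_pred)                  # linearize the measurement at the prediction
    S = G @ P @ G.T + R
    K = P @ G.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (np.atleast_1d(y) - np.atleast_1d(g(mu_pred)))
    V_new = (np.eye(mu_pred.size) - K @ G) @ P
    return mu_new, V_new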

Numerical approximations

Monte Carlo sampling approximation (i.e., the particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Monte Carlo sampling

It's a brute force technique which, provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), approximates

\int_X f(x)\,dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] with high probability.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim N(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

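A quick sketch of the O(N^{-1/2}) behaviour, assuming a standard normal target and f(x) = x^2, so that the true expectation is 1 (both choices are illustrative):

import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0                          # E[x^2] for x ~ N(0, 1)

for N in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(N)            # i.i.d. samples from P(x)
    f_hat = np.mean(x**2)                 # Monte Carlo estimate of E[f(x)]
    print(N, abs(f_hat - true_value))     # error shrinks roughly like 1/sqrt(N)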

Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\int_X f(x)\,dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x)\,dP(x)?

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance.

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx

Figure: Importance sampling (Chen 03).

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N}\sum_{i=1}^{N} W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.
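A minimal importance sampling sketch, assuming a standard normal target p, a heavy-tailed Student-t proposal q, and f(x) = x^2 (all illustrative choices; the heavy-tailed proposal anticipates the remark on tails a couple of slides ahead):

import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(0)
N = 100_000

x = t.rvs(df=3, size=N, random_state=rng)     # samples from the proposal q(x)
w = norm.pdf(x) / t.pdf(x, df=3)              # importance weights p(x)/q(x)

f = x**2                                      # E_p[x^2] = 1
estimate = np.mean(w * f)                     # importance sampling estimate
self_normalized = np.sum(w * f) / np.sum(w)   # variant used when p is unnormalized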

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized,

\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)}) f(x^{(i)})   with   \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x)W(x)] = \frac{1}{N} Var\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \left(\frac{f(x)p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to:

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.

Importance samplers can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure is:

for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03).

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood p(y|x) ≤ C, and C is known, then

p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
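A minimal rejection sampling sketch, assuming a Beta(2, 5) target and a uniform proposal on [0, 1] with bound C = 2.5 (all illustrative choices; the Beta(2, 5) density never exceeds about 2.46, so p(x) < C q(x) holds):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
C = 2.5                                  # known constant with p(x) < C q(x)

samples = []
while len(samples) < 10_000:
    u = rng.uniform(0, 1)
    x = rng.uniform(0, 1)                # proposal q(x) = U(0, 1), so q(x) = 1
    if u <= beta.pdf(x, 2, 5) / C:       # accept with probability p(x) / (C q(x))
        samples.append(x)
samples = np.array(samples)              # exact draws from the Beta(2, 5) target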

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0)\,p(x_1|x_0) \cdots p(x_n|x_0, \ldots, x_{n-1})
q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0) \cdots q_n(x_n|x_0, \ldots, x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}

Remarks on Sequential Importance Sampling

The advantage of SIS is that it does not rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with large weights.

Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ..., N do
  W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)}).
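A sketch of the resampling step using multinomial resampling (systematic or stratified schemes are common alternatives; this minimal version just draws indices with probability proportional to the normalized weights):

import numpy as np

def resample(particles, weights, rng=None):
    # duplicate heavy particles, drop light ones, reset weights to 1/N
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))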

Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
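A minimal Gibbs sampler sketch for a bivariate Gaussian with correlation rho, where both conditionals are available in closed form (the target and the value of rho are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the illustrative 2-D Gaussian target
M = 5_000

x1, x2 = 0.0, 0.0              # arbitrary starting state
samples = np.zeros((M, 2))
for n in range(M):
    # alternate the conditional draws p(x1 | x2) and p(x2 | x1)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[n] = (x1, x2)
# after burn-in, samples approximate draws from N(0, [[1, rho], [rho, 1]])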

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.

Numerical approximations

Monte Carlo sampling approximation (i.e., the particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\,dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x}) \left| -\nabla\nabla \log f(\hat{x}) \right|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

This works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

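A 1-D sketch of the Laplace approximation, assuming an illustrative unnormalized density f(x) = exp(-x^4/4 - x^2/2) whose mode and Hessian are found numerically:

import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    return np.exp(-x**4 / 4 - x**2 / 2)      # illustrative unnormalized density

def neg_log_f(x):
    return x**4 / 4 + x**2 / 2

x_hat = minimize_scalar(neg_log_f).x         # mode of f(x)

# numerical second derivative of -log f at the mode (the 1-D "Hessian")
eps = 1e-4
h = (neg_log_f(x_hat + eps) - 2 * neg_log_f(x_hat) + neg_log_f(x_hat - eps)) / eps**2

# Laplace approximation of the integral: (2*pi)^{1/2} f(x_hat) |h|^{-1/2}
integral_laplace = np.sqrt(2 * np.pi) * f(x_hat) / np.sqrt(h)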

Iterative Quadrature

A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states, and a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions of high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)}\right)^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j \, N(x; \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

With this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{n:0})\,dx_n
= \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\delta(x_n - x_n^{(i)})\,dx_n
= \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n)p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}q(x_n|y_{n:0})\,dx_n
= \int f(x_n)\frac{W_n(x_n)}{p(y_{n:0})}q(x_n|y_{n:0})\,dx_n
= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)p(x_n)\,dx_n}
= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)q(x_n|y_{n:0})\,dx_n}
= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
\propto \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})}
= W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it by

N_{eff} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(x_{n:0}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
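The effective sample size is a one-liner once the normalized weights are available; a small sketch:

import numpy as np

def effective_sample_size(weights):
    # N_eff is approximately 1 / sum_i w_i^2, with w_i the normalized importance weights
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w**2)

# a common choice is to resample only when N_eff drops below, say, N / 2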

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N}(\tilde{W}(x_{n:0}))^2
  if N_{eff} < P then
    Generate new x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
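Below is a minimal particle filter sketch following this pseudocode, using the transition prior as the proposal (so the weight update reduces to the likelihood) and the 1-D model assumed in the earlier simulation sketch; all model parameters are illustrative:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 500                                           # number of particles

def particle_filter(ys, threshold=N / 2):
    particles = rng.normal(0, 1, size=N)          # draws from p(x_0)
    weights = np.full(N, 1.0 / N)
    means = []
    for y in ys:
        # proposal = transition prior p(x_n | x_{n-1}) of the illustrative model
        particles = (0.5 * particles + 25 * particles / (1 + particles**2)
                     + rng.normal(0, np.sqrt(10), size=N))
        # weight update: W_n is proportional to W_{n-1} p(y_n | x_n)
        weights *= norm.pdf(y, particles**2 / 20, 1)
        weights /= weights.sum()
        means.append(np.sum(weights * particles)) # MMSE estimate of x_n
        # resample only when the effective sample size is too low
        if 1.0 / np.sum(weights**2) < threshold:
            idx = rng.choice(N, size=N, p=weights)
            particles = particles[idx]
            weights = np.full(N, 1.0 / N)
    return np.array(means)

With the transition prior as proposal and resampling at every step, this reduces to the bootstrap/SIR filter described on the next slides.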

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
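
For the bootstrap/SIR case the proposal is the transition prior and the weight is just the likelihood, so a self-contained sketch fits in a few lines. The scalar nonlinear model below is a standard toy benchmark from the particle-filtering literature; the constants, noise levels and number of particles are illustrative choices, not part of the lecture.

import numpy as np

rng = np.random.default_rng(1)

def f(x, n):                    # toy state transition (nonlinear benchmark)
    return 0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * np.cos(1.2 * n)

def g(x):                       # toy measurement function
    return x**2 / 20.0

# simulate a trajectory from the toy model
T, sigma_w, sigma_v = 50, np.sqrt(10.0), 1.0
x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = f(x_true[n-1], n) + sigma_w * rng.normal()
    y[n] = g(x_true[n]) + sigma_v * rng.normal()

# bootstrap (SIR) particle filter
N = 500
x = rng.normal(0.0, 2.0, size=N)                       # particles from p(x_0)
est = np.zeros(T)
for n in range(1, T):
    x = f(x, n) + sigma_w * rng.normal(size=N)         # importance sampling from p(x_n | x_{n-1})
    w = np.exp(-0.5 * ((y[n] - g(x)) / sigma_v)**2)    # W_n^{(i)} proportional to p(y_n | x_n^{(i)})
    w /= w.sum()
    est[n] = np.sum(w * x)                             # filtered mean, computed before resampling
    x = x[rng.choice(N, size=N, p=w)]                  # resample N particles according to the weights
print("mean absolute tracking error:", np.mean(np.abs(est[1:] - x_true[1:])))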

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is the annealing factor that controls the impact of previous importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
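
As a minimal sketch of the annealed update, assuming the same hypothetical log-density helpers as in the earlier sketches, only the contribution of the previous weight changes:

def annealed_log_weight(log_w_prev, log_lik_n, log_trans_n, log_prop_n, alpha=0.5):
    # 0 < alpha < 1 damps the influence of the previous importance weights
    return alpha * log_w_prev + log_lik_n + log_trans_n - log_prop_n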

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material

Otherwise, do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

\[
W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
\]

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
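A quick numerical illustration of this degeneracy (my own sketch, not from the lecture): if the incremental log-weights are noisy, the largest normalized weight drifts towards 1 as time goes on, leaving effectively a single useful particle.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 500, 50
logW = np.zeros(N)                       # log-weights, all equal at n = 0

for n in range(T):
    # stand-in for log of p(y_n|x_n) p(x_n|x_{n-1}) / q(...): random per particle
    logW += rng.normal(0.0, 1.0, size=N)
    W = np.exp(logW - logW.max())        # numerically stabilized exponentiation
    W_tilde = W / W.sum()
    if n % 10 == 0:
        print(n, W_tilde.max())          # largest normalized weight creeps towards 1
```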

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

\[
N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\!\left[(W(x_{n:0}))^2\right]} \le N
\]

In practice this cannot be computed, so we approximate

\[
N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \left(\tilde{W}(x_{n:0}^{(i)})\right)^2}
\]

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
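In code the approximation is a one-liner; below is a small helper (illustrative only, with the threshold left to the user) for deciding when to resample.

```python
import numpy as np

def effective_sample_size(W):
    """N_eff approximation from normalized weights W (assumed to sum to 1)."""
    return 1.0 / np.sum(W ** 2)

def should_resample(W, threshold):
    # A common heuristic is threshold = N / 2, but this is a tuning choice.
    return effective_sample_size(W) < threshold
```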

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W_n^{(i)} ← W_n^{(i)} / Σ_{j=1}^{N} W_n^{(j)}
    end for
    Compute N_eff = 1 / Σ_{i=1}^{N} (W_n^{(i)})^2
    if N_eff < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
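For completeness, here is a compact runnable sketch of this loop on a toy 1-D linear-Gaussian model (my own illustration, not code from the lecture); the model parameters and the resampling threshold are arbitrary, and the transition prior is used as the proposal so that the weight update reduces to the likelihood term.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D linear-Gaussian model (illustrative parameters):
#   x_n = 0.9 x_{n-1} + w_n,  w_n ~ N(0, 0.5^2)
#   y_n = x_n + v_n,          v_n ~ N(0, 1.0^2)
A, q_std, r_std = 0.9, 0.5, 1.0

def lik(y, x):                                    # p(y_n | x_n), up to a constant
    return np.exp(-0.5 * ((y - x) / r_std) ** 2)

# Simulate some data
T = 50
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = A * x_true[n - 1] + q_std * rng.normal()
    y[n] = x_true[n] + r_std * rng.normal()

# SIS particle filter with resampling
N = 1000
particles = rng.normal(0.0, 1.0, size=N)          # x_0^(i) ~ p(x_0)
W = np.full(N, 1.0 / N)
P_thresh = N / 2                                  # resampling threshold P

for n in range(1, T):
    # Proposal = transition prior, so p(x_n|x_{n-1}) / q(...) cancels in the weight
    particles = A * particles + q_std * rng.normal(size=N)
    W = W * lik(y[n], particles)
    W = W / W.sum()                               # normalize

    n_eff = 1.0 / np.sum(W ** 2)
    if n_eff < P_thresh:                          # resample with replacement, reset weights
        idx = rng.choice(N, size=N, replace=True, p=W)
        particles = particles[idx]
        W = np.full(N, 1.0 / N)

    x_hat = np.sum(W * particles)                 # MMSE estimate at time n
print("final estimate:", x_hat, "true state:", x_true[-1])
```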

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: W_n^{(i)} ← W_n^{(i)} / Σ_{j=1}^{N} W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the current set {x_n^{(i)}} according to the weights W_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
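Relative to the SIS sketch shown after the resampling algorithm above, the bootstrap/SIR variant changes only the per-step logic: weights are rebuilt from the likelihood alone and resampling happens at every step. A minimal per-step function (with hypothetical propagate and likelihood callables) might look as follows.

```python
import numpy as np

def sir_step(particles, y_n, propagate, likelihood, rng):
    """One bootstrap/SIR step: propagate with the transition prior,
    weight by the likelihood only, then resample at every step."""
    N = len(particles)
    particles = propagate(particles, rng)      # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = likelihood(y_n, particles)             # W_n^(i) = p(y_n | x_n^(i))
    W = W / np.sum(W)
    idx = rng.choice(N, size=N, replace=True, p=W)
    return particles[idx]                      # equally weighted after resampling
```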

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

\[
W_n = W_{n-1}^{\alpha}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
\]

where 0 < α < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
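A small sketch of the annealed update (my own illustration; alpha is a tuning constant, not a value from the slides): it only changes one line of the weight computation shown earlier.

```python
import numpy as np

def annealed_weight_update(W_prev, incremental_weight, alpha=0.5):
    """Annealed SIS update: W_n = W_{n-1}^alpha * incremental weight, with
    incremental_weight = p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n)
    and 0 < alpha < 1 damping the influence of past weights."""
    W = (W_prev ** alpha) * incremental_weight
    return W / np.sum(W)
```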

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Page 3: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Contents of todayrsquos lecture

We will look into

Stochastic Filtering Theory Kalman filtering (1940rsquos by Wiener andKolmogorov)

Bayesian Theory and Bayesian Filtering (Bayes 1763 and rediscoverby Laplace)

Monte Carlo methods and Monte Carlo Filtering (Buffon 1777modern version in the 1940rsquos in physics and 1950rsquos in statistics)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 3 69

Monte Carlo approaches

Monte Carlo techniques are stochastic sampling approaches aiming totackle complex systems that are analytically intractable

Sequential Monte Carlo allows on-line estimation by combining MonteCarlo sampling methods with Bayesian inference

Particle filter sequential Monte Carlo used for parameter estimation andstate estimation

Particle filter uses a number of independent random variables calledparticles sampled directly from the state space to represent theposterior probabilityand update the posterior by involving the new observationsthe particle system is properly located weighted and propagatedrecursively according to the Bayesian rule

Particle filters is not the only way to tackle Bayesian filtering egdifferential geometry variational methods conjugate methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 4 69

A few words on Particle Filters

Kalman filtering is a special case of Bayesian filtering with linearquadratic and Gaussian assumptions (LQG)

We will look into the more general case of non-linear non-Gaussianand non-stationary distributions

Generally for non-linear filtering no exact solution can be computedhence we rely on numerical approximation methods

We will focus on sequential Monte Carlo (ie particle filter)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 5 69

Notation

y mdash the observationsx mdash the state

N mdash number of samplesyn0 mdash observations up to time n

xn0 mdash state up to time n

x(i)n mdash i-th sample at time n

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 6 69

Concept of sampling

The true distribution P(x) can be approximated by an empirical distribution

P(x) =1

N

Nsumi=1

δ(xminus x(i))

whereint

XdP(x) =

intX

p(x)dx = 1

Figure Sample approximation to the density of prob distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Stochastic filtering problem

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying

Figure A graphical model of the state space model (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

$$
E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}
$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n \mid y_{n:0})$ we can approximate

$$
E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})}
= \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)
$$

where the normalized weights are defined as

$$
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
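
A minimal self-normalized importance sampling sketch of this estimator (an illustration, not from the slides); the target and proposal densities below are arbitrary Gaussians chosen only so the example runs:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 10_000
x = rng.normal(0.0, 1.0, size=N)                       # x^(i) ~ q = N(0, 1)
w = gauss_pdf(x, 1.0, 0.5) / gauss_pdf(x, 0.0, 1.0)    # unnormalized weights W(x^(i))
w_tilde = w / w.sum()                                  # normalized weights

f = lambda x: np.sin(x) ** 2
f_hat = np.sum(w_tilde * f(x))                         # sum_i W~(x^(i)) f(x^(i))
print(f_hat)
```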

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

$$
q(x_{n:0} \mid y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})
$$

As before, the posterior can be written as

$$
p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}
$$

We can then create a recursive rule to update the weights

$$
W_n^{(i)} = \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})}
\propto \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)} \mid y_{n-1:0})}
= W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}
$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
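
This recursion can be coded directly. Below is a small sketch (the callables lik_pdf, trans_pdf and prop_pdf are hypothetical placeholders for $p(y_n \mid x_n)$, $p(x_n \mid x_{n-1})$ and $q(x_n \mid x_{n-1}, y_n)$, to be supplied by the model; the example at the bottom uses the transition prior as proposal, so the update reduces to multiplying by the likelihood):

```python
import numpy as np

def update_weights(w_prev, x_new, x_prev, y_new, lik_pdf, trans_pdf, prop_pdf):
    """Recursive importance-weight update for a set of particles.

    w_prev: previous normalized weights, shape (N,)
    x_new, x_prev: particles at times n and n-1, shape (N,)
    lik_pdf(y, x), trans_pdf(x, x_prev), prop_pdf(x, x_prev, y): vectorized densities
    (these callables are assumptions of this sketch, not part of the slides).
    """
    w = w_prev * lik_pdf(y_new, x_new) * trans_pdf(x_new, x_prev) \
               / prop_pdf(x_new, x_prev, y_new)
    return w / w.sum()          # keep the weights normalized

rng = np.random.default_rng(0)
x_prev = rng.normal(size=5)
x_new = x_prev + rng.normal(scale=0.5, size=5)
gauss = lambda d, s: np.exp(-0.5 * (d / s) ** 2)
w = update_weights(np.full(5, 0.2), x_new, x_prev, 0.3,
                   lik_pdf=lambda y, x: gauss(y - x, 1.0),
                   trans_pdf=lambda x, xp: gauss(x - xp, 0.5),
                   prop_pdf=lambda x, xp, y: gauss(x - xp, 0.5))
print(w)
```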

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n \mid y_{n:0})$ than in $p(x_{n:0} \mid y_{n:0})$.

Let's assume that $q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)$; then we can write

$$
W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}
$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

$$
N_{\text{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N
$$

In practice this cannot be computed, so we approximate it as

$$
N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{n:0}^{(i)})\big)^2}
$$

When $N_{\text{eff}}$ falls below a threshold $P$, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
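
A small sketch of this approximate effective sample size computed from normalized weights (illustrative only; the weight vectors below are made up to show the balanced and degenerate cases):

```python
import numpy as np

def effective_sample_size(w_tilde):
    """N_eff ~ 1 / sum_i W~(x^(i))^2, computed from normalized weights."""
    return 1.0 / np.sum(np.asarray(w_tilde) ** 2)

print(effective_sample_size([0.2, 0.2, 0.2, 0.2, 0.2]))       # 5.0: balanced weights
print(effective_sample_size([0.96, 0.01, 0.01, 0.01, 0.01]))  # ~1.08: nearly degenerate
```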

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample $x_n^{(i)} \sim q(x_n \mid x_{n-1:0}^{(i)}, y_{n:0})$
        Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate the weight $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
    end for
    Compute $N_{\text{eff}} = \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}$
    if $N_{\text{eff}} < P$ then
        Generate new $x_{n:0}^{(j)}$ by resampling with replacement $N$ times from $\{x_{n:0}^{(i)}\}$ with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}$
        Reset the weights: $W_n^{(i)} = \frac{1}{N}$
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
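
A Python sketch of one iteration of this algorithm (the callables propose, lik_pdf, trans_pdf and prop_pdf are placeholder names for the proposal sampler and the three densities, to be supplied by the model; they are assumptions of this sketch, not part of the slides):

```python
import numpy as np

def sis_step(particles, w_tilde, y_new, rng,
             propose, lik_pdf, trans_pdf, prop_pdf, threshold):
    """One SIS iteration with resampling triggered by the effective sample size."""
    N = particles.shape[0]

    # Draw x_n^(i) ~ q(x_n | x_{n-1}^(i), y_n)
    new_particles = propose(particles, y_new, rng)

    # Update and normalize the importance weights
    w = w_tilde * lik_pdf(y_new, new_particles) \
                * trans_pdf(new_particles, particles) \
                / prop_pdf(new_particles, particles, y_new)
    w_tilde = w / w.sum()

    # Resample with replacement only when N_eff < threshold, then reset weights to 1/N
    if 1.0 / np.sum(w_tilde ** 2) < threshold:
        idx = rng.choice(N, size=N, p=w_tilde)     # multinomial resampling
        new_particles, w_tilde = new_particles[idx], np.full(N, 1.0 / N)

    return new_particles, w_tilde
```

Calling sis_step in a loop over the observations, with weights initialized to $1/N$, reproduces the pseudocode above.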

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p(x_n \mid x_{n-1}^{(i)})$
    end for
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p(y_n \mid x_n^{(i)})$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
    end for
    Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
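
A compact end-to-end sketch of this bootstrap filter on a toy scalar model (the model, its noise levels and the particle count are assumptions made up for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_{n+1} = 0.9 x_n + w_n, w_n ~ N(0, 0.5^2);  y_n = x_n + v_n, v_n ~ N(0, 1)
T, N = 50, 1000
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, 0.5)
    y[n] = x_true[n] + rng.normal(0.0, 1.0)

particles = rng.normal(0.0, 1.0, size=N)        # x_0^(i) ~ p(x_0)
estimates = np.zeros(T)
for n in range(T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}^(i))
    particles = 0.9 * particles + rng.normal(0.0, 0.5, size=N)
    # Weight update: with the transition prior as proposal, W_n^(i) = p(y_n | x_n^(i))
    w = np.exp(-0.5 * (y[n] - particles) ** 2)
    w /= w.sum()
    estimates[n] = np.sum(w * particles)        # posterior mean, taken before resampling
    # Resampling with replacement according to the normalized weights
    particles = particles[rng.choice(N, size=N, p=w)]

print("filtering MSE:", np.mean((estimates - x_true) ** 2))
```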

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

$$
W_n^{(i)} = \big(W_{n-1}^{(i)}\big)^{\alpha}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}
$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
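
A one-function sketch of this annealed update (an illustration; the per-particle density values are assumed to be precomputed NumPy arrays):

```python
import numpy as np

def annealed_weight_update(w_prev, lik, trans, prop, alpha=0.5):
    """W_n^(i) = (W_{n-1}^(i))^alpha * p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(...),
    with 0 < alpha < 1 damping the influence of the previous weights."""
    w = (np.asarray(w_prev) ** alpha) * lik * trans / prop
    return w / w.sum()

# Example: alpha < 1 pulls skewed previous weights back toward uniform
w_prev = np.array([0.9, 0.05, 0.05])
print(annealed_weight_update(w_prev, lik=np.ones(3), trans=np.ones(3), prop=np.ones(3)))
```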

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 4: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Monte Carlo approaches

Monte Carlo techniques are stochastic sampling approaches aiming totackle complex systems that are analytically intractable

Sequential Monte Carlo allows on-line estimation by combining MonteCarlo sampling methods with Bayesian inference

Particle filter sequential Monte Carlo used for parameter estimation andstate estimation

Particle filter uses a number of independent random variables calledparticles sampled directly from the state space to represent theposterior probabilityand update the posterior by involving the new observationsthe particle system is properly located weighted and propagatedrecursively according to the Bayesian rule

Particle filters is not the only way to tackle Bayesian filtering egdifferential geometry variational methods conjugate methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 4 69

A few words on Particle Filters

Kalman filtering is a special case of Bayesian filtering with linearquadratic and Gaussian assumptions (LQG)

We will look into the more general case of non-linear non-Gaussianand non-stationary distributions

Generally for non-linear filtering no exact solution can be computedhence we rely on numerical approximation methods

We will focus on sequential Monte Carlo (ie particle filter)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 5 69

Notation

y mdash the observationsx mdash the state

N mdash number of samplesyn0 mdash observations up to time n

xn0 mdash state up to time n

x(i)n mdash i-th sample at time n

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 6 69

Concept of sampling

The true distribution P(x) can be approximated by an empirical distribution

P(x) =1

N

Nsumi=1

δ(xminus x(i))

whereint

XdP(x) =

intX

p(x)dx = 1

Figure Sample approximation to the density of prob distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Stochastic filtering problem

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying

Figure A graphical model of the state space model (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]


Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N}W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N}W_n(x_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N}W_n(x_n^{(j)})}$$

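A small Python sketch of this self-normalized estimator; the unnormalized target, the Gaussian proposal, and $f(x) = x$ are toy choices made only for the example, not anything prescribed by the slides.

import numpy as np

# Self-normalized importance sampling: E[f(x)] ~= sum_i Wtilde^(i) f(x^(i)),
# with W^(i) = p(x^(i)) / q(x^(i)) evaluated only up to a constant.
rng = np.random.default_rng(1)

target_unnorm = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)                  # unnormalized "p", an N(2, 1) shape
proposal_pdf = lambda x: np.exp(-x ** 2 / 18.0) / np.sqrt(18.0 * np.pi)  # N(0, 3^2) density

N = 20_000
x = rng.normal(0.0, 3.0, size=N)          # x^(i) ~ q
w = target_unnorm(x) / proposal_pdf(x)    # unnormalized weights W^(i)
w_tilde = w / w.sum()                     # normalized weights Wtilde^(i)

f = lambda x: x
print("self-normalized IS estimate of E[x]:", np.sum(w_tilde * f(x)))    # should be close to 2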

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n}q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written recursively as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$$


Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc


Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.


Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\left[(W(x_{n:0}))^2\right]} \le N$$

In practice this cannot be computed, so we approximate it as

$$N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N}\left(\tilde{W}(x_{n:0}^{(i)})\right)^2}$$

When $N_{\mathrm{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.

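A tiny Python sketch of this degeneracy diagnostic; the weight vectors and the threshold $P = N/2$ below are invented purely to show the computation.

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~= 1 / sum_i (w_tilde_i)^2 for normalized weights w_tilde.
    return 1.0 / np.sum(np.square(w_tilde))

N = 100
uniform = np.full(N, 1.0 / N)             # evenly spread weights -> N_eff = N
skewed = np.full(N, 0.01 / (N - 1))       # one dominant particle -> N_eff close to 1
skewed[0] = 0.99

print(effective_sample_size(uniform))     # 100.0
print(effective_sample_size(skewed))      # roughly 1.02

P = N / 2                                 # an (assumed) common threshold choice
print("resample?", effective_sample_size(skewed) < P)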

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) \big/ \sum_{j=1}^{N}W(x^{(j)})$
  end for
  Compute $N_{\mathrm{eff}} = 1 \big/ \sum_{i=1}^{N}(\tilde{W}(x_{n:0}^{(i)}))^2$
  if $N_{\mathrm{eff}} < P$ then
    Generate new $x_{n:0}^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$ with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for

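Below is one possible runnable rendering of this algorithm as a Python sketch. The 1-D random-walk model, the Gaussian proposal centred between prediction and observation, and the threshold $P = N/2$ are all assumptions made for the example; the weight update and the $N_{\mathrm{eff}}$-triggered resampling follow the box above.

import numpy as np

rng = np.random.default_rng(2)

def log_gauss(z, mu, sigma):
    # log of a univariate Gaussian density N(z; mu, sigma^2)
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def sis_filter(ys, N=500, sigma_trans=1.0, sigma_obs=0.5, resample_frac=0.5):
    # SIS particle filter with resampling for x_n = x_{n-1} + w_n, y_n = x_n + v_n.
    x = rng.normal(0.0, 1.0, size=N)              # particles x_0^(i) ~ p(x_0)
    w = np.full(N, 1.0 / N)                       # normalized weights
    means = np.empty(len(ys))

    for n, y in enumerate(ys):
        # Proposal q(x_n | x_{n-1}, y_n): a Gaussian between prediction and
        # observation (an arbitrary but valid choice for this sketch).
        mu_q, sigma_q = 0.5 * (x + y), 0.7
        x_new = rng.normal(mu_q, sigma_q)

        # Weight update: W_n = W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
        log_w = (np.log(np.maximum(w, 1e-300))
                 + log_gauss(y, x_new, sigma_obs)
                 + log_gauss(x_new, x, sigma_trans)
                 - log_gauss(x_new, mu_q, sigma_q))
        log_w -= log_w.max()
        w = np.exp(log_w)
        w /= w.sum()
        x = x_new

        means[n] = np.sum(w * x)                  # posterior mean estimate at time n

        # Resample only when N_eff falls below the threshold P = resample_frac * N.
        if 1.0 / np.sum(w ** 2) < resample_frac * N:
            idx = rng.choice(N, size=N, p=w)      # multinomial resampling
            x, w = x[idx], np.full(N, 1.0 / N)
    return means

# Toy data: a random walk observed in Gaussian noise.
true_x = np.cumsum(rng.normal(0.0, 1.0, size=50))
ys = true_x + rng.normal(0.0, 0.5, size=50)
print(np.round(sis_filter(ys)[:5], 2))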

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

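Because plain multinomial resampling adds extra Monte Carlo variation, lower-variance schemes are often used in practice. The sketch below shows systematic resampling in Python; the specific scheme and the toy weight vector are assumptions made for illustration, the slides do not prescribe one.

import numpy as np

def systematic_resample(w_tilde, rng):
    # Systematic resampling: one uniform offset and N evenly spaced points
    # through the CDF of the normalized weights, so each particle is copied
    # close to N * w_i times; this adds less variance than multinomial resampling.
    N = len(w_tilde)
    positions = (rng.uniform() + np.arange(N)) / N
    cdf = np.cumsum(w_tilde)
    cdf[-1] = 1.0                      # guard against floating-point round-off
    return np.searchsorted(cdf, positions)

rng = np.random.default_rng(3)
w = np.array([0.5, 0.3, 0.1, 0.1])
idx = systematic_resample(w, rng)
print(idx)                             # indices copied roughly in proportion to the weights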

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $\hat{x}_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $\hat{x}_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) \big/ \sum_{j=1}^{N}W(x^{(j)})$
  end for
  Resampling: Generate N new particles $x_n^{(i)}$ from the set $\{\hat{x}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for

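To make the box above concrete, here is a small self-contained bootstrap/SIR run on a 1-D linear-Gaussian toy model in Python; the model coefficients, noise levels, and particle count are invented for the example. With the transition prior as proposal, the weight update collapses to the likelihood, and resampling is performed at every step.

import numpy as np

rng = np.random.default_rng(4)

# Assumed toy state-space model:
#   x_n = 0.9 * x_{n-1} + w_n,  w_n ~ N(0, 1)        (state equation)
#   y_n = x_n + v_n,            v_n ~ N(0, 0.5^2)    (measurement equation)
T, N = 60, 1000
a, sigma_w, sigma_v = 0.9, 1.0, 0.5

true_x = np.zeros(T)
for n in range(1, T):
    true_x[n] = a * true_x[n - 1] + rng.normal(0.0, sigma_w)
ys = true_x + rng.normal(0.0, sigma_v, size=T)

# Bootstrap / SIR filter: proposal = transition prior, weight = likelihood,
# resampling performed at every time step.
particles = rng.normal(0.0, 1.0, size=N)              # x_0^(i) ~ p(x_0)
estimates = np.zeros(T)
for n in range(T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}^(i)).
    particles = a * particles + rng.normal(0.0, sigma_w, size=N)
    # Weight update W_n^(i) = p(y_n | x_n^(i)), then normalization.
    w = np.exp(-0.5 * ((ys[n] - particles) / sigma_v) ** 2)
    w /= w.sum()
    # Posterior estimate computed before resampling, as the remarks below recommend.
    estimates[n] = np.sum(w * particles)
    # Multinomial resampling according to the normalized weights.
    particles = particles[rng.choice(N, size=N, p=w)]

print("RMSE of filtered mean:", np.sqrt(np.mean((estimates - true_x) ** 2)))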

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)


Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

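A minimal sketch of this annealed update in Python, assuming per-particle likelihood, transition, and proposal values are already available; the value $\alpha = 0.5$ and the toy inputs are arbitrary.

import numpy as np

def annealed_weight_update(w_prev, lik, trans, prop, alpha=0.5):
    # W_n = W_{n-1}^alpha * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n),
    # followed by normalization; alpha = 1 recovers the standard SIS update.
    w = (w_prev ** alpha) * lik * trans / prop
    return w / np.sum(w)

# Toy per-particle quantities just to show the call.
print(annealed_weight_update(w_prev=np.array([0.7, 0.2, 0.1]),
                             lik=np.array([0.5, 1.0, 2.0]),
                             trans=np.ones(3),
                             prop=np.ones(3)))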

Popular CONDENSATION

Figure CONDENSATION


[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking. Figure Leaf tracking.


[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking. Figure Hand drawing.


[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking. Figure Interactive applications.


[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!


  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sampling Importance Resampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 5: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

A few words on Particle Filters

Kalman filtering is a special case of Bayesian filtering with linearquadratic and Gaussian assumptions (LQG)

We will look into the more general case of non-linear non-Gaussianand non-stationary distributions

Generally for non-linear filtering no exact solution can be computedhence we rely on numerical approximation methods

We will focus on sequential Monte Carlo (ie particle filter)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 5 69

Notation

y mdash the observationsx mdash the state

N mdash number of samplesyn0 mdash observations up to time n

xn0 mdash state up to time n

x(i)n mdash i-th sample at time n

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 6 69

Concept of sampling

The true distribution P(x) can be approximated by an empirical distribution

P(x) =1

N

Nsumi=1

δ(xminus x(i))

whereint

XdP(x) =

intX

p(x)dx = 1

Figure Sample approximation to the density of prob distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Stochastic filtering problem

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying

Figure A graphical model of the state space model (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
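
As a concrete illustration of this self-normalized importance sampling estimator, here is a minimal NumPy sketch. The target density p, the Gaussian proposal q and the test function f are toy choices invented for this example; only the weighting and normalization steps mirror the formulas above.

import numpy as np

rng = np.random.default_rng(0)

p = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)  # unnormalized target
q_sample = lambda n: rng.normal(0.0, 3.0, size=n)                                  # proposal sampler
q_pdf = lambda x: np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))     # proposal density
f = lambda x: x ** 2                                                               # function of interest

N = 10_000
x = q_sample(N)                    # i.i.d. samples from the proposal
W = p(x) / q_pdf(x)                # unnormalized importance weights W_n(x^{(i)})
W_tilde = W / W.sum()              # normalized weights
f_hat = np.sum(W_tilde * f(x))     # self-normalized estimate of E[f(x)]
print(f_hat)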

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
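
A one-step version of this weight recursion, written as a small helper with the three densities already evaluated at the current particle (the function and argument names are illustrative, not from the lecture):

def sis_weight_update(W_prev, lik, trans, prop):
    # W_n = W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_{n:0})
    return W_prev * lik * trans / prop

# e.g. sis_weight_update(0.01, lik=0.3, trans=0.5, prop=0.4) -> 0.00375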

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
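
A minimal sketch of this degeneracy check, assuming w_tilde holds the normalized importance weights (the names and the example values are illustrative):

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff approximation from the slide: 1 / sum_i (w_tilde_i)^2
    return 1.0 / np.sum(np.asarray(w_tilde) ** 2)

print(effective_sample_size(np.full(100, 1 / 100)))       # equal weights -> 100.0
print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # one dominant weight -> close to 1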

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
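
The algorithm above can be written compactly in NumPy. The sketch below is a generic SIS filter with multinomial resampling for a scalar state; the model is supplied through user-provided callables (sample_prior, sample_proposal, proposal_pdf, transition_pdf, likelihood and the other names are illustrative assumptions, not part of the lecture), and the proposal is assumed to depend only on the previous state and the current observation.

import numpy as np

def sis_particle_filter(ys, N, sample_prior, sample_proposal, proposal_pdf,
                        transition_pdf, likelihood, threshold=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    threshold = N / 2.0 if threshold is None else threshold   # resampling threshold P
    x = sample_prior(N, rng)                                  # particles x_0^{(i)}
    w = np.full(N, 1.0 / N)                                   # normalized weights
    means = []
    for y in ys:
        x_new = sample_proposal(x, y, rng)                    # x_n^{(i)} ~ q(x_n | x_{n-1}^{(i)}, y_n)
        # weight update: W_n = W_{n-1} p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
        w = w * likelihood(y, x_new) * transition_pdf(x_new, x) / proposal_pdf(x_new, x, y)
        w = w / np.sum(w)                                     # normalize
        means.append(np.sum(w * x_new))                       # posterior mean, computed before resampling
        if 1.0 / np.sum(w ** 2) < threshold:                  # N_eff below threshold -> resample
            idx = rng.choice(N, size=N, p=w)                  # multinomial resampling with replacement
            x_new = x_new[idx]
            w = np.full(N, 1.0 / N)                           # reset weights to 1/N
        x = x_new
    return np.array(means)

Choosing the transition prior as the proposal makes the weight update collapse to the likelihood alone, which is exactly the bootstrap/SIR filter described next.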

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in sampling-importance resampling (SIR).

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
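
As a concrete, made-up example of this bootstrap filter, the sketch below tracks a scalar random-walk state using the transition prior as proposal and a Gaussian likelihood; the model and all its parameters are illustrative assumptions, not part of the lecture.

import numpy as np

rng = np.random.default_rng(1)

T, N = 50, 500
sigma_w, sigma_v = 0.5, 1.0
x_true = np.cumsum(rng.normal(0.0, sigma_w, T))            # latent random-walk trajectory
ys = x_true + rng.normal(0.0, sigma_v, T)                  # noisy observations y_n

particles = rng.normal(0.0, 1.0, N)                        # x_0^{(i)} ~ p(x_0)
estimates = []
for y in ys:
    particles = particles + rng.normal(0.0, sigma_w, N)    # propagate through the transition prior
    w = np.exp(-0.5 * ((y - particles) / sigma_v) ** 2)    # weight = likelihood p(y_n | x_n^{(i)})
    w /= np.sum(w)                                         # normalize
    estimates.append(np.sum(w * particles))                # filtered mean, computed before resampling
    idx = rng.choice(N, size=N, p=w)                       # resample with replacement at every step
    particles = particles[idx]

print(np.mean((np.array(estimates) - x_true) ** 2))        # rough tracking error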

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
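
In code, this annealing only changes how the previous weight enters the update (a sketch; the function name and default alpha are illustrative choices):

def annealed_sis_weight_update(W_prev, lik, trans, prop, alpha=0.5):
    # Damp the previous weight with exponent 0 < alpha < 1 before the usual SIS update.
    return (W_prev ** alpha) * lik * trans / prop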

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]
[Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]
[Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]
[Video: drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sampling Importance Resampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 6: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Notation

y mdash the observationsx mdash the state

N mdash number of samplesyn0 mdash observations up to time n

xn0 mdash state up to time n

x(i)n mdash i-th sample at time n

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 6 69

Concept of sampling

The true distribution P(x) can be approximated by an empirical distribution

P(x) =1

N

Nsumi=1

δ(xminus x(i))

whereint

XdP(x) =

intX

p(x)dx = 1

Figure Sample approximation to the density of prob distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Stochastic filtering problem

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying

Figure A graphical model of the state space model (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n \mid y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n \mid y_{n:0})$$

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n \mid y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
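As a concrete illustration of the point-mass approximation above, here is a minimal Python sketch (not from the slides): the Gaussian posterior we sample from and the choice f(x) = x^2 are assumptions made purely so the Monte Carlo estimate can be checked against a known answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume, for illustration only, that p(x_n | y_{n:0}) is a 1-D Gaussian we can
# sample from; in a particle filter these samples come from the filter itself.
samples = rng.normal(loc=1.0, scale=0.5, size=5000)   # x_n^(i), i = 1..N

f = lambda x: x**2                                     # nonlinear function of interest

# point-mass estimate:  E[f(x_n)] ~ (1/N) sum_i f(x_n^(i))
f_hat = f(samples).mean()
print(f_hat)                                           # close to mu^2 + sigma^2 = 1.25
```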


Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

$$W_n(x_n) = \frac{p(y_{n:0} \mid x_n)\,p(x_n)}{q(x_n \mid y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})}\,q(x_n \mid y_{n:0})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n \mid y_{n:0})\,dx_n$$

$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}{\int p(y_{n:0} \mid x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n} = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
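The identity above is exactly self-normalized importance sampling. A small sketch of my own; the Gaussian target (standing in for likelihood-times-prior) and the wider Gaussian proposal are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 10_000

x = rng.normal(loc=0.0, scale=2.0, size=N)            # samples from the proposal q

# unnormalized weights W_n(x) = p(y|x) p(x) / q(x); here an N(1, 0.5^2) density
# stands in for likelihood-times-prior (an assumption for this sketch)
w = norm.pdf(x, loc=1.0, scale=0.5) / norm.pdf(x, loc=0.0, scale=2.0)

f_hat = np.sum(w * x**2) / np.sum(w)                   # E_q[W f] / E_q[W], with f(x) = x^2
print(f_hat)                                           # close to 1.0^2 + 0.5^2 = 1.25
```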


Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^{N} \tilde{W}\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^{N} W_n\big(x_n^{(j)}\big)}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
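A quick numerical check (my own, with synthetic weights and function values) that the ratio form and the normalized-weight form of the estimator coincide:

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.lognormal(size=1000)                 # unnormalized weights W_n(x^(i)) (synthetic)
f = rng.normal(size=1000)                    # f(x^(i)) at the samples (synthetic)

ratio_form = (w * f).mean() / w.mean()       # (1/N sum W f) / (1/N sum W)
normalized = np.sum((w / w.sum()) * f)       # sum_i W~(x^(i)) f(x^(i))
print(np.isclose(ratio_form, normalized))    # True
```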

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{n:0} \mid y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p\big(x_{n:0}^{(i)} \mid y_{n:0}\big)}{q\big(x_{n:0}^{(i)} \mid y_{n:0}\big)} \propto \frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)\,p\big(x_{n-1:0}^{(i)} \mid y_{n-1:0}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}\big)\,q\big(x_{n-1:0}^{(i)} \mid y_{n-1:0}\big)} = W_{n-1}^{(i)}\,\frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}\big)}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
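The recursive update can be wrapped in a small helper; the Gaussian transition, Gaussian likelihood, and transition-prior proposal below are placeholder choices for illustration, not anything prescribed by the slides.

```python
import numpy as np
from scipy.stats import norm

def update_weights(W_prev, x_prev, x_new, y_new, trans_pdf, lik_pdf, prop_pdf):
    """W_n proportional to W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n), then normalized."""
    w = W_prev * lik_pdf(y_new, x_new) * trans_pdf(x_new, x_prev) / prop_pdf(x_new, x_prev, y_new)
    return w / w.sum()

# illustrative 1-D Gaussian model (an assumption for this sketch)
trans_pdf = lambda x, xp: norm.pdf(x, loc=0.9 * xp, scale=1.0)   # p(x_n | x_{n-1})
lik_pdf   = lambda y, x:  norm.pdf(y, loc=x, scale=0.5)          # p(y_n | x_n)
prop_pdf  = lambda x, xp, y: trans_pdf(x, xp)                    # q chosen as the transition prior

rng = np.random.default_rng(2)
N = 200
x_prev = rng.normal(size=N)
x_new = 0.9 * x_prev + rng.normal(scale=1.0, size=N)             # draw from the proposal
W = update_weights(np.full(N, 1.0 / N), x_prev, x_new, y_new=0.3,
                   trans_pdf=trans_pdf, lik_pdf=lik_pdf, prop_pdf=prop_pdf)
print(W.sum(), W.max())                                          # weights sum to 1
```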

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n\big)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
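The degeneracy is easy to reproduce numerically; in this toy simulation of my own, random log-increments stand in for the likelihood terms p(y_n | x_n^(i)):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 50

logW = np.zeros(N)                          # uniform log-weights at n = 0
for n in range(T):
    logW += rng.normal(scale=1.0, size=N)   # random incremental log-weight per particle

W = np.exp(logW - logW.max())
W /= W.sum()
print(W.max(), 1.0 / np.sum(W**2))          # max weight close to 1, N_eff of only a few particles
```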

Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it as

$$\hat{N}_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When N_eff falls below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
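The approximation above is essentially one line of code; a small helper of my own:

```python
import numpy as np

def effective_sample_size(W):
    """N_eff ~ 1 / sum_i W_i^2 for (normalized) importance weights W."""
    W = np.asarray(W, dtype=float)
    W = W / W.sum()                        # ensure normalization
    return 1.0 / np.sum(W**2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))   # 4.0   (uniform weights -> N)
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))   # ~1.06 (degenerate -> about 1)
```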

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weight W_n^(i) = W_{n-1}^(i) * p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W~_n^(i) = W_n^(i) / sum_{j=1}^N W_n^(j)
    end for
    Compute N_eff = 1 / sum_{i=1}^N (W~_n^(i))^2
    if N_eff < P then
        Generate new particles x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)},
            with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W~_n^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
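For concreteness, here is a compact runnable sketch of this SIS filter with adaptive resampling on a toy 1-D linear-Gaussian model; the model parameters, the transition prior as proposal, and the threshold P = N/2 are all illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# toy 1-D linear-Gaussian model (assumed): x_n = a x_{n-1} + w_n,  y_n = x_n + v_n
a, q_std, r_std = 0.9, 1.0, 0.5
T, N = 50, 1000

x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = a * x_true[n - 1] + rng.normal(scale=q_std)
    y[n] = x_true[n] + rng.normal(scale=r_std)

particles = rng.normal(size=N)              # samples from p(x_0)
W = np.full(N, 1.0 / N)
estimates = np.zeros(T)

for n in range(1, T):
    # draw from the proposal q = p(x_n | x_{n-1})
    particles = a * particles + rng.normal(scale=q_std, size=N)
    # weight update: transition and proposal cancel, so W_n is proportional to W_{n-1} p(y_n | x_n)
    W *= norm.pdf(y[n], loc=particles, scale=r_std)
    W /= W.sum()
    # posterior (MMSE) estimate, computed before any resampling
    estimates[n] = np.sum(W * particles)
    # resample only when N_eff drops below the threshold P
    if 1.0 / np.sum(W**2) < N / 2:
        particles = particles[rng.choice(N, size=N, p=W)]
        W = np.full(N, 1.0 / N)

print(np.sqrt(np.mean((estimates - x_true) ** 2)))   # RMSE of the filtered estimates
```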

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
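The resampling step itself takes only a few lines; below is a sketch of my own showing two common schemes. Systematic resampling is a frequently used alternative because it introduces less Monte Carlo variance than plain multinomial resampling.

```python
import numpy as np

def multinomial_resample(particles, W, rng):
    """Draw N indices with replacement, with probability proportional to W."""
    idx = rng.choice(len(particles), size=len(particles), p=W)
    return particles[idx]

def systematic_resample(particles, W, rng):
    """Systematic resampling: one uniform offset shared by N evenly spaced strata."""
    N = len(W)
    c = np.cumsum(W)
    c[-1] = 1.0                                        # guard against round-off
    positions = (rng.random() + np.arange(N)) / N
    return particles[np.searchsorted(c, positions)]

# usage: particles = systematic_resample(particles, W, np.random.default_rng(0))
```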

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Set W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: draw x_n^(i) ~ p(x_n | x_{n-1}^(i))
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W~_n^(i) = W_n^(i) / sum_{j=1}^N W_n^(j)
    end for
    Resampling: generate N new particles x_n^(i) from the current set according to W~_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
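With the transition prior as the proposal, the generic weight update collapses to the likelihood alone and resampling is performed at every step. A one-step sketch under the same illustrative Gaussian model used earlier (again an assumption of mine, not the slides' model):

```python
import numpy as np
from scipy.stats import norm

def bootstrap_sir_step(particles, y, rng, a=0.9, q_std=1.0, r_std=0.5):
    # propagate through the transition prior p(x_n | x_{n-1})
    particles = a * particles + rng.normal(scale=q_std, size=particles.shape)
    # weight update reduces to the likelihood p(y_n | x_n)
    W = norm.pdf(y, loc=particles, scale=r_std)
    W /= W.sum()
    # resample with replacement at every step
    return particles[rng.choice(len(particles), size=len(particles), p=W)]

# usage: particles = bootstrap_sir_step(particles, y_n, np.random.default_rng(0))
```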

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to (a short sketch follows below)

$$W_n = W_{n-1}^{\alpha}\,\frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n\big)}$$

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
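A minimal sketch of the annealed update; the skewed previous weights and the log-normal incremental weights are synthetic, chosen only to show how α < 1 raises the effective sample size:

```python
import numpy as np

def annealed_weight_update(W_prev, incr, alpha):
    """W_n proportional to W_{n-1}^alpha * incremental weight; alpha in (0, 1] tempers the old weights."""
    w = W_prev**alpha * incr
    return w / w.sum()

rng = np.random.default_rng(4)
W_prev = rng.dirichlet(np.full(1000, 0.1))      # skewed previous weights (synthetic)
incr = rng.lognormal(sigma=1.0, size=1000)      # stand-in for p(y|x) p(x|x') / q(x|x', y)

for alpha in (1.0, 0.5):                        # alpha = 1 recovers the plain SIS update
    W = annealed_weight_update(W_prev, incr, alpha)
    print(alpha, 1.0 / np.sum(W**2))            # N_eff is noticeably larger for alpha = 0.5
```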

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[video: anim.mp4]

Popular CONDENSATION

Figure Head tracking. Figure Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking. Figure Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking. Figure Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sampling Importance Resampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 7: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Concept of sampling

The true distribution P(x) can be approximated by an empirical distribution

P(x) =1

N

Nsumi=1

δ(xminus x(i))

whereint

XdP(x) =

intX

p(x)dx = 1

Figure Sample approximation to the density of prob distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Stochastic filtering problem

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying

Figure A graphical model of the state space model (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
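
To make this estimator concrete, here is a minimal numerical sketch (not part of the lecture): it approximates E[f(x_n)] by averaging f over i.i.d. samples, assuming for illustration that the distribution is a standard normal we can sample from directly and that f(x) = x^2.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, n_samples=10_000):
    """Plain Monte Carlo estimate of E[f(x)] from i.i.d. samples."""
    samples = sampler(n_samples)   # x^(i) ~ p(x)
    return np.mean(f(samples))     # (1/N) sum_i f(x^(i))

# Toy example: p(x) = N(0, 1) and f(x) = x^2, so E[f(x)] = 1.
estimate = mc_expectation(lambda x: x**2, lambda n: rng.standard_normal(n))
print(estimate)  # close to 1.0 for large N
```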

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) \, p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
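
The ratio estimator above is the self-normalized importance sampling estimate. A minimal sketch follows (the Gaussian target, proposal, and test function are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def snis_expectation(f, log_target, log_proposal, proposal_sampler, n_samples=10_000):
    """Self-normalized importance sampling estimate of E_p[f(x)].
    log_target may be unnormalized: the constant cancels in the weight normalization."""
    x = proposal_sampler(n_samples)           # x^(i) ~ q(x)
    log_w = log_target(x) - log_proposal(x)   # unnormalized log-weights
    w = np.exp(log_w - np.max(log_w))         # subtract max for numerical stability
    w_tilde = w / np.sum(w)                   # normalized weights
    return np.sum(w_tilde * f(x))

# Toy example: target p = N(1, 1) (unnormalized), proposal q = N(0, 2^2), f(x) = x.
est = snis_expectation(
    f=lambda x: x,
    log_target=lambda x: -0.5 * (x - 1.0) ** 2,
    log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
    proposal_sampler=lambda n: 2.0 * rng.standard_normal(n),
)
print(est)  # close to 1.0
```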

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written recursively as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
         \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} | y_{n-1:0})}
         = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
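
In code, one step of this weight recursion is a single accumulation per particle. A small sketch (the function arguments are hypothetical log-densities evaluated at the sampled particle, not names from the lecture), done in log space for numerical stability:

```python
import numpy as np

def update_log_weight(log_w_prev, log_lik, log_trans, log_prop):
    """SIS weight recursion in log space:
    log W_n = log W_{n-1} + log p(y_n|x_n) + log p(x_n|x_{n-1}) - log q(x_n|x_{n-1:0}, y_{n:0})."""
    return log_w_prev + log_lik + log_trans - log_prop

def normalize_log_weights(log_w):
    """Turn unnormalized log-weights into weights that sum to one (log-sum-exp trick)."""
    w = np.exp(log_w - np.max(log_w))
    return w / np.sum(w)
```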

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
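
The effective sample size is straightforward to compute from the normalized weights. A small sketch (threshold choices such as P = N/2 are a common heuristic, not something fixed by the lecture):

```python
import numpy as np

def effective_sample_size(normalized_weights):
    """N_eff ~ 1 / sum_i w_i^2 for normalized importance weights."""
    w = np.asarray(normalized_weights)
    return 1.0 / np.sum(w ** 2)

# With equal weights N_eff = N; with one dominant weight N_eff approaches 1.
print(effective_sample_size(np.full(100, 1 / 100)))       # 100.0
print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # close to 1
```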

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new particles x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
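
The resampling step inside the loop above can be implemented, for instance, with simple multinomial resampling; the sketch below is one possible choice (systematic or stratified resampling are common lower-variance alternatives):

```python
import numpy as np

def multinomial_resample(particles, normalized_weights, rng=None):
    """Resample N particles with replacement according to their weights,
    then reset all weights to 1/N. `particles` is a NumPy array of shape (N, ...)."""
    rng = rng or np.random.default_rng()
    n = len(particles)
    idx = rng.choice(n, size=n, replace=True, p=normalized_weights)
    return particles[idx], np.full(n, 1.0 / n)
```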

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set \hat{x}_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
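
To make the bootstrap/SIR recipe concrete, here is a minimal end-to-end sketch on a toy one-dimensional model (the model, noise levels, and particle count are illustrative assumptions, not from the lecture): the transition prior is used as the proposal, the weight update reduces to the likelihood, and resampling is performed at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(ys, n_particles=500):
    """Bootstrap (SIR) particle filter for the toy model
    x_n = 0.5 * x_{n-1} + w_n,   y_n = x_n**2 / 20 + v_n,
    with w_n, v_n ~ N(0, 1). Returns the filtered posterior means."""
    x = rng.standard_normal(n_particles)          # x_0^(i) ~ p(x_0)
    means = []
    for y in ys:
        # Importance sampling from the transition prior (proposal = prior).
        x = 0.5 * x + rng.standard_normal(n_particles)
        # Weight update: W_n^(i) = p(y_n | x_n^(i)), a Gaussian likelihood.
        w = np.exp(-0.5 * (y - x**2 / 20.0) ** 2)
        w /= np.sum(w)
        means.append(np.sum(w * x))               # estimate of E[x_n | y_{n:0}]
        # Resampling with replacement according to the normalized weights.
        x = x[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(means)

# Simulate observations from the same toy model and run the filter.
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.5 * x_true + rng.standard_normal()
    ys.append(x_true**2 / 20.0 + rng.standard_normal())
print(bootstrap_filter(ys)[:5])
```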

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
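
As a tiny illustration of the annealed update (a sketch; the value of alpha is an arbitrary choice for demonstration), the only change relative to the plain SIS update is scaling the previous log-weight by alpha:

```python
def annealed_log_weight(log_w_prev, log_lik, log_trans, log_prop, alpha=0.5):
    """Annealed SIS update: W_n = W_{n-1}^alpha * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n)."""
    return alpha * log_w_prev + log_lik + log_trans - log_prop
```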

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n \mid y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n \mid y_{n:0})$$

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n \mid y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
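
To make this sample approximation concrete, here is a minimal Python sketch (the posterior, the test function f and all variable names are made up for illustration and are not part of the lecture): given N i.i.d. draws that stand in for samples from p(x_n | y_{n:0}), the posterior expectation of a nonlinear function is just the sample average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for N i.i.d. samples x_n^(i) from the posterior p(x_n | y_{n:0});
# here we simply pretend the posterior is N(1.0, 0.5^2).
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)

# Nonlinear function whose posterior expectation we want.
f = lambda x: np.sin(x) + x ** 2

# Monte Carlo estimate: E[f(x_n)] ~ (1/N) * sum_i f(x_n^(i))
f_hat = f(samples).mean()
print(f_hat)
```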

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

$$W_n(x_n) = \frac{p(y_{n:0} \mid x_n)\, p(x_n)}{q(x_n \mid y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})}\, q(x_n \mid y_{n:0})\, dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\, q(x_n \mid y_{n:0})\, dx_n$$

$$= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n \mid y_{n:0})\, dx_n}{\int p(y_{n:0} \mid x_n)\, p(x_n)\, dx_n} = \frac{\int f(x_n)\, W_n(x_n)\, q(x_n \mid y_{n:0})\, dx_n}{\int W_n(x_n)\, q(x_n \mid y_{n:0})\, dx_n} = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
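
As a sanity check of this self-normalized estimator, a small sketch with an assumed Gaussian target and Gaussian proposal (both, and the constant 7.3, chosen only for illustration); the unknown normalizing constant of the target cancels in the ratio, which is exactly why the weights are normalized.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
N = 20_000

x = rng.normal(0.0, 3.0, size=N)           # samples from the proposal q(x) = N(0, 3^2)
q = gauss_pdf(x, 0.0, 3.0)
p_unnorm = 7.3 * gauss_pdf(x, 1.0, 0.5)    # target p(x), known only up to a constant

W = p_unnorm / q                           # importance weights W(x^(i))
W_tilde = W / W.sum()                      # normalized weights W~(x^(i))

f = lambda x: np.sin(x) + x ** 2
f_hat = np.sum(W_tilde * f(x))             # sum_i W~(x^(i)) f(x^(i))
print(f_hat)
```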

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{n:0} \mid y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\, p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})} \propto \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})\, p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})\, q(x_{n-1:0}^{(i)} \mid y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
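
This recursion is what a sequential implementation iterates at every time step. A hedged one-step sketch follows, where lik, trans and prop are per-particle arrays holding the evaluated densities p(y_n | x_n^(i)), p(x_n^(i) | x_{n-1}^(i)) and q(x_n^(i) | x_{n-1:0}^(i), y_{n:0}) for whatever model is being filtered; the function name and the final normalization are choices of this sketch.

```python
import numpy as np

def sis_weight_step(W_prev, lik, trans, prop):
    """One step of the recursive weight update:
    W_n^(i) is proportional to W_{n-1}^(i) * lik_i * trans_i / prop_i."""
    W = W_prev * lik * trans / prop
    return W / W.sum()   # return normalized weights

# Tiny made-up example with four particles:
W_prev = np.full(4, 0.25)
print(sis_weight_step(W_prev,
                      lik=np.array([0.9, 0.1, 0.4, 0.4]),
                      trans=np.full(4, 0.5),
                      prop=np.full(4, 0.5)))
```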

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it as

$$\hat{N}_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
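
A small sketch of the approximate effective sample size and the resampling trigger just described (the N/2 threshold is a common heuristic picked for this example, not prescribed by the lecture):

```python
import numpy as np

def effective_sample_size(W_tilde):
    """Approximate N_eff = 1 / sum_i (W~(x^(i)))^2 for normalized weights."""
    return 1.0 / np.sum(W_tilde ** 2)

W_even = np.full(5, 0.2)                                  # evenly spread weights
W_skewed = np.array([0.96, 0.01, 0.01, 0.01, 0.01])       # degenerate weights

N = 5
print(effective_sample_size(W_even), effective_sample_size(W_even) < N / 2)      # 5.0, False
print(effective_sample_size(W_skewed), effective_sample_size(W_skewed) < N / 2)  # ~1.08, True
```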

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw sample x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
    Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
  end for
  for i = 1, ..., N do
    Calculate weight W_n^(i) = W_{n-1}^(i) * p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: W~(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
  end for
  Compute N_eff = 1 / sum_{i=1}^N (W~(x_{n:0}^(i)))^2
  if N_eff < P then
    Generate new particles x_n^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W~_n^(i)
    Reset the weights: W_n^(i) = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
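
Below is a hedged, runnable translation of this listing into Python. The model-dependent pieces (prior sampler, proposal sampler and the three densities) are passed in as functions since the slides leave them generic; initializing the weights from the likelihood of the first observation, and returning the posterior mean, are choices of this sketch.

```python
import numpy as np

def sis_filter(y, N, prior_sampler, prop_sampler, prop_pdf, trans_pdf, lik_pdf,
               threshold, rng=None):
    """SIS particle filter with resampling. y holds observations y_0..y_T; the
    samplers and densities operate on per-particle numpy arrays; resampling is
    triggered when the approximate N_eff drops below `threshold`."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = prior_sampler(N, rng)                    # x_0^(i) ~ p(x_0)
    W = lik_pdf(y[0], x)                         # W_0^(i) proportional to p(y_0 | x_0^(i))
    W /= W.sum()
    means = [np.sum(W * x)]                      # posterior estimate, taken before resampling
    for n in range(1, len(y)):
        x_new = prop_sampler(x, y[n], rng)       # x_n^(i) ~ q(x_n | x_{n-1}^(i), y_n)
        W = W * lik_pdf(y[n], x_new) * trans_pdf(x_new, x) / prop_pdf(x_new, x, y[n])
        W /= W.sum()                             # normalize the weights
        x = x_new
        means.append(np.sum(W * x))
        if 1.0 / np.sum(W ** 2) < threshold:     # N_eff below the threshold P
            idx = rng.choice(N, size=N, p=W)     # resample with replacement
            x, W = x[idx], np.full(N, 1.0 / N)
    return np.array(means)
```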

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^(i) ~ p(x_0)
  Set W_0^(i) = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^(i) ~ p(x_n | x_{n-1}^(i))
    Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
  end for
  for i = 1, ..., N do
    Weight update: W_n^(i) = p(y_n | x_n^(i))
  end for
  for i = 1, ..., N do
    Normalize the weights: W~(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
  end for
  Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W~_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
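
To see the bootstrap/SIR steps end to end, here is a self-contained toy run on a univariate nonlinear benchmark-style model (the model, noise levels, particle count and priors are all assumptions of this sketch, not taken from the lecture). Because the proposal is the transition prior, the weight update reduces to the measurement likelihood, exactly as in the listing above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy nonlinear state-space model (a common particle-filter benchmark, chosen only for illustration):
#   x_n = 0.5 x_{n-1} + 25 x_{n-1} / (1 + x_{n-1}^2) + 8 cos(1.2 (n-1)) + w_n,  w_n ~ N(0, 10)
#   y_n = x_n^2 / 20 + v_n,                                                     v_n ~ N(0, 1)
T, N = 50, 1000
sig_w, sig_v = np.sqrt(10.0), 1.0

def f_state(x, n):
    return 0.5 * x + 25.0 * x / (1.0 + x ** 2) + 8.0 * np.cos(1.2 * n)

x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = f_state(x_true[n - 1], n - 1) + sig_w * rng.normal()
    y[n] = x_true[n] ** 2 / 20.0 + sig_v * rng.normal()

particles = rng.normal(0.0, 2.0, size=N)        # x_0^(i) ~ p(x_0), assumed N(0, 2^2)
est = np.zeros(T)
for n in range(1, T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}^(i)).
    particles = f_state(particles, n - 1) + sig_w * rng.normal(size=N)
    # Weight update: W_n^(i) = p(y_n | x_n^(i)) (Gaussian measurement likelihood).
    W = np.exp(-0.5 * ((y[n] - particles ** 2 / 20.0) / sig_v) ** 2)
    W /= W.sum()
    est[n] = np.sum(W * particles)              # estimate computed before resampling
    # Resampling: draw N new particles with replacement according to W.
    particles = particles[rng.choice(N, size=N, p=W)]

print("RMSE of filtered mean:", np.sqrt(np.mean((est[1:] - x_true[1:]) ** 2)))
```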

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$$

where 0 < α < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
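
A tiny hedged sketch of this annealed update (the value of alpha and all numbers are made up; lik, trans and prop are again the per-particle model densities):

```python
import numpy as np

def annealed_weight_step(W_prev, lik, trans, prop, alpha=0.5):
    """Weight update with an annealing factor 0 < alpha < 1 that damps the
    influence of the previous importance weights."""
    W = (W_prev ** alpha) * lik * trans / prop
    return W / W.sum()

W_prev = np.array([0.85, 0.05, 0.05, 0.05])        # skewed weights from the previous step
print(annealed_weight_step(W_prev,
                           lik=np.array([0.2, 0.6, 0.6, 0.6]),
                           trans=np.ones(4),
                           prop=np.ones(4)))
```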

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Embedded video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Embedded videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Embedded videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Embedded videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) \, dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) \, dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) \, dx_n}{\int p(y_{n:0} | x_n) p(x_n) \, dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) \, dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) \, dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
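A toy numerical check of this ratio-of-expectations identity (my own example, not from the lecture): p = N(0, 1) plays the role of the unnormalized posterior, a wider Gaussian q = N(0, 2^2) is the proposal, and f(x) = x^2; the helper gauss_pdf and all constants are illustrative choices.

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Check E_p[f] = E_q[W f] / E_q[W] on a toy example (densities chosen for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=100_000)                 # draws from the proposal q = N(0, 4)
W = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)    # unnormalized importance weights W(x)
estimate = np.sum(W * x ** 2) / np.sum(W)              # ratio of Monte Carlo expectations
print(estimate)                                        # close to E_p[x^2] = 1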


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing the iid samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
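In practice the unnormalized weights are usually kept in log space, since products of likelihoods quickly under- or overflow. A small helper of my own (not from the lecture) for computing the normalized weights:

import numpy as np

# Normalize importance weights given in log space (subtracting the max avoids overflow).
def normalize_log_weights(log_w):
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - np.max(log_w))     # largest term becomes exp(0) = 1
    return w / np.sum(w)

W_tilde = normalize_log_weights([-1002.3, -1000.1, -1001.7])
print(W_tilde, W_tilde.sum())             # finite values that sum to 1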

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
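A sketch of this recursion in code, on a toy 1D model of my own choosing (linear-Gaussian dynamics, direct noisy observation) with the transition prior used as proposal; the model, the function names and all constants are illustrative assumptions, not the lecture's.

import numpy as np

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

# One step of W_n = W_{n-1} * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n), in log space.
# Toy model: x_n = 0.9 x_{n-1} + N(0, 1),  y_n = x_n + N(0, 0.5^2).
def update_log_weight(log_w_prev, x_prev, x_new, y_new, sig_proc=1.0, sig_meas=0.5):
    log_trans = log_gauss(x_new, 0.9 * x_prev, sig_proc)   # log p(x_n | x_{n-1})
    log_lik = log_gauss(y_new, x_new, sig_meas)            # log p(y_n | x_n)
    log_q = log_trans                                      # proposal = transition prior here
    return log_w_prev + log_lik + log_trans - log_q

print(update_log_weight(0.0, x_prev=0.5, x_new=0.7, y_new=0.6))

With the transition prior as proposal the transition terms cancel, so the update reduces to multiplying by the likelihood p(y_n | x_n^{(i)}); this is exactly the bootstrap/SIR choice that appears a few slides below.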

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
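A small simulation of this effect (entirely synthetic, only meant to illustrate the claim): accumulate random per-step log-likelihood increments on a set of particles and watch the largest normalized weight drift towards 1, so that one particle ends up dominating the estimate.

import numpy as np

rng = np.random.default_rng(2)
N, T = 100, 50
log_w = np.zeros(N)
for n in range(T):
    log_w += rng.normal(0.0, 1.0, size=N)        # random log-likelihood increments per step
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    if n % 10 == 0:
        print(n, round(w.max(), 3))              # the max normalized weight creeps towards 1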

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
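The approximate effective sample size and the resampling trigger in code (the threshold value below is an arbitrary choice of mine; the lecture just calls it P):

import numpy as np

def effective_sample_size(w_norm):
    w_norm = np.asarray(w_norm, dtype=float)
    return 1.0 / np.sum(w_norm ** 2)              # N_eff approximation from the slide

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])         # skewed normalized weights
N_eff = effective_sample_size(w)
P = 0.5 * len(w)                                  # e.g. resample when N_eff < N/2
print(N_eff, N_eff < P)                           # about 1.9, so resampling would be triggered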

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new particles x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
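A runnable sketch of this algorithm on a toy 1D model of my own choosing (x_n = 0.9 x_{n-1} + N(0, 1), y_n = x_n + N(0, 0.5^2)), with the transition prior as proposal; the model, the N_eff threshold and all constants are illustrative assumptions, not the lecture's example.

import numpy as np

rng = np.random.default_rng(3)

def simulate(T, sig_proc=1.0, sig_meas=0.5):
    # Toy model: x_n = 0.9 x_{n-1} + w_n,  y_n = x_n + v_n
    x, y = np.zeros(T), np.zeros(T)
    for n in range(1, T):
        x[n] = 0.9 * x[n - 1] + rng.normal(0.0, sig_proc)
        y[n] = x[n] + rng.normal(0.0, sig_meas)
    return x, y

def sis_filter(y, N=500, sig_proc=1.0, sig_meas=0.5):
    P = N / 2                                       # resampling threshold on N_eff
    particles = rng.normal(0.0, 1.0, size=N)        # x_0^(i) ~ p(x_0)
    w = np.full(N, 1.0 / N)
    means = np.zeros(len(y))
    for n in range(1, len(y)):
        # draw from the proposal (here the transition prior)
        particles = 0.9 * particles + rng.normal(0.0, sig_proc, size=N)
        # weight update: transition and proposal cancel, leaving the likelihood
        w = w * np.exp(-0.5 * ((y[n] - particles) / sig_meas) ** 2)
        w = w / np.sum(w)
        means[n] = np.sum(w * particles)            # posterior mean estimate (before resampling)
        if 1.0 / np.sum(w ** 2) < P:                # resample only if N_eff is below the threshold
            idx = rng.choice(N, size=N, p=w)
            particles, w = particles[idx], np.full(N, 1.0 / N)
    return means

x_true, y_obs = simulate(100)
x_est = sis_filter(y_obs)
print(np.mean((x_est[1:] - x_true[1:]) ** 2))       # small tracking error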

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in the sampling-importance resampling (SIR) scheme seen earlier.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
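A minimal multinomial resampling helper (one simple scheme among the many the lecture alludes to; the function name and example values are mine). The duplicated copies of the high-weight particle in the output are precisely the source of the correlation mentioned above.

import numpy as np

# Multinomial resampling with replacement: high-weight particles get duplicated,
# low-weight particles tend to disappear, and weights are reset to 1/N.
def multinomial_resample(particles, weights, rng):
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

rng = np.random.default_rng(4)
parts = np.array([-1.0, 0.0, 2.0, 5.0])
w = np.array([0.05, 0.05, 0.8, 0.1])
new_parts, new_w = multinomial_resample(parts, w, rng)
print(new_parts, new_w)    # 2.0 typically appears several times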

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
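A compact sketch of one bootstrap/SIR step on the same toy 1D model used in the earlier sketches (again an illustrative assumption, not the lecture's example): since the transition prior is the proposal, the new weight is just the likelihood, and resampling is done at every step.

import numpy as np

def sir_step(particles, y_n, rng, sig_proc=1.0, sig_meas=0.5):
    # predict by sampling the transition prior p(x_n | x_{n-1}^(i))
    particles = 0.9 * particles + rng.normal(0.0, sig_proc, size=len(particles))
    w = np.exp(-0.5 * ((y_n - particles) / sig_meas) ** 2)   # weight = p(y_n | x_n^(i))
    w /= np.sum(w)
    mean = np.sum(w * particles)                  # filtered estimate, computed before resampling
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], mean                   # equally weighted particles after resampling

rng = np.random.default_rng(5)
particles = rng.normal(0.0, 1.0, size=500)
particles, est = sir_step(particles, y_n=1.2, rng=rng)
print(est)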

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is the annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
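The annealed update in log-space form (the function name, the value of alpha and the numbers below are arbitrary illustrations of mine):

# Annealed weight update: (W_{n-1})^alpha becomes alpha * log W_{n-1} in log space.
def annealed_log_weight(log_w_prev, log_lik, log_trans, log_q, alpha=0.5):
    return alpha * log_w_prev + log_lik + log_trans - log_q

print(annealed_log_weight(log_w_prev=-2.0, log_lik=-0.3, log_trans=-1.1, log_q=-1.1))  # -1.3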

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Page 10: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Some useful definitions

Definition

Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t

Definition

Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t

Definition

Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69

Stochastic filtering problem

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying

Figure A graphical model of the state space model (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

\[ E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}\bigl[W_n(x_n)\,f(x_n)\bigr]}{E_{q(x_n \mid y_{n:0})}\bigl[W_n(x_n)\bigr]} \]

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

\[
E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\bigl(x_n^{(i)}\bigr)\,f\bigl(x_n^{(i)}\bigr)}{\frac{1}{N}\sum_{i=1}^{N} W_n\bigl(x_n^{(i)}\bigr)}
            = \sum_{i=1}^{N} \tilde{W}\bigl(x_n^{(i)}\bigr)\,f\bigl(x_n^{(i)}\bigr) \equiv \hat{f}(x)
\]

where the normalized weights are defined as

\[ \tilde{W}\bigl(x_n^{(i)}\bigr) = \frac{W_n\bigl(x_n^{(i)}\bigr)}{\sum_{j=1}^{N} W_n\bigl(x_n^{(j)}\bigr)} \]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
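
A minimal sketch of this self-normalized importance sampling estimator, assuming a toy 1-D target known only up to a constant and a Gaussian proposal (both are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Assumed unnormalized target, proportional to N(x; 1, 0.5); in the filtering
# setting this plays the role of p(y_{n:0} | x_n) p(x_n).
def target_unnorm(x):
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)

# Proposal q(x) = N(0, 2^2): easy to sample and heavier-tailed than the target.
x = rng.normal(0.0, 2.0, size=N)

W = target_unnorm(x) / gauss_pdf(x, 0.0, 2.0)   # unnormalized weights W_n(x^(i))
W_tilde = W / W.sum()                           # normalized weights

f_hat = np.sum(W_tilde * x)                     # f(x) = x, i.e. the posterior mean
print(f"Self-normalized IS estimate of the mean: {f_hat:.3f} (exact: 1.0)")
```

The normalization step is what lets us ignore the unknown constant p(y_{n:0}), exactly as in the ratio above.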

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

\[ q(x_{n:0} \mid y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0}) \]

As before the posterior can be written as

\[ p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})} \]

We can then create a recursive rule to update the weights

\[
\begin{aligned}
W_n^{(i)} &= \frac{p\bigl(x_{n:0}^{(i)} \mid y_{n:0}\bigr)}{q\bigl(x_{n:0}^{(i)} \mid y_{n:0}\bigr)}
 \propto \frac{p\bigl(y_n \mid x_n^{(i)}\bigr)\,p\bigl(x_n^{(i)} \mid x_{n-1}^{(i)}\bigr)\,p\bigl(x_{n-1:0}^{(i)} \mid y_{n-1:0}\bigr)}{q\bigl(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}\bigr)\,q\bigl(x_{n-1:0}^{(i)} \mid y_{n-1:0}\bigr)} \\
 &= W_{n-1}^{(i)}\,\frac{p\bigl(y_n \mid x_n^{(i)}\bigr)\,p\bigl(x_n^{(i)} \mid x_{n-1}^{(i)}\bigr)}{q\bigl(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}\bigr)}
\end{aligned}
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
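
This weight recursion is the core of all the sequential filters that follow. As a sketch, it is a one-liner once the model supplies the three densities evaluated at each particle; the helper below and its argument names are hypothetical, not from the slides.

```python
import numpy as np

def update_weights(W_prev, lik, trans, prop):
    """One recursive weight update for N particles.

    W_prev[i] : W_{n-1}^(i)
    lik[i]    : p(y_n | x_n^(i))
    trans[i]  : p(x_n^(i) | x_{n-1}^(i))
    prop[i]   : q(x_n^(i) | x_{n-1:0}^(i), y_{n:0})
    Returns the normalized weights at time n.
    """
    W = np.asarray(W_prev) * np.asarray(lik) * np.asarray(trans) / np.asarray(prop)
    return W / W.sum()
```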

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in the full path posterior p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

\[ W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\bigl(y_n \mid x_n^{(i)}\bigr)\,p\bigl(x_n^{(i)} \mid x_{n-1}^{(i)}\bigr)}{q\bigl(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n\bigr)} \]

The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

\[ N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\bigl[(W(x_{n:0}))^{2}\bigr]} \le N \]

In practice this cannot be computed so we approximate

\[ N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \bigl(\tilde{W}(x_{n:0}^{(i)})\bigr)^{2}} \]

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
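
A sketch of the effective-sample-size test, assuming the weights have already been normalized (the example weight vectors are purely illustrative):

```python
import numpy as np

def effective_sample_size(W_tilde):
    """N_eff ~ 1 / sum_i W_tilde_i^2 for normalized weights (they sum to 1)."""
    W_tilde = np.asarray(W_tilde)
    return 1.0 / np.sum(W_tilde ** 2)

# Uniform weights give N_eff = N; one dominant weight drives N_eff towards 1.
print(effective_sample_size(np.full(100, 0.01)))          # 100.0
print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # about 1.06

# Resampling would then be triggered whenever effective_sample_size(W) < P.
```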

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^(i) = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^(i)) = W(x^(i)) / sum_{j=1}^{N} W(x^(j))
    end for
    Compute N_eff = 1 / sum_{i=1}^{N} (\tilde{W}(x_{n:0}^(i)))^2
    if N_eff < P then
        Generate new x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)},
        with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = \tilde{W}_{n:0}^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
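
Below is a compact numpy sketch of this SIS filter with resampling on an assumed 1-D random-walk model with Gaussian noise. The model, its constants, and the threshold P = N/2 are illustrative assumptions; the transition prior is used as the proposal, so the generic weight update collapses to W_n proportional to W_{n-1} p(y_n | x_n).

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed toy model: x_n = x_{n-1} + w_n, y_n = x_n + v_n, with Gaussian noise.
T, N = 50, 500
q_std, r_std = 0.5, 1.0
x_true = np.cumsum(q_std * rng.standard_normal(T))
y = x_true + r_std * rng.standard_normal(T)

def likelihood(y_n, x):
    # p(y_n | x_n), Gaussian up to a constant (constants cancel after normalization).
    return np.exp(-0.5 * ((y_n - x) / r_std) ** 2)

particles = rng.normal(0.0, 1.0, N)          # x_0^(i)
W = np.full(N, 1.0 / N)
estimates = []
for n in range(T):
    # Draw x_n^(i) from the proposal; here q = p(x_n | x_{n-1}^(i)), the transition prior.
    particles = particles + q_std * rng.standard_normal(N)
    W = W * likelihood(y[n], particles)       # weight update
    W = W / W.sum()                           # normalization
    estimates.append(np.sum(W * particles))   # MMSE estimate, computed before resampling
    if 1.0 / np.sum(W ** 2) < N / 2:          # N_eff below threshold P = N/2
        idx = rng.choice(N, size=N, replace=True, p=W)   # resample with replacement
        particles = particles[idx]
        W = np.full(N, 1.0 / N)               # reset weights to 1/N

rmse = np.sqrt(np.mean((np.array(estimates) - x_true) ** 2))
print(f"Filtering RMSE over {T} steps: {rmse:.3f}")
```

Resampling only when N_eff drops below P keeps more weight diversity than resampling at every step, which is the main practical difference from the bootstrap/SIR filter described next.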

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in the sampling-importance resampling (SIR) scheme seen earlier.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: draw x_n^(i) ~ p(x_n | x_{n-1}^(i))
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^(i)) = W(x^(i)) / sum_{j=1}^{N} W(x^(j))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to \tilde{W}_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
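
For comparison with the SIS sketch above, here is one time step of the bootstrap/SIR loop written as a hypothetical helper; the likelihood and transition-sampling functions are passed in, since they depend on whatever model is assumed.

```python
import numpy as np

def sir_step(particles, y_n, likelihood, sample_transition, rng):
    """One bootstrap/SIR time step.

    particles         : array of x_{n-1}^(i)
    likelihood        : callable, likelihood(y_n, x) = p(y_n | x_n)
    sample_transition : callable, draws x_n^(i) ~ p(x_n | x_{n-1}^(i))
    """
    N = particles.shape[0]
    particles = sample_transition(particles, rng)        # sample from the transition prior
    W = np.asarray(likelihood(y_n, particles), float)    # weight update: likelihood only
    W = W / W.sum()                                       # normalization
    idx = rng.choice(N, size=N, replace=True, p=W)        # resampling at every step
    return particles[idx]                                 # weights implicitly reset to 1/N
```

With, for example, sample_transition = lambda x, rng: x + 0.5 * rng.standard_normal(x.shape) and the Gaussian likelihood from the previous sketch, this reproduces the loop above; because it resamples at every step, the returned particles carry equal weights.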

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can also be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

\[ W_n^{(i)} = \bigl(W_{n-1}^{(i)}\bigr)^{\alpha}\,\frac{p\bigl(y_n \mid x_n^{(i)}\bigr)\,p\bigl(x_n^{(i)} \mid x_{n-1}^{(i)}\bigr)}{q\bigl(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n\bigr)} \]

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sampling Importance Resampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
= \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written recursively as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
\propto \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)}|y_{n-1:0})}
= W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
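
In code, one update step amounts to a pointwise product followed by a normalization. A minimal sketch (assuming NumPy; lik, trans and prop are illustrative arrays holding the evaluated densities per particle, not lecture code):

import numpy as np

def update_weights(W_prev, lik, trans, prop):
    # W_n^(i) = W_{n-1}^(i) * p(y_n|x_n^(i)) * p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1:0}^(i), y_{n:0})
    W = W_prev * lik * trans / prop
    return W / np.sum(W)   # return normalized weights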

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
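
A minimal sketch of this check (assuming NumPy; the example weights and the N/2 threshold are illustrative choices, not from the lecture):

import numpy as np

def effective_sample_size(W):
    # N_eff approximation from the normalized importance weights.
    return 1.0 / np.sum(W ** 2)

W = np.array([0.7, 0.1, 0.1, 0.05, 0.05])      # already normalized
P = 0.5 * len(W)                               # e.g. resample when N_eff < N/2
print(effective_sample_size(W), effective_sample_size(W) < P)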

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for
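
The loop above can be sketched in Python for a scalar state (a sketch under stated assumptions, not the lecture's code; it assumes NumPy, an N/2 resampling threshold, and user-supplied proposal, transition and likelihood densities with the illustrative names below):

import numpy as np

rng = np.random.default_rng(2)

def sis_filter(y, x0, propose, prop_pdf, trans_pdf, lik_pdf, threshold=None):
    # propose(x_prev, y_n)     -> new particles drawn from q(x_n | x_{n-1}, y_n)
    # prop_pdf(x, x_prev, y_n) -> q(x_n | x_{n-1}, y_n)
    # trans_pdf(x, x_prev)     -> p(x_n | x_{n-1})
    # lik_pdf(y_n, x)          -> p(y_n | x_n)
    x = np.asarray(x0, dtype=float)
    N = len(x)
    P = 0.5 * N if threshold is None else threshold
    W = np.full(N, 1.0 / N)
    means = []
    for y_n in y:
        x_prev = x
        x = propose(x_prev, y_n)                                        # draw x_n^(i) ~ q
        W = W * lik_pdf(y_n, x) * trans_pdf(x, x_prev) / prop_pdf(x, x_prev, y_n)
        W = W / np.sum(W)                                               # normalize
        means.append(np.sum(W * x))                                     # estimate before resampling
        if 1.0 / np.sum(W ** 2) < P:                                    # N_eff below threshold?
            idx = rng.choice(N, size=N, p=W)                            # resample with replacement
            x, W = x[idx], np.full(N, 1.0 / N)
    return np.array(means)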

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
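
For concreteness, here is a minimal bootstrap-filter run on a toy 1-D model (random-walk state observed in Gaussian noise); the model, noise levels and particle count are invented for illustration only:

import numpy as np

rng = np.random.default_rng(3)

# Toy model: x_{n+1} = x_n + w_n,  y_n = x_n + v_n
T, N = 50, 500
sig_w, sig_v = 0.3, 0.5
x_true = np.cumsum(rng.normal(0.0, sig_w, T))
y = x_true + rng.normal(0.0, sig_v, T)

x = rng.normal(0.0, 1.0, N)                       # x_0^(i) ~ p(x_0)
estimates = []
for n in range(T):
    x = x + rng.normal(0.0, sig_w, N)             # sample from the transition prior
    W = np.exp(-0.5 * ((y[n] - x) / sig_v) ** 2)  # weight = likelihood p(y_n|x_n)
    W /= W.sum()                                  # normalize
    estimates.append(np.sum(W * x))               # filtered mean (before resampling)
    x = x[rng.choice(N, size=N, p=W)]             # resample according to the weights
print(np.mean((np.array(estimates) - x_true) ** 2))   # rough tracking error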

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
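
As a sketch of this variant (assuming NumPy; lik, trans and prop are the same illustrative per-particle density arrays used earlier), only the previous-weight term changes:

import numpy as np

def update_weights_annealed(W_prev, lik, trans, prop, alpha=0.5):
    # W_n = W_{n-1}^alpha * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n)
    W = (W_prev ** alpha) * lik * trans / prop
    return W / np.sum(W)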

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking. Figure Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]  [Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking. Figure Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]  [Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking. Figure Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]  [Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_eff = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_eff \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
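In code the approximate effective sample size is a one-liner over the normalized weights; the example weight vector and the threshold of N/2 below are arbitrary choices for illustration, not values taken from the lecture.

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~= 1 / sum_i w_tilde_i^2, computed from normalized weights
    return 1.0 / np.sum(np.asarray(w_tilde) ** 2)

w_tilde = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # a skewed weight vector
P = 0.5 * len(w_tilde)                            # assumed threshold, e.g. N/2
print(effective_sample_size(w_tilde), effective_sample_size(w_tilde) < P)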

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
  end for
  Compute N_eff = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_eff < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
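The pseudocode above translates fairly directly into NumPy. The sketch below implements one filtering step; the function arguments (proposal sampler, proposal density, transition density, likelihood) and the default threshold of N/2 are names and choices invented for this illustration, not part of the lecture material.

import numpy as np

rng = np.random.default_rng(0)

def sis_step(particles, weights, y_n, propose, q_pdf, trans_pdf, lik_pdf, threshold=None):
    # One SIS step with resampling when N_eff drops below the threshold.
    #   propose(particles, y_n)      draws x_n^(i) from q(x_n | x_{n-1}^(i), y_n)
    #   q_pdf(x_new, x_prev, y_n)    evaluates the proposal density
    #   trans_pdf(x_new, x_prev)     evaluates p(x_n | x_{n-1})
    #   lik_pdf(y_n, x_new)          evaluates p(y_n | x_n)
    N = len(weights)
    x_new = propose(particles, y_n)

    # Weight update: W_n = W_{n-1} * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
    w = weights * lik_pdf(y_n, x_new) * trans_pdf(x_new, particles) / q_pdf(x_new, particles, y_n)
    w_tilde = w / w.sum()

    # Resample (multinomial, with replacement) only if the weights are too skewed.
    n_eff = 1.0 / np.sum(w_tilde ** 2)
    if n_eff < (threshold if threshold is not None else N / 2.0):
        idx = rng.choice(N, size=N, p=w_tilde)
        x_new, w_tilde = x_new[idx], np.full(N, 1.0 / N)
    return x_new, w_tilde

With the bootstrap choice q(x_n|x_{n-1}, y_n) = p(x_n|x_{n-1}) the transition and proposal terms cancel and the update reduces to multiplying by the likelihood, which is exactly the SIR filter shown two slides below.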

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: \hat{x}_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
    Set \hat{x}_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
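For concreteness, here is a minimal bootstrap/SIR filter on a toy one-dimensional model; the random-walk dynamics, the Gaussian likelihood and the noise levels are invented for this sketch and are not taken from the lecture.

import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: x_{n+1} = x_n + w_n, y_n = x_n + v_n,
# with w_n ~ N(0, 0.1^2) and v_n ~ N(0, 0.5^2).
T, N = 50, 1000
sig_w, sig_v = 0.1, 0.5

x_true = np.cumsum(rng.normal(0.0, sig_w, T))   # simulated states
y = x_true + rng.normal(0.0, sig_v, T)          # simulated observations

particles = rng.normal(0.0, 1.0, N)             # x_0^(i) ~ p(x_0), here N(0, 1)
estimates = []
for n in range(T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}).
    particles = particles + rng.normal(0.0, sig_w, N)
    # Weight update reduces to the likelihood p(y_n | x_n).
    w = np.exp(-0.5 * ((y[n] - particles) / sig_v) ** 2)
    w /= w.sum()
    # MMSE estimate, computed before resampling (see the remarks two slides below).
    estimates.append(np.sum(w * particles))
    # Resampling with replacement at every step, as in the SIR filter.
    particles = particles[rng.choice(N, size=N, p=w)]

print(np.mean((np.array(estimates) - x_true) ** 2))   # filtering mean squared error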

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights (sketched in code below).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
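The annealing-factor variant in the last bullet is a one-line change to the weight update of the earlier sis_step sketch; the value of alpha below is a free parameter of this illustration (the slide only requires 0 < alpha < 1).

import numpy as np

def annealed_weight_update(weights, lik, trans, prop, alpha=0.8):
    # W_n = W_{n-1}^alpha * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
    w = (weights ** alpha) * lik * trans / prop
    return w / np.sum(w)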

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 13: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

Figure Careful today change of notation z is now x and x is now y

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Simplified model discrete case

The generic stochastic filtering problem

xt = f(t xt ut wt) (state equation)

yt = g(t xt ut vt) (measurement equation)

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69

Stochastic filtering is an inverse problem

Given yn0 provided f and g are known one needs to find the bestestimate xn

This is an inverse problem Find the inputs sequentially with amapping function which yields the output data

This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change
\[
W_n^{(i)} = \big(W_{n-1}^{(i)}\big)^{\alpha} \,
\frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}
\]
where $0 < \alpha < 1$ is the annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
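A log-space version of this annealed update could look as follows; the function and argument names are illustrative only.

import numpy as np

def annealed_logw_update(logw_prev, loglik, logf, logq, alpha=0.5):
    """Log of W_n = (W_{n-1})^alpha * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n),
    with 0 < alpha < 1 damping the influence of the previous weights."""
    return alpha * logw_prev + loglik + logf - logq

logw = annealed_logw_update(np.array([-1.0, -2.0]), np.array([-0.5, -0.3]),
                            np.array([-1.2, -0.9]), np.array([-1.0, -1.0]))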

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]   [Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]   [Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]   [Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 15: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Stochastic filtering is an inverse problem

Given y_{n:0}, provided f and g are known, one needs to find the best estimate x̂_n.

This is an inverse problem: find the inputs sequentially with a mapping function which yields the output data.

This is an ill-posed problem, since the inverse learning problem is one-to-many: the mapping from output to input is generally non-unique.

Definition

A problem is well-posed if it satisfies existence uniqueness and stability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69

Intractable Bayesian problems

Normalization: given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained by dividing by the normalization factor p(y):

    p(x|y) = p(y|x) p(x) / ∫_X p(y|x) p(x) dx

Marginalization: given the joint posterior, the marginal posterior is

    p(x|y) = ∫_Z p(x, z | y) dz

Expectation:

    E_{p(x|y)}[f(x)] = ∫_X f(x) p(x|y) dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
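For a one-dimensional state these intractable operations can be approximated numerically on a grid; the sketch below, with an assumed Gaussian prior, an assumed Gaussian likelihood and an assumed observation y = 1.2, is only meant to make the normalization and expectation integrals concrete (marginalization is trivial in 1-D and is omitted).

import numpy as np

# Discretize x; prior p(x) = N(0, 1), likelihood p(y|x) = N(y; x, 0.5^2), observed y = 1.2
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
lik = np.exp(-0.5 * ((1.2 - x) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

evidence = np.sum(lik * prior) * dx        # p(y) = ∫ p(y|x) p(x) dx
posterior = lik * prior / evidence         # normalization
post_mean = np.sum(x * posterior) * dx     # expectation E_{p(x|y)}[x], ≈ 0.96 in this conjugate case
print(evidence, post_mean)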

Recursive Bayesian estimation I

Let p(x_n | y_{n:0}) be the conditional pdf of x_n:

    p(x_n | y_{n:0}) = p(y_{n:0} | x_n) p(x_n) / p(y_{n:0})
                     = p(y_n, y_{n-1:0} | x_n) p(x_n) / p(y_n, y_{n-1:0})
                     = p(y_n | y_{n-1:0}, x_n) p(y_{n-1:0} | x_n) p(x_n) / [ p(y_n | y_{n-1:0}) p(y_{n-1:0}) ]
                     = p(y_n | y_{n-1:0}, x_n) p(x_n | y_{n-1:0}) p(y_{n-1:0}) p(x_n) / [ p(y_n | y_{n-1:0}) p(y_{n-1:0}) p(x_n) ]
                     = p(y_n | x_n) p(x_n | y_{n-1:0}) / p(y_n | y_{n-1:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69


Recursive Bayesian estimation II

The posterior density is described with three terms:

    p(x_n | y_{n:0}) = p(y_n | x_n) p(x_n | y_{n-1:0}) / p(y_n | y_{n-1:0})

Prior: defines the knowledge of the model,

    p(x_n | y_{n-1:0}) = ∫ p(x_n | x_{n-1}) p(x_{n-1} | y_{n-1:0}) dx_{n-1}

Likelihood: p(y_n | x_n) determines the measurement noise model.

Evidence: involves the integral

    p(y_n | y_{n-1:0}) = ∫ p(y_n | x_n) p(x_n | y_{n-1:0}) dx_n

We need to define a criterion for optimal filtering.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
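A hedged sketch of one prediction/correction step of this recursion on a tiny discrete state space; the transition matrix, the likelihood vector and the observation below are made-up values, used only to show where the prior, likelihood and evidence enter.

import numpy as np

# Hypothetical 3-state example: the rows of A and the likelihood values are assumptions.
A = np.array([[0.80, 0.15, 0.05],
              [0.20, 0.60, 0.20],
              [0.05, 0.15, 0.80]])     # A[i, j] = p(x_n = j | x_{n-1} = i)
lik = np.array([0.9, 0.3, 0.05])       # p(y_n | x_n = j) for the observed y_n

belief = np.array([1/3, 1/3, 1/3])     # p(x_{n-1} | y_{n-1:0})
prior = belief @ A                     # prediction: Σ_{x_{n-1}} p(x_n|x_{n-1}) p(x_{n-1}|y_{n-1:0})
evidence = np.sum(lik * prior)         # p(y_n | y_{n-1:0})
posterior = lik * prior / evidence     # Bayes correction
print(prior, evidence, posterior)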


Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error

    E[ ||x_n − x̂_n||₂² | y_{n:0} ] = ∫ ||x_n − x̂_n||₂² p(x_n | y_{n:0}) dx_n

which aims to find the conditional mean

    x̂_n = E[x_n | y_{n:0}] = ∫ x_n p(x_n | y_{n:0}) dx_n

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n | y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n | y_{n:0}).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69


Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
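A small sampling experiment, assuming a log-normal "posterior" purely for illustration, shows how the three criteria disagree on a skewed unimodal distribution: the mean, median and mode all differ.

import numpy as np

rng = np.random.default_rng(1)
s = rng.lognormal(mean=0.0, sigma=0.9, size=200_000)   # skewed, unimodal samples

mmse = s.mean()                                        # MMSE -> conditional mean
minimax = np.median(s)                                 # minimax -> median
hist, edges = np.histogram(s, bins=400, range=(0, 10), density=True)
map_est = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])   # crude MAP -> mode
print(mmse, minimax, map_est)                          # three different answers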

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

    E_{p(x,y)}[ − log p̂(x|y) ] = ∫ p(x, y) log [ 1 / p̂(x|y) ] dx dy

Minimum conditional KL divergence:

    KL(p ‖ p̂) = ∫ p(x, y) log [ p(x, y) / ( p̂(x|y) p(x) ) ] dx dy

where KL is a measure of divergence between distributions, with KL(p ‖ p̂) ≥ 0; it is 0 only when the distributions are the same.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69


Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: it is a bound related to the maximum log-likelihood, which one aims to minimize:

    F(Q; P) ≡ E_{Q(x)}[ − log P(x|y) ]
            = E_{Q(x)}[ log Q(x) / P(x|y) ] − E_{Q(x)}[ log Q(x) ]
            = KL(Q ‖ P) − H(Q)

This optimization can be done using an (EM-type) algorithm:

    Q(x)^{n+1} ← arg max_Q F(Q, P)
    x^{n+1} ← arg max_x F(Q, P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented, and they are closely related.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

    E[ ||x_n − x̂_n||₂² | y_{n:0} ] = ∫ ||x_n − x̂_n||₂² p(x_n | y_{n:0}) dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete, simplified case

    x_{n+1} = f(x_n, w_n)
    y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

    x_{n+1} = F_{n+1,n} x_n + w_n
    y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

    α(z_n) = N(z_n | μ_n, V_n)

Using the HMM recursion formulas for continuous variables we have

    c_n α(z_n) = p(x_n | z_n) ∫ α(z_{n-1}) p(z_n | z_{n-1}) dz_{n-1}

Substituting the conditionals we have

    c_n N(z_n | μ_n, V_n) = N(x_n | C z_n, Σ) ∫ N(z_{n-1} | μ_{n-1}, V_{n-1}) N(z_n | A z_{n-1}, Γ) dz_{n-1}
                          = N(x_n | C z_n, Σ) N(z_n | A μ_{n-1}, P_{n-1})

Here we assume that μ_{n-1} and V_{n-1} are known, and we have defined

    P_{n-1} = A V_{n-1} Aᵀ + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of μ_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, with mean μ_n and covariance V_n, as well as the normalization coefficient c_n:

    μ_n = A μ_{n-1} + K_n (x_n − C A μ_{n-1})
    V_n = (I − K_n C) P_{n-1}
    c_n = N(x_n | C A μ_{n-1}, C P_{n-1} Cᵀ + Σ)

where the Kalman gain matrix is defined as

    K_n = P_{n-1} Cᵀ (C P_{n-1} Cᵀ + Σ)⁻¹

The initial conditions are given by

    μ_1 = μ_0 + K_1 (x_1 − C μ_0),    V_1 = (I − K_1 C) V_0
    c_1 = N(x_1 | C μ_0, C V_0 Cᵀ + Σ),    K_1 = V_0 Cᵀ (C V_0 Cᵀ + Σ)⁻¹

The interpretation is: make a prediction, then correct it with the gain K_n.

The likelihood can be computed as p(X) = ∏_{n=1}^N c_n.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
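A compact sketch of one forward step in the notation of these slides; the function signature and variable names are my own, and the matrices A, C, Γ, Σ are assumed to be supplied by the caller.

import numpy as np

def kalman_step(mu_prev, V_prev, x_obs, A, C, Gamma, Sigma):
    """One forward (predict + correct) Kalman step in the slide notation."""
    P = A @ V_prev @ A.T + Gamma                       # P_{n-1}
    S = C @ P @ C.T + Sigma                            # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)                     # Kalman gain K_n
    mu = A @ mu_prev + K @ (x_obs - C @ A @ mu_prev)   # mu_n
    V = (np.eye(len(mu_prev)) - K @ C) @ P             # V_n
    # log of the normalization constant c_n = N(x_n | C A mu_{n-1}, S)
    r = x_obs - C @ A @ mu_prev
    log_c = -0.5 * (r @ np.linalg.solve(S, r)
                    + np.log(np.linalg.det(S)) + len(r) * np.log(2 * np.pi))
    return mu, V, log_c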

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x_n | y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

    x_{n+1} = f(x_n, w_n)
    y_n = g(x_n, v_n)

These equations are linearized in the EKF:

    F_{n+1,n} = ∂f(x)/∂x |_{x = x̂_n},    G_n = ∂g(x)/∂x |_{x = x̂_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n | y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
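A sketch of one EKF predict/correct step, assuming the user supplies the nonlinear functions f, g and routines that return their Jacobians; the structure mirrors the linearization described above, but the interface is an illustrative assumption.

import numpy as np

def ekf_step(x_est, P, y, f, g, F_jac, G_jac, Q, R):
    """One EKF step: linearize f and g at the current estimate, then apply the Kalman update.
    f, g are the (nonlinear) transition/measurement functions; F_jac, G_jac return their Jacobians."""
    # predict
    x_pred = f(x_est)
    F = F_jac(x_est)                        # df/dx evaluated at x_est
    P_pred = F @ P @ F.T + Q
    # correct
    G = G_jac(x_pred)                       # dg/dx evaluated at the prediction
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - g(x_pred))
    P_new = (np.eye(len(x_pred)) - K @ G) @ P_pred
    return x_new, P_new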

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It is a brute-force technique which, provided that one can draw i.i.d. samples x^(1), ..., x^(N) from the probability distribution P(x), gives

    ∫_X f(x) dP(x) ≈ (1/N) Σ_{i=1}^N f(x^(i)) ≡ f̂_N

for which E[f̂_N] = E[f] and Var[f̂_N] = (1/N) Var[f] = σ²/N.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), f̂_N(x) converges to E[f(x)] with high probability.

The convergence rate is assessed by the Central Limit Theorem:

    √N ( f̂_N − E[f] ) ∼ N(0, σ²)

where σ² is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
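A quick numerical check of these properties, estimating E[x²] = 1 for x ~ N(0, 1); the particular test function is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(2)
for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)
    est = np.mean(x**2)
    print(N, est, abs(est - 1.0))   # error shrinks roughly like O(N^{-1/2})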


Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

    ∫_X f(x) dP(x) ≈ (1/N) Σ_{i=1}^N f(x^(i)) ≡ f̂_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = ∫ f(x) dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

    ∫ f(x) p(x) dx = ∫ f(x) [ p(x)/q(x) ] q(x) dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

    f̂ = (1/N) Σ_{i=1}^N W(x^(i)) f(x^(i))

where W(x^(i)) = p(x^(i)) / q(x^(i)) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
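A minimal importance-sampling sketch, with an assumed target p = N(3, 1), an assumed wider proposal q = N(0, 3²), and f(x) = x; it shows both the plain estimate and the self-normalized variant used when p is known only up to a constant.

import numpy as np

rng = np.random.default_rng(3)

def p_pdf(x):                   # target p = N(3, 1)
    return np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):                   # wider proposal q = N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

N = 200_000
xq = 3.0 * rng.normal(size=N)              # samples from q
w = p_pdf(xq) / q_pdf(xq)                  # importance weights W(x) = p(x)/q(x)
f = xq                                     # f(x) = x, so E_p[f] = 3
print(np.mean(w * f))                      # (1/N) sum W(x_i) f(x_i)
print(np.sum(w * f) / np.sum(w))           # self-normalized estimate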

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized, we use

    f̂ = Σ_{i=1}^N W̃(x^(i)) f(x^(i)),    with    W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))

The variance of the estimate is given by

    Var[f̂] = (1/N) Var[ f(x) W(x) ] = (1/N) Var[ f(x) p(x)/q(x) ]
           = (1/N) ∫ ( f(x) p(x) / q(x) )² q(x) dx − (E[f(x)])² / N

The variance can be reduced when q(x) is chosen to

    match the shape of p(x), so as to approximate the true variance;
    match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure is:

for n = 1 to N do
  Sample x ~ q(x)
  Sample u ~ U(0, 1)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

  if C is too small, the samples are not reliable because of the low rejection rate;
  if C is too large, sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

    p(x|y) = p(y|x) p(x) / p(y) ≤ C q(x) / p(y) ≡ C' q(x)

and the acceptance rate for sample x is p(x|y) / (C' q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
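A sketch of rejection sampling from an unnormalized bimodal target using a wide Gaussian proposal; the target, the proposal and the bound C = 15 are assumptions chosen so that p(x) ≤ C q(x) holds.

import numpy as np

rng = np.random.default_rng(4)

def p_unnorm(x):                                  # target, known up to a constant
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def q_pdf(x):                                     # proposal N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

C = 15.0                                          # assumed bound: p_unnorm(x) <= C * q_pdf(x)
samples = []
while len(samples) < 10_000:
    x = 3.0 * rng.normal()                        # draw x ~ q
    u = rng.uniform()                             # draw u ~ U(0, 1)
    if u <= p_unnorm(x) / (C * q_pdf(x)):         # accept with probability p/(Cq)
        samples.append(x)
print(np.mean(samples))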

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

    q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the "telescope" law of probability, we have

    p(x_{n:0}) = p(x_0) p(x_1 | x_0) ··· p(x_n | x_0, ..., x_{n-1})
    q(x_{n:0}) = q_0(x_0) q_1(x_1 | x_0) ··· q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

    W_n(x_{n:0}) = p(x_{n:0}) / q(x_{n:0}) = W_{n-1}(x_{n-1:0}) · p(x_n | x_{n-1:0}) / q_n(x_n | x_{n-1:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it does not rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm only a few (or one) of the W(x^(i)) will be nonzero.

We will see that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^(i)}_{i=1}^N from q(x)
for i = 1, ..., N do
  W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, ..., N do
  Normalize weights: W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
end for
Resample with replacement N times from the discrete set {x^(i)}_{i=1}^N, where the probability of resampling from each x^(i) is proportional to W̃(x^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
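A small helper for the resampling step (multinomial resampling with replacement, weights reset to 1/N afterwards); the particle values and weights in the usage lines are made up.

import numpy as np

def multinomial_resample(particles, weights, rng):
    """Resample with replacement: each new particle is x^(i) with probability W(x^(i)).
    After resampling, all weights are reset to 1/N."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)

# usage with made-up particles and normalized weights
rng = np.random.default_rng(5)
x = np.array([0.1, 0.5, 0.9, 1.3])
w = np.array([0.05, 0.80, 0.10, 0.05])
print(multinomial_resample(x, w, rng))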

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

  Deterministic resampling is taken at every k-th time step (usually k = 1).
  Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^(i), W^(i)} are replaced by the new samples with equal importance weights (i.e. W^(i) = 1/N).

Resampling is important because

  if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
  when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It is a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]ᵀ, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
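A sketch of a Gibbs sampler for a case where the full conditionals are known in closed form: a bivariate Gaussian with unit variances and correlation rho (the example model is an assumption, not from the lecture).

import numpy as np

rng = np.random.default_rng(6)
rho, M = 0.8, 20_000
x1, x2 = 0.0, 0.0
samples = np.zeros((M, 2))
for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # x1 | x2 ~ N(rho*x2, 1 - rho^2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # x2 | x1 ~ N(rho*x1, 1 - rho^2)
    samples[n] = (x1, x2)
print(np.corrcoef(samples[M // 2:].T))               # empirical correlation close to rho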

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

    ∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | −∇∇ log f(x̂) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
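A 1-D sketch of the Laplace formula above, with an assumed unimodal integrand; the mode is found by a crude grid search, the Hessian of log f by a finite difference, and the result is compared with direct numerical integration.

import numpy as np

def log_f(x):                                        # assumed unimodal log-integrand
    return -(x - 1.0) ** 4 - 0.5 * x ** 2

xs = np.linspace(-5, 5, 200_001)
x_hat = xs[np.argmax(log_f(xs))]                     # crude mode search
h = 1e-4
hess = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2   # d^2 log f / dx^2 at the mode
laplace = np.sqrt(2 * np.pi / -hess) * np.exp(log_f(x_hat))              # (2π)^{1/2} f(x̂) |−∇∇ log f(x̂)|^{-1/2}
numeric = np.sum(np.exp(log_f(xs))) * (xs[1] - xs[0])                    # reference Riemann sum
print(laplace, numeric)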


Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

    ∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
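One widely used quadrature rule of this kind is Gauss-Hermite quadrature, which fits the Gaussian-weighting case; the sketch below approximates E[cos(x)] for x ~ N(μ, σ²) (the test function and parameters are arbitrary choices).

import numpy as np

mu, sigma = 0.5, 1.3
nodes, weights = np.polynomial.hermite.hermgauss(20)     # 20 quadrature points for weight exp(-t^2)
x_k = mu + np.sqrt(2.0) * sigma * nodes                  # change of variables to N(mu, sigma^2)
approx = np.sum(weights * np.cos(x_k)) / np.sqrt(np.pi)
print(approx, np.cos(mu) * np.exp(-sigma**2 / 2))        # quadrature vs. exact value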

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the states with higher density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

    m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N (x^(i))^k

where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

    p(x) = Σ_{j=1}^m c_j N(x | μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

    p(x_n | y_{n:0}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^(i)) ≡ p̂(x_n | y_{n:0})

where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

    E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{n:0}) dx_n
             = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^(i)) dx_n
             = (1/N) Σ_{i=1}^N f(x_n^(i)) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

    W_n(x_n) = p(y_{n:0} | x_n) p(x_n) / q(x_n | y_{n:0})

We can then write

    E[f(x_n)] = ∫ f(x_n) [ p(x_n | y_{n:0}) / q(x_n | y_{n:0}) ] q(x_n | y_{n:0}) dx_n
              = ∫ f(x_n) [ W_n(x_n) / p(y_{n:0}) ] q(x_n | y_{n:0}) dx_n
              = ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ p(y_{n:0} | x_n) p(x_n) dx_n
              = ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n | y_{n:0}) dx_n
              = E_{q(x_n|y_{n:0})}[ W_n(x_n) f(x_n) ] / E_{q(x_n|y_{n:0})}[ W_n(x_n) ]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

    E[f(x_n)] = E_{q(x_n|y_{n:0})}[ W_n(x_n) f(x_n) ] / E_{q(x_n|y_{n:0})}[ W_n(x_n) ]

By drawing i.i.d. samples x_n^(i) from q(x_n | y_{n:0}), we can approximate

    E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) f(x_n^(i)) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) ]
             = Σ_{i=1}^N W̃(x_n^(i)) f(x_n^(i)) ≡ f̂(x)

where the normalized weights are defined as

    W̃(x_n^(i)) = W_n(x_n^(i)) / Σ_{i=1}^N W_n(x_n^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

    q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

    p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) p(y_n | x_n) p(x_n | x_{n-1}) / p(y_n | y_{n-1:0})

We can then create a recursive rule to update the weights:

    W_n^(i) = p(x_{n:0}^(i) | y_{n:0}) / q(x_{n:0}^(i) | y_{n:0})
            ∝ p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) p(x_{n-1:0}^(i) | y_{n-1:0}) / [ q(x_n^(i) | x_{n-1:0}^(i), y_{n:0}) q(x_{n-1:0}^(i) | y_{n-1:0}) ]
            = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^(i) | x_{n-1:0}^(i), y_{n:0}) = q(x_n^(i) | x_{n-1:0}^(i), y_n); then we can write

    W_n^(i) = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

    N_eff = N / E_{q(·|y_{n:0})}[ (W(x_{n:0}))² ] ≤ N

In practice this cannot be computed, so we approximate it by

    N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))²

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
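A small sketch of the N_eff approximation and the resampling decision; the weight vector and the threshold P = N/2 are illustrative choices.

import numpy as np

def effective_sample_size(norm_weights):
    """N_eff ≈ 1 / sum_i W_i^2 for normalized importance weights."""
    return 1.0 / np.sum(norm_weights ** 2)

# usage: resample only when N_eff drops below a threshold P (here N/2)
w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
N = len(w)
n_eff = effective_sample_size(w)
print(n_eff, "resample" if n_eff < N / 2 else "keep")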

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
    Set x_{n:0}^(i) = (x_{n-1:0}^(i), x_n^(i))
  end for
  for i = 1, ..., N do
    Calculate weights W_n^(i) = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
  end for
  Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))²
  if N_eff < P then
    Generate new x_n^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W̃_{n:0}^(i)
    Reset the weights W_n^(i) = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^(i) ~ p(x_0)
  Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: draw x_n^(i) ~ p(x_n | x_{n-1}^(i))
  end for
  Set x_{n:0}^(i) = (x_{n-1:0}^(i), x_n^(i))
  for i = 1, ..., N do
    Weight update: W_n^(i) = p(y_n | x_n^(i))
  end for
  for i = 1, ..., N do
    Normalize weights: W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
  end for
  Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W̃_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

    W_n = W_{n-1}^α · p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)

where 0 < α < 1 is an annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]  [Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]  [Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]  [Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69


Intractable Bayesian problems

Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)

p(x|y) =p(y|x)p(x)int

X p(y|x)p(x)dx

Marginalization Given the joint posterior the marginal posterior

p(x|y) =

intZ

p(x z|y)dz

Expectation

Ep(x|y)[f (x)] =

intX

f (x)p(x|y)dy

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69


Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error

E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\, dx_n

which aims to find the conditional mean

\hat{x}_n = E[x_n|y_{n:0}] = \int x_n\, p(x_n|y_{n:0})\, dx_n

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n|y_{n:0}).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69


Criteria for optimal filtering II

MMSE finds the mean.

MAP finds the mode.

Minimax finds the median.

Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution. (right) MAP is misleading for the multimodal distribution. (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)}\, dx\, dy

Minimum conditional KL divergence:

KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y)\, p(x)}\, dx\, dy

where the KL is a measure of divergence between distributions, with KL(p\|\hat{p}) \ge 0; the KL is 0 only when the distributions are the same.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69


Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: a lower bound of the maximum log-likelihood, which one aims to minimize:

F(Q, P) \equiv E_{Q(x)}[-\log P(x|y)]
        = E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]
        = KL(Q\|P) - H(Q)

This minimization can be done using the (EM) algorithm:

Q(x_{n+1}) \leftarrow \arg\max_Q F(Q, P)
x_{n+1} \leftarrow \arg\max_x F(Q, P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented; they are very related.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\, dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete, simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter: Forward equations I

We start by defining the messages

\alpha(z_n) = N(z_n | \mu_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1})\, p(z_n|z_{n-1})\, dz_{n-1}

Substituting the conditionals, we have

c_n N(z_n | \mu_n, V_n) = N(x_n | C z_n, \Sigma) \int N(z_{n-1} | \mu_{n-1}, V_{n-1})\, N(z_n | A z_{n-1}, \Gamma)\, dz_{n-1}
                        = N(x_n | C z_n, \Sigma)\, N(z_n | A \mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

P_{n-1} = A V_{n-1} A^T + \Gamma

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter: Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = N(x_n | C A \mu_{n-1}, C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0),    V_1 = (I - K_1 C) V_0
c_1 = N(x_1 | C \mu_0, C V_0 C^T + \Sigma),    K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

Interpretation: we are making predictions and doing corrections with the gain K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
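As a sanity check of the forward equations above, here is a minimal NumPy/SciPy sketch of the recursion (not the lecture's code; the names follow the slide's A, Gamma, C, Sigma notation, and the observations are assumed to be the rows of X):

import numpy as np
from scipy.stats import multivariate_normal

def kalman_forward(X, A, Gamma, C, Sigma, mu0, V0):
    # Forward (filtering) recursion for p(z_n|z_{n-1}) = N(A z_{n-1}, Gamma),
    # p(x_n|z_n) = N(C z_n, Sigma); observations x_n are the rows of X.
    mus, Vs, loglik = [], [], 0.0
    mu_pred, P = mu0, V0                       # the first step uses mu_0, V_0 directly
    for x in X:
        S = C @ P @ C.T + Sigma                # innovation covariance C P C^T + Sigma
        K = P @ C.T @ np.linalg.inv(S)         # Kalman gain K_n
        mu = mu_pred + K @ (x - C @ mu_pred)   # corrected mean mu_n
        V = (np.eye(len(mu0)) - K @ C) @ P     # corrected covariance V_n
        loglik += multivariate_normal.logpdf(x, mean=C @ mu_pred, cov=S)  # log c_n
        mus.append(mu); Vs.append(V)
        mu_pred, P = A @ mu, A @ V @ A.T + Gamma   # prediction: P_n = A V_n A^T + Gamma
    return np.array(mus), np.array(Vs), loglik     # log p(X) = sum_n log c_n

The first pass through the loop reproduces the initial conditions \mu_1, V_1, c_1, and each later pass does one predict/correct step with the gain K_n.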

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \left.\frac{d f(x)}{d x}\right|_{x=\hat{x}_n},    G_{n+1,n} = \left.\frac{d g(x)}{d x}\right|_{x=\hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
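A rough sketch of one EKF cycle under the common additive-noise simplification x_{n+1} = f(x_n) + w_n, y_n = g(x_n) + v_n (an assumption for illustration; the Jacobians are taken numerically rather than by hand):

import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    # Finite-difference Jacobian of func at x (one column per state dimension).
    fx = np.atleast_1d(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - fx) / eps
    return J

def ekf_step(mu, P, y, f, g, Q, R):
    # One predict/update cycle; Q and R are the process and measurement noise covariances.
    F = numerical_jacobian(f, mu)              # linearize f around the current estimate
    mu_pred = np.atleast_1d(f(mu))
    P_pred = F @ P @ F.T + Q
    G = numerical_jacobian(g, mu_pred)         # linearize g around the prediction
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (np.atleast_1d(y) - np.atleast_1d(g(mu_pred)))
    P_new = (np.eye(mu.size) - K @ G) @ P_pred
    return mu_new, P_new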

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It's a brute-force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),

\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] with high probability.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N}\, (\hat{f}_N - E[f]) \sim N(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
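A tiny illustration of the O(N^{-1/2}) behaviour, estimating E[f(x)] = E[x^2] = 1 for x ~ N(0, 1):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                              # E[f(x)] = 1 under N(0, 1)
for N in (100, 10_000, 1_000_000):
    x = rng.standard_normal(N)                    # i.i.d. draws from P(x)
    f_hat = f(x).mean()                           # (1/N) * sum_i f(x^(i))
    std_err = f(x).std(ddof=1) / np.sqrt(N)       # error shrinks as O(N^{-1/2})
    print(f"N={N:>9}  f_hat={f_hat:.4f}  std_err={std_err:.4f}")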


Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x)\, dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: An estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: An estimator is unbiased if its expected value is equal to the true value.

Efficiency: An estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: An estimator is robust if it is insensitive to the gross measurement errors and the uncertainties of the model.

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)})\, f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
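A minimal sketch of importance sampling with a known target (assumed here to be N(0, 1), with a wider Gaussian N(0, 2^2) as proposal); the self-normalized variant at the end is what one falls back on when p(x) is only known up to a constant:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000
f = lambda x: x ** 2                              # estimate E_p[x^2] = 1 under p = N(0, 1)

x = rng.normal(0.0, 2.0, size=N)                  # samples from the proposal q = N(0, 2^2)
W = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)         # importance weights W(x^(i)) = p(x^(i)) / q(x^(i))

f_hat = np.mean(W * f(x))                         # plain importance sampling estimate
W_tilde = W / W.sum()                             # normalized weights
f_hat_sn = np.sum(W_tilde * f(x))                 # self-normalized estimate: biased but consistent
print(f_hat, f_hat_sn)                            # both close to 1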

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\, f(x^{(i)})   with   \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \frac{(f(x)\, p(x))^2}{q(x)}\, dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variance
match the shape of |f(x)| p(x) so as to further reduce the true variance

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X. The sampling procedure is:

for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
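A small sketch of rejection sampling for an illustrative bimodal target under a wide Gaussian proposal; the constant C = 3 is an assumption chosen (conservatively) so that p(x) <= C q(x) holds for this particular pair:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p = lambda x: 0.5 * norm.pdf(x, -2, 1) + 0.5 * norm.pdf(x, 2, 1)   # target density
q = lambda x: norm.pdf(x, 0, 3)                                    # proposal density
C = 3.0                                                            # bound with p(x) <= C q(x)

def rejection_sample(n_samples):
    samples = []
    while len(samples) < n_samples:
        x = rng.normal(0, 3)          # draw x ~ q(x)
        u = rng.uniform()             # draw u ~ U(0, 1)
        if u <= p(x) / (C * q(x)):    # accept with probability p(x) / (C q(x))
            samples.append(x)         # otherwise reject and repeat
    return np.array(samples)

xs = rejection_sample(10_000)
print(xs.mean(), xs.std())            # mean near 0, std near sqrt(5) ~ 2.24 for this target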

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0)\, p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})
q(x_{n:0}) = q_0(x_0)\, q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\, \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
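The resampling step itself is short in NumPy (multinomial resampling; the particles are assumed to be stored in a NumPy array):

import numpy as np

def resample(particles, weights, rng=np.random.default_rng(0)):
    # Draw N indices with replacement, with probability proportional to the
    # normalized weights, then reset all weights to 1/N.
    N = len(weights)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)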

Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only taken if regarded necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: Starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: A sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: Four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
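A minimal Gibbs sampler for the two-dimensional case in the figure, using a correlated bivariate Gaussian as the target so that both conditionals are exact Gaussians (the target and rho = 0.8 are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # correlation of the zero-mean, unit-variance bivariate Gaussian target
M = 5_000          # number of Gibbs iterations

x1, x2 = 0.0, 0.0  # arbitrary starting state
samples = np.empty((M, 2))
for n in range(M):
    # Alternate the exact conditional draws, as in the figure:
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # x2 ~ p(x2 | x1)
    samples[n] = (x1, x2)

print(np.corrcoef(samples[1000:].T))  # sample correlation close to rho after burn-in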

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be taken with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\, dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x)\, dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
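A one-dimensional sketch of the Laplace approximation (N_x = 1), with the mode found on a grid and the Hessian of log f taken by finite differences; the test function is an unnormalized Gaussian, for which the approximation is exact:

import numpy as np

f = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)   # unnormalized Gaussian, true integral 0.5*sqrt(2*pi)

xs = np.linspace(-5, 5, 100_001)
x_hat = xs[np.argmax(f(xs))]                          # mode of f
h = 1e-4
d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2  # d^2/dx^2 log f at the mode

Z_laplace = np.sqrt(2 * np.pi) * f(x_hat) * np.abs(d2) ** -0.5
print(Z_laplace, 0.5 * np.sqrt(2 * np.pi))            # exact here because f is Gaussian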


Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula:

\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^m c_k\, f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
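For a Gaussian weighting function p(x) the quadrature points x_k and coefficients c_k are given by the Gauss-Hermite rule; a small sketch using NumPy's built-in nodes and weights (the change of variables to N(mu, sigma^2) is standard):

import numpy as np

def gauss_hermite_expectation(f, mu, sigma, deg=10):
    # Approximate E[f(x)] for x ~ N(mu, sigma^2) as a weighted sum sum_k c_k f(x_k).
    nodes, weights = np.polynomial.hermite.hermgauss(deg)
    xk = mu + np.sqrt(2.0) * sigma * nodes        # quadrature points for N(mu, sigma^2)
    ck = weights / np.sqrt(np.pi)                 # normalized quadrature coefficients
    return np.sum(ck * f(xk))

print(gauss_hermite_expectation(lambda x: x ** 2, mu=1.0, sigma=2.0))  # E[x^2] = mu^2 + sigma^2 = 5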

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be further used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^m c_j\, N(x; \hat{x}_j, \Sigma_j)

where the weighting coefficients c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
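A small sketch of why the mixture is tractable: the overall first and second order moments follow directly from the component moments (the 2-D mixture below is an arbitrary illustration):

import numpy as np

c = np.array([0.3, 0.7])                       # mixture weights, c_j > 0, summing to 1
m = np.array([[-2.0, 0.0], [1.0, 1.0]])        # component means (2-D example)
S = np.array([np.eye(2), 2 * np.eye(2)])       # component covariances

mean = np.einsum('j,jd->d', c, m)                                   # overall mean: sum_j c_j m_j
second = np.einsum('j,jde->de', c, S + np.einsum('jd,je->jde', m, m))  # E[x x^T] = sum_j c_j (S_j + m_j m_j^T)
cov = second - np.outer(mean, mean)                                 # overall covariance
print(mean, cov, sep="\n")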

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen?

We have seen up to now:

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{n:0})\, dx_n
          = \frac{1}{N} \sum_{i=1}^N \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n
          = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n)\, p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n)\, \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\, q(x_n|y_{n:0})\, dx_n
          = \int f(x_n)\, \frac{W_n(x_n)}{p(y_{n:0})}\, q(x_n|y_{n:0})\, dx_n
          = \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{n:0})\, dx_n}{\int p(y_{n:0}|x_n)\, p(x_n)\, dx_n}
          = \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{n:0})\, dx_n}{\int W_n(x_n)\, q(x_n|y_{n:0})\, dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\, \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})\, p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\, q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
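The weight update above is the core of the SIS filter; a sketch of it in log space (the density functions are assumed to be supplied by the user and vectorized over the particles), which avoids the numerical underflow that makes degeneracy even worse in practice:

import numpy as np

def sis_weight_update(W_prev, x_new, x_prev, y_new, log_lik, log_trans, log_prop):
    # W_n^(i) proportional to W_{n-1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1}^(i), y_n)
    logW = (np.log(W_prev) + log_lik(y_new, x_new)
            + log_trans(x_new, x_prev) - log_prop(x_new, x_prev, y_new))
    logW -= logW.max()            # subtract the max to avoid underflow
    W = np.exp(logW)
    return W / W.sum()            # return normalized weights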

Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
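The approximate N_eff is trivial to compute from the normalized weights:

import numpy as np

def effective_sample_size(W_tilde):
    # N_eff approximated as 1 / sum_i W_tilde_i^2, using the normalized weights
    return 1.0 / np.sum(np.asarray(W_tilde) ** 2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))   # 4.0: even weights, no degeneracy
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))   # ~1.06: one particle dominates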

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
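A compact NumPy sketch of the algorithm above (a generic template, not the lecture's code: the model enters only through user-supplied, vectorized functions, particles are (N, d) arrays, and resampling is triggered when N_eff drops below a fraction of N):

import numpy as np

def sis_particle_filter(ys, sample_prior, sample_prop, log_lik, log_trans, log_prop,
                        N=1000, ess_frac=0.5, rng=np.random.default_rng(0)):
    # sample_prior(N)        -> (N, d) initial particles x_0^(i)
    # sample_prop(x_prev, y) -> (N, d) proposals x_n^(i) ~ q(x_n | x_{n-1}^(i), y_n)
    # log_lik(y, x), log_trans(x, x_prev), log_prop(x, x_prev, y) -> length-N log densities
    x = sample_prior(N)
    W = np.full(N, 1.0 / N)
    means = []
    for y in ys:
        x_prev = x
        x = sample_prop(x_prev, y)                          # draw from the proposal
        logW = (np.log(W) + log_lik(y, x)
                + log_trans(x, x_prev) - log_prop(x, x_prev, y))
        logW -= logW.max()                                  # stabilize before exponentiating
        W = np.exp(logW)
        W /= W.sum()                                        # normalized weights
        means.append((W[:, None] * x).sum(axis=0))          # MMSE estimate at time n
        if 1.0 / np.sum(W ** 2) < ess_frac * N:             # N_eff below the threshold P
            idx = rng.choice(N, size=N, replace=True, p=W)  # resample with replacement
            x, W = x[idx], np.full(N, 1.0 / N)              # and reset the weights to 1/N
    return np.array(means)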

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using the transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
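A self-contained sketch of the bootstrap/SIR filter on a toy model (the model itself is an assumption for illustration, chosen linear-Gaussian so the output can be sanity-checked against a Kalman filter): sampling from the transition prior means the unnormalized weight is simply the likelihood p(y_n | x_n^(i)), and resampling is performed at every step.

import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_n = 0.95 x_{n-1} + w_n, y_n = x_n + v_n, w_n ~ N(0, 0.5^2), v_n ~ N(0, 1).
T, N = 100, 2000
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.95 * x_true[n - 1] + rng.normal(0, 0.5)
ys = x_true + rng.normal(0, 1.0, size=T)

x = rng.normal(0, 1, size=N)                            # x_0^(i) ~ p(x_0)
x_hat = np.zeros(T)
for n in range(T):
    x = 0.95 * x + rng.normal(0, 0.5, size=N)           # importance sampling from p(x_n | x_{n-1}^(i))
    logW = -0.5 * (ys[n] - x) ** 2                       # log p(y_n | x_n^(i)) up to a constant
    W = np.exp(logW - logW.max()); W /= W.sum()          # normalized weights
    x_hat[n] = np.sum(W * x)                             # MMSE estimate before resampling
    x = x[rng.choice(N, size=N, replace=True, p=W)]      # resample at every step

print(np.mean((x_hat - x_true) ** 2))                    # filtering MSE on this toy run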

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially; resampling is only taken whenever needed. The SIS filter is less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change

W_n = W_{n-1}^{\alpha}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 17: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)} \, dx \, dy

Minimum conditional KL divergence:

KL(p \| \hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y) \, p(x)} \, dx \, dy

where the KL is a measure of divergence between distributions: KL(p \| \hat{p}) \ge 0, and it is 0 only when the distributions are the same.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69


Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: it is a lower bound of the maximum log-likelihood, and the free energy is to be minimized:

F(Q, P) \equiv E_{Q(x)}[-\log P(x|y)]
       = E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]
       = KL(Q \| P) - H(Q)

This minimization can be done using the expectation-maximization (EM) algorithm, alternating

Q(x_{n+1}) \leftarrow \arg\max_Q F(Q, P)
x_{n+1} \leftarrow \arg\max_x F(Q, P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criterion to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented; they are very related.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

E[ ||x_n - \hat{x}_n||_2^2 \,|\, y_{n:0} ] = \int ||x_n - \hat{x}_n||_2^2 \, p(x_n | y_{n:0}) \, dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete, simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

\alpha(z_n) = N(z_n | \mu_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

c_n \alpha(z_n) = p(x_n | z_n) \int \alpha(z_{n-1}) \, p(z_n | z_{n-1}) \, dz_{n-1}

Substituting the conditionals, we have

c_n N(z_n | \mu_n, V_n) = N(x_n | C z_n, \Sigma) \int N(z_{n-1} | \mu_{n-1}, V_{n-1}) \, N(z_n | A z_{n-1}, \Gamma) \, dz_{n-1}
                        = N(x_n | C z_n, \Sigma) \, N(z_n | A \mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

P_{n-1} = A V_{n-1} A^T + \Gamma

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = N(x_n | C A \mu_{n-1}, C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0),    V_1 = (I - K_1 C) V_0
c_1 = N(x_1 | C \mu_0, C V_0 C^T + \Sigma),    K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

Interpretation: making predictions and doing corrections with K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
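A minimal sketch of this forward recursion in the slide's notation (latent state z, observation x, transition model A, Γ, observation model C, Σ). The matrices and data passed in are left to the user; everything here is an illustrative assumption rather than a reference implementation.

import numpy as np

def kalman_forward(X, A, Gamma, C, Sigma, mu0, V0):
    mu, V = mu0, V0
    means, covs = [], []
    for n, x in enumerate(X):
        if n == 0:
            P = V0                                   # initial step uses V0 directly
            pred_mean = mu0
        else:
            P = A @ V @ A.T + Gamma                  # P_{n-1} = A V_{n-1} A^T + Γ
            pred_mean = A @ mu
        S = C @ P @ C.T + Sigma
        K = P @ C.T @ np.linalg.inv(S)               # Kalman gain K_n
        mu = pred_mean + K @ (x - C @ pred_mean)     # prediction + correction with K_n
        V = (np.eye(len(mu)) - K @ C) @ P
        means.append(mu); covs.append(V)
    return means, covs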

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x | y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF) or numerical approximations (e.g., particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \left.\frac{d f(x)}{d x}\right|_{x = \hat{x}_n}        G_{n+1,n} = \left.\frac{d g(x)}{d x}\right|_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n | y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
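A sketch of the EKF linearization step: the Jacobians F and G are obtained by differentiating the dynamics f and measurement g around the current estimate. The particular f, g and finite-difference scheme below are illustrative assumptions standing in for analytic derivatives.

import numpy as np

def numerical_jacobian(fun, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.asarray(fun(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x); dx[i] = eps
        J[:, i] = (np.asarray(fun(x + dx)) - fx) / eps
    return J

f = lambda x: np.array([x[0] + 0.1 * np.sin(x[1]), 0.95 * x[1]])  # assumed dynamics
g = lambda x: np.array([x[0] ** 2])                                # assumed measurement
x_est = np.array([1.0, 0.5])
F = numerical_jacobian(f, x_est)   # F_{n+1,n} evaluated at the filtered estimate
G = numerical_jacobian(g, x_est)   # G_{n+1,n} evaluated at the predicted estimate
# F and G are then plugged into the conventional Kalman filter equations.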

Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It's a brute-force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),

\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] with high probability.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N} (\hat{f}_N - E[f]) \sim N(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
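A minimal sketch of the plain Monte Carlo estimate above: E[f(x)] under P(x) is approximated by an average over i.i.d. draws. The choices P = standard normal and f(x) = x^2 (so E[f] = 1) are arbitrary and only for illustration.

import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(n_samples):
    x = rng.standard_normal(n_samples)      # i.i.d. samples from P(x)
    return np.mean(x ** 2)                  # (1/N) sum f(x^(i))

for n in (100, 10_000, 1_000_000):
    print(n, mc_estimate(n))                # error shrinks like O(N^{-1/2})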


Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., E[f(x)] = \int f(x) \, dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and the uncertainties of the model.

Minimal variance.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69


Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
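A sketch of importance sampling as described above, using the self-normalized weights treated on the next slide. The target p (a standard normal known only up to a constant), the wider Student-t proposal q and the integrand f(x) = x^2 are all illustrative choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000
x = stats.t.rvs(df=3, scale=2.0, size=N, random_state=rng)   # samples from q(x)

p_unnorm = np.exp(-0.5 * x ** 2)                 # target known up to a constant
q = stats.t.pdf(x, df=3, scale=2.0)              # proposal density
W = p_unnorm / q                                 # importance weights W(x^(i))
W_norm = W / W.sum()                             # normalized weights

f = x ** 2
estimate = np.sum(W_norm * f)                    # ≈ E_p[f(x)] = 1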

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)})    with    \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right]
             = \frac{1}{N} \int \left(\frac{f(x) p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when encountering the difficulty of sampling from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure is:

for n = 1 to N do
    Sample x ~ q(x)
    Sample u ~ U(0, 1)
    if u > p(x) / (C q(x)) then
        Reject x and repeat the sampling
    end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because p(x) < C q(x) may be violated (low rejection rate);
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood p(y|x) ≤ C, and C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is p(x|y) / (C' q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
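A sketch of the rejection sampler above. The target p is a Beta(2, 5) density, the proposal q is uniform on [0, 1], and C = 2.5 upper-bounds p/q; these particular densities and the constant are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = lambda x: stats.beta.pdf(x, 2, 5)
q_sample = lambda: rng.uniform(0.0, 1.0)
C = 2.5                                   # satisfies p(x) <= C * q(x) = C on [0, 1]

def rejection_sample(n):
    out = []
    while len(out) < n:
        x = q_sample()
        u = rng.uniform(0.0, 1.0)
        if u <= p(x) / C:                 # accept with probability p(x) / (C q(x))
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)        # exact draws from p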

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially: this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) \, p(x_1 | x_0) \cdots p(x_n | x_0, ..., x_{n-1})
q(x_{n:0}) = q_0(x_0) \, q_1(x_1 | x_0) \cdots q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \, \frac{p(x_n | x_{n-1:0})}{q_n(x_n | x_{n-1:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
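A sketch of the resampling step above: multinomial resampling with replacement, where each particle is selected with probability equal to its normalized weight. The particle values and weights below are arbitrary illustrative numbers.

import numpy as np

rng = np.random.default_rng(0)

def resample(particles, weights):
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                    # normalize W(x^(i))
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=weights)
    new_particles = np.asarray(particles)[idx]
    new_weights = np.full(len(particles), 1.0 / len(particles))  # reset to 1/N
    return new_particles, new_weights

particles = np.array([0.1, 0.4, 0.9, 1.3])
weights = np.array([0.05, 0.6, 0.3, 0.05])
particles, weights = resample(particles, weights)        # heavy particles get duplicated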

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
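A sketch of a two-component Gibbs sampler, alternating draws from the full conditionals. The target here is a bivariate Gaussian with correlation rho, chosen only because its conditionals are available in closed form; it is not a model from the lecture.

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
M = 5_000

x1, x2 = 0.0, 0.0
samples = np.zeros((M, 2))
for n in range(M):
    # draw x1 from p(x1 | x2) = N(rho * x2, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))
    # draw x2 from p(x2 | x1) = N(rho * x1, 1 - rho^2)
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))
    samples[n] = (x1, x2)

print(np.corrcoef(samples[1000:].T))   # empirical correlation ≈ rho after burn-in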

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) \, dx by fitting a Gaussian at the maximum \hat{x} of f(x) and further computing the volume:

\int f(x) \, dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
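A one-dimensional sketch of the Laplace approximation above: find the mode of f, estimate the Hessian of log f there by finite differences, and apply the Gaussian volume formula. The integrand f is an unnormalized Gaussian chosen for illustration so the exact answer is known.

import numpy as np
from scipy import optimize

f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2 / 0.3)          # assumed unimodal integrand
neg_log_f = lambda x: 0.5 * (x - 1.0) ** 2 / 0.3

res = optimize.minimize_scalar(neg_log_f)                   # mode of f
x_hat = res.x

eps = 1e-4                                                  # finite-difference Hessian of log f
hess = (np.log(f(x_hat + eps)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - eps))) / eps ** 2

laplace_integral = (2 * np.pi) ** 0.5 * f(x_hat) * (-hess) ** -0.5
# exact value is sqrt(2 * pi * 0.3) ≈ 1.37, matched here since f is Gaussian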


Iterative Quadrature

A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) \, dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^m c_j \, N(x | \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) \, dx_n
            = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) \, dx_n
            = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) \, p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) \, dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) \, dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) \, dx_n}{\int p(y_{n:0} | x_n) p(x_n) \, dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) \, dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) \, dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})}
            = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \, \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
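A small sketch of the degeneracy check above: compute the effective sample size from the normalized weights and trigger resampling when it drops below a threshold. The weights and the threshold N/2 are illustrative choices for P.

import numpy as np

def effective_sample_size(normalized_weights):
    w = np.asarray(normalized_weights, dtype=float)
    return 1.0 / np.sum(w ** 2)                    # N_eff ≈ 1 / sum_i (W~^(i))^2

weights = np.array([0.7, 0.1, 0.1, 0.05, 0.05])    # skewed normalized weights
N = len(weights)
threshold = N / 2                                  # an assumed choice for the threshold P
if effective_sample_size(weights) < threshold:
    print("resample")                              # degeneracy detected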

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set x̂_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x̂_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x̂_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x̂_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
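A sketch of the bootstrap (SIR) particle filter above for a hypothetical 1-D nonlinear model: x_{n+1} = 0.5 x_n + 25 x_n / (1 + x_n^2) + w_n, y_n = x_n^2 / 20 + v_n. The model, noise levels, number of particles and observations are all illustrative assumptions, not the lecture's example.

import numpy as np

rng = np.random.default_rng(0)
N = 1000

def transition(x):
    return 0.5 * x + 25 * x / (1 + x ** 2) + rng.normal(0.0, 1.0, size=x.shape)

def likelihood(y, x, obs_std=1.0):
    return np.exp(-0.5 * ((y - x ** 2 / 20.0) / obs_std) ** 2)

def bootstrap_filter(observations):
    particles = rng.normal(0.0, 2.0, size=N)            # x_0^(i) ~ p(x_0)
    estimates = []
    for y in observations:
        particles = transition(particles)               # sample from p(x_n | x_{n-1})
        weights = likelihood(y, particles)              # W_n^(i) = p(y_n | x_n^(i))
        weights /= weights.sum()                        # normalize
        estimates.append(np.sum(weights * particles))   # MMSE estimate of x_n
        idx = rng.choice(N, size=N, replace=True, p=weights)
        particles = particles[idx]                      # resample; weights reset to 1/N
    return np.array(estimates)

ys = np.array([0.1, 1.2, 4.5, 2.0, 0.3])                # synthetic observations
print(bootstrap_filter(ys))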

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Embedded video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Embedded videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Embedded videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Embedded videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 18: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world

The nonlinear filtering problem consists in finding p(x_n | y_{0:n})

The number of variables is infinite, but not all of them are of equal importance

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF

F_{n+1,n} = \frac{d f(x)}{d x} \Big|_{x = \hat{x}_n},    G_{n+1,n} = \frac{d g(x)}{d x} \Big|_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed

Because the EKF always approximates the posterior p(x_n | y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
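As a concrete illustration of the linearization step, the sketch below builds the Jacobians by finite differences around the current estimate; the helper function, the toy model and the step size are illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x (stand-in for the EKF linearization)."""
    fx = np.atleast_1d(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - fx) / eps
    return J

# Toy nonlinear model, linearized around the current state estimate
f = lambda x: np.array([np.sin(x[0]) + 0.5 * x[0]])   # transition function (noise-free part)
g = lambda x: np.array([x[0] ** 2])                   # measurement function
x_hat = np.array([0.3])
F = numerical_jacobian(f, x_hat)   # plays the role of F_{n+1,n}
G = numerical_jacobian(g, x_hat)   # plays the role of G_{n+1,n}
```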

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It's a brute-force technique which, provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), approximates

\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \bar{f}_N

for which E[\bar{f}_N] = E[f] and Var[\bar{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \bar{f}_N(x) converges to E[f(x)] with high probability

The convergence rate is assessed by the Central Limit Theorem

\sqrt{N} (\bar{f}_N - E[f]) \sim N(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2})

An important property is that the estimation accuracy is independent of the dimensionality of the state space

The variance of the estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
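To make the O(N^{-1/2}) behaviour concrete, the small sketch below estimates E[f(x)] for f(x) = x^2 under a standard Gaussian by plain Monte Carlo and prints how the error shrinks as N grows; the test function and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                      # E[f] = 1 for x ~ N(0, 1)

for N in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(N)            # i.i.d. samples x^(1), ..., x^(N) from P(x)
    f_bar = f(x).mean()                   # (1/N) sum_i f(x^(i))
    print(N, abs(f_bar - 1.0))            # error decays roughly like N**-0.5
```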


Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \bar{f}_N

There are two fundamental problems

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x) \, dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators

Robustness: an estimator is robust if it is insensitive to the gross measurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency

This is important for high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample

\int f(x) \, p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant

To ensure that the importance weights are normalized

\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)})    with    \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \left(\frac{f(x) p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to

    match the shape of p(x) so as to approximate the true variance
    match the shape of |f(x)| p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
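A short sketch of the self-normalized importance sampling estimator described above, for a target p known only up to a constant and a broad Gaussian proposal q; the particular target, proposal and test function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

log_p = lambda x: -x ** 4 / 4.0            # unnormalized target p(x) proportional to exp(-x^4/4)
f = lambda x: x ** 2                       # function whose expectation we want

x = rng.normal(0.0, 2.0, size=N)           # samples from the proposal q = N(0, 2^2)
log_q = -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))
W = np.exp(log_p(x) - log_q)               # unnormalized importance weights p/q
W_tilde = W / W.sum()                      # normalized weights
f_hat = np.sum(W_tilde * f(x))             # self-normalized IS estimate of E_p[f]
print(f_hat)
```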

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance)

It can be used when it is difficult to sample from the true probability distribution directly

The proposal distribution q(x) should have a heavy tail so as to be insensitive to outliers

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likely dominated by a few samples with large importance weights

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X; the sampling proceeds as

for n = 1 to N do
    Sample u ~ U(0, 1)
    Sample x ~ q(x)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for

Figure Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C

The choice of C is critical

    if C is too small, the samples are not reliable because of the low rejection rate
    if C is too large, the sampling is inefficient since the acceptance rate will be low

If the prior p(x) is used as q(x), the likelihood p(y|x) ≤ C, and C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
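A minimal sketch of the rejection sampler above for a concrete case: the target is the half-normal density on [0, ∞), the proposal is Exp(1), and C = 1.5 satisfies p(x) ≤ C q(x); the densities and constant are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

p = lambda x: np.sqrt(2 / np.pi) * np.exp(-x ** 2 / 2)   # half-normal target, x >= 0
q = lambda x: np.exp(-x)                                  # Exp(1) proposal density
C = 1.5                                                   # p(x) <= C q(x) for all x >= 0

samples = []
while len(samples) < 10_000:
    x = rng.exponential(1.0)            # draw from q
    u = rng.uniform()                   # draw from U(0, 1)
    if u <= p(x) / (C * q(x)):          # accept with probability p/(C q)
        samples.append(x)

print(np.mean(samples))                 # should be close to sqrt(2/pi), about 0.80
```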

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling

If the proposal distribution is chosen in a factorized form

q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{0:n}) = p(x_0) p(x_1 | x_0) \cdots p(x_n | x_0, ..., x_{n-1})
q(x_{0:n}) = q_0(x_0) q_1(x_1 | x_0) \cdots q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1}) \frac{p(x_n | x_{0:n-1})}{q_n(x_n | x_{0:n-1})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm only a few (or one) of the W(x^{(i)}) will be nonzero

We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
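The resampling step itself is just a weighted draw with replacement; a minimal multinomial-resampling sketch is shown below (systematic and stratified variants are common refinements). The function and variable names are illustrative.

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Resample N particles with replacement, with probability proportional to weights."""
    N = len(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                 # normalized weights W_tilde^(i)
    idx = rng.choice(N, size=N, replace=True, p=w)  # resampled indices
    return particles[idx], np.full(N, 1.0 / N)      # equal weights after resampling

# Usage sketch
rng = np.random.default_rng(3)
particles = rng.normal(size=1000)
weights = np.exp(-0.5 * (particles - 1.0) ** 2)     # e.g. likelihood of an observation near 1
particles, weights = multinomial_resample(particles, weights, rng)
```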

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded as necessary

    Deterministic resampling is taken at every k-th time step (usually k = 1)
    Dynamic resampling is taken only when the variance of the importance weights is over a threshold

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N)

Resampling is important because

    if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power
    when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models

Draw a sample from x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
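As a concrete instance of alternating conditional sampling, the sketch below runs a Gibbs sampler on a bivariate Gaussian with correlation rho, where both conditionals are available in closed form; the target distribution is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                          # correlation of the bivariate Gaussian target
M = 5000
x = np.zeros((M, 2))               # x[n] = (x1, x2) after sweep n

for n in range(1, M):
    # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x1 = rng.normal(rho * x[n - 1, 1], np.sqrt(1 - rho ** 2))
    # x2 | x1 ~ N(rho * x1, 1 - rho^2), using the freshly drawn x1
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    x[n] = (x1, x2)

print(np.corrcoef(x[500:, 0], x[500:, 1])[0, 1])   # close to rho after burn-in
```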

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space

Stratified sampling works very well and is efficient in a not-too-high-dimensional space

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior

This is good since the gradient direction might indicate the way to find the state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method

The Laplace approximation method approximates the integral of a function \int f(x) \, dx by fitting a Gaussian at the maximum \hat{x} of f(x) and further computing the volume

\int f(x) \, dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
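A one-dimensional sketch of the Laplace approximation described above: locate the mode of f on a grid, estimate the second derivative of log f there by finite differences, and apply the Gaussian volume formula; the integrand is an arbitrary illustrative choice.

```python
import numpy as np

f = lambda x: np.exp(-(x - 1.0) ** 2 - 0.1 * x ** 4)    # some positive integrand
log_f = lambda x: -(x - 1.0) ** 2 - 0.1 * x ** 4

# Mode of f (maximum of log f), found on a dense grid for simplicity
xs = np.linspace(-5, 5, 200_001)
x_hat = xs[np.argmax(log_f(xs))]

# Second derivative of log f at the mode, by central differences
eps = 1e-4
h = (log_f(x_hat + eps) - 2 * log_f(x_hat) + log_f(x_hat - eps)) / eps ** 2

# Laplace approximation: integral of f is roughly (2 pi)^(1/2) f(x_hat) (-h)^(-1/2)
approx = np.sqrt(2 * np.pi) * f(x_hat) * (-h) ** -0.5
print(approx, np.sum(f(xs)) * (xs[1] - xs[0]))   # Laplace value vs. a Riemann-sum reference
```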


Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics

A finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula

\int_a^b f(x) \, p(x) \, dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point

The values x_k are determined by the weighting function p(x) in the interval [a, b]

This method can produce a good approximation if the nonlinear function is smooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
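For a Gaussian weighting function this is the familiar Gauss-Hermite rule; the sketch below approximates E[f(x)] for x ~ N(0, 1) with ten quadrature points (the change of variables x = sqrt(2) t accounts for the Hermite weight exp(-t^2)). The test function is an illustrative choice.

```python
import numpy as np

f = lambda x: np.cos(x)                      # E[cos(x)] = exp(-1/2) for x ~ N(0, 1)

t, w = np.polynomial.hermite.hermgauss(10)   # nodes/weights for the weight exp(-t^2)
approx = np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)
print(approx, np.exp(-0.5))                  # quadrature vs. exact value
```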

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n})

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be further used to approximate the posterior density

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give a greater resolution to the states with high density

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher order moments

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^m c_j N(x | \mu_j, \Sigma_j)

where the weighting coefficients c_j > 0 and \sum_{j=1}^m c_j = 1

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{0:n})

By this approximation we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{0:n}) \, dx_n
          = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) \, dx_n
          = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \bar{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define

W_n(x_n) = \frac{p(y_{0:n} | x_n) p(x_n)}{q(x_n | y_{0:n})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} q(x_n | y_{0:n}) \, dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{0:n})} q(x_n | y_{0:n}) \, dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int p(y_{0:n} | x_n) p(x_n) \, dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int W_n(x_n) q(x_n | y_{0:n}) \, dx_n}
          = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})}
          \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) \, q(x_{0:n-1}^{(i)} | y_{0:n-1})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n})

Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed

N_{eff} can also be used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
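The effective-sample-size check is a one-liner on the normalized weights; a small sketch:

```python
import numpy as np

def effective_sample_size(w_tilde):
    """N_eff is approximately 1 / sum_i W_tilde_i^2 for normalized importance weights."""
    w_tilde = np.asarray(w_tilde, dtype=float)
    return 1.0 / np.sum(w_tilde ** 2)

# Uniform weights give N_eff = N; one dominant weight drives N_eff towards 1
print(effective_sample_size(np.full(100, 1 / 100)))        # 100.0
print(effective_sample_size([0.97] + [0.03 / 99] * 99))    # close to 1
```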

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2
    if N_{eff} < P then
        Generate new particles x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step as in SIR sampling

Resampling does not really prevent the weight degeneracy problem: it just saves further calculation time by discarding the particles associated with insignificant weights

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set \hat{x}_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
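The algorithm above translates almost line by line into code. Below is a minimal bootstrap (SIR) filter for an illustrative nonlinear scalar model (the classic benchmark x_{n+1} = x_n/2 + 25 x_n/(1 + x_n^2) + 8 cos(1.2 n) + w_n, y_n = x_n^2/20 + v_n); the model, noise levels and particle count are assumptions for the example, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 50, 1000                               # time steps, particles

def f(x, n):                                  # transition mean
    return 0.5 * x + 25 * x / (1 + x ** 2) + 8 * np.cos(1.2 * n)

def h(x):                                     # measurement mean
    return x ** 2 / 20.0

# Simulate a trajectory and observations (process noise w_n, measurement noise v_n)
x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = f(x_true[n - 1], n) + rng.normal(0, np.sqrt(10.0))
    y[n] = h(x_true[n]) + rng.normal(0, 1.0)

# Bootstrap filter: proposal = transition prior, resampling at every step
particles = rng.normal(0, 2.0, size=N)        # x_0^(i) ~ p(x_0)
x_est = np.zeros(T)
for n in range(1, T):
    particles = f(particles, n) + rng.normal(0, np.sqrt(10.0), size=N)  # importance sampling
    W = np.exp(-0.5 * (y[n] - h(particles)) ** 2) + 1e-300              # W_n^(i) = p(y_n | x_n^(i))
    W /= W.sum()                                                        # normalize
    x_est[n] = np.sum(W * particles)          # posterior-mean estimate (before resampling)
    idx = rng.choice(N, size=N, replace=True, p=W)                      # resample
    particles = particles[idx]

print(np.sqrt(np.mean((x_est[1:] - x_true[1:]) ** 2)))                  # filter RMSE
```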

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is less computationally expensive

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Media file: anim.mp4 (video/mp4)]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Media files: dancemv.mp4, leafmv.mp4 (video/mp4)]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Media files: hand.mp4, iccv98.mp4 (video/mp4)]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Media files: imp.mp4, drawing.mp4 (video/mp4)]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
    • Bootstrap/SIR filter
            • Condensation
              • Conclusions
Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N}\sum_{i=1}^{N} W\big(x^{(i)}\big)\,f\big(x^{(i)}\big)

where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized,

\hat{f} = \sum_{i=1}^{N} \tilde{W}\big(x^{(i)}\big)\,f\big(x^{(i)}\big) \quad \text{with} \quad \tilde{W}\big(x^{(i)}\big) = \frac{W\big(x^{(i)}\big)}{\sum_{i=1}^{N} W\big(x^{(i)}\big)}

The variance of the estimate is given by

\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}\big[f(x)W(x)\big] = \frac{1}{N}\mathrm{Var}\Big[f(x)\,\frac{p(x)}{q(x)}\Big] = \frac{1}{N}\int \frac{\big(f(x)\,p(x)\big)^2}{q(x)}\,dx - \frac{\big(E[f(x)]\big)^2}{N}

The variance can be reduced when q(x) is chosen to

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
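A minimal sketch of this self-normalized importance sampling estimator (not from the slides; NumPy/SciPy, with an illustrative target and proposal): the target p is a standard Gaussian known only up to a constant, the proposal q is a heavier-tailed Student-t as recommended later in the remarks, and the weights are normalized before averaging.

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

def log_p_unnorm(x):
    # target: standard Gaussian known only up to a constant
    return -0.5 * x**2

def importance_estimate(f, n=100_000, df=3):
    x = rng.standard_t(df, size=n)                 # heavy-tailed Student-t proposal
    log_w = log_p_unnorm(x) - t.logpdf(x, df)      # unnormalized log-weights
    w = np.exp(log_w - log_w.max())                # stabilize, then self-normalize
    w /= w.sum()
    return np.sum(w * f(x))                        # biased but consistent estimate

print(importance_estimate(lambda x: x**2))         # close to E[x^2] = 1 under N(0,1)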

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; many samples are then almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < Cq(x) for every x \in X. The sampling proceeds as follows:

for n = 1 to N do
  Sample u \sim U(0, 1)
  Sample x \sim q(x)
  if u > p(x) / (Cq(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C'q(x)

and the acceptance rate for sample x is \frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
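A small sketch of rejection sampling (not from the slides; NumPy/SciPy, with illustrative densities): the target p is a Beta(2, 5) density, the proposal q is uniform on [0, 1], and C is chosen as an upper bound on p(x)/q(x).

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
C = 2.5                                   # upper bound: the Beta(2,5) pdf peaks at ~2.46

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0.0, 1.0)         # proposal q(x) = U(0, 1)
        u = rng.uniform(0.0, 1.0)
        if u <= beta.pdf(x, 2, 5) / C:    # accept with probability p(x) / (C q(x))
            samples.append(x)             # accepted draws are exact samples from p
    return np.array(samples)

draws = rejection_sample(10_000)
print(draws.mean())                       # Beta(2,5) mean is 2/7 ≈ 0.286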

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(Cq(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0,\cdots,x_{n-1})

q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0,\cdots,x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time; this is the weight degeneracy problem: after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples \{x^{(i)}\}_{i=1}^{N} from q(x)
for i = 1, \cdots, N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, \cdots, N do
  Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
end for
Resample with replacement N times from the discrete set \{x^{(i)}\}_{i=1}^{N}, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
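A compact sketch of sampling-importance resampling (not from the slides; NumPy, with an illustrative unnormalized target and Gaussian proposal): draw from the proposal, weight by the target, then resample with replacement in proportion to the normalized weights.

import numpy as np

rng = np.random.default_rng(0)

def sir(target_unnorm, proposal_sample, proposal_pdf, n):
    x = proposal_sample(n)
    w = target_unnorm(x) / proposal_pdf(x)          # importance weights
    w /= w.sum()                                    # normalize
    idx = rng.choice(n, size=n, replace=True, p=w)  # multinomial resampling
    return x[idx]                                   # equally-weighted particles

# Illustrative example: target ∝ exp(-(x-2)^2 / 2), proposal N(0, 3^2)
target = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)
prop_sample = lambda n: rng.normal(0.0, 3.0, size=n)
prop_pdf = lambda x: np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

particles = sir(target, prop_sample, prop_pdf, 20_000)
print(particles.mean())                             # close to 2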

Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights \{x^{(i)}, W^{(i)}\} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
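A minimal sketch of a two-dimensional Gibbs sampler (not from the slides; NumPy, with an illustrative target): for a bivariate standard Gaussian with correlation rho, each full conditional is itself Gaussian, so the alternating conditional draws described above can be written directly.

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                                  # correlation of the illustrative target

def gibbs_bivariate_gaussian(m, x0=(0.0, 0.0)):
    x1, x2 = x0
    chain = np.empty((m, 2))
    cond_std = np.sqrt(1.0 - rho**2)       # std of each full conditional
    for n in range(m):
        # p(x1 | x2) = N(rho * x2, 1 - rho^2)
        x1 = rng.normal(rho * x2, cond_std)
        # p(x2 | x1) = N(rho * x1, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_std)
        chain[n] = (x1, x2)
    return chain

chain = gibbs_bivariate_gaussian(50_000)
print(np.corrcoef(chain[1000:].T)[0, 1])   # close to rho after burn-in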

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be taken with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\,dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x)\,dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\,\big|{-\nabla\nabla \log f(\hat{x})}\big|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
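A small sketch of the Laplace approximation to a one-dimensional integral (not from the slides; NumPy/SciPy, and the integrand below is an illustrative assumption): find the mode numerically, take the second derivative of log f there, and apply the formula above with N_x = 1.

import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative unimodal, unnormalized integrand (an assumption, not from the slides)
f = lambda x: np.exp(-0.5 * x**2) / (1.0 + x**2)
log_f = lambda x: -0.5 * x**2 - np.log(1.0 + x**2)

x_hat = minimize_scalar(lambda x: -log_f(x)).x      # 1. locate the mode
h = 1e-4                                            # 2. curvature of log f at the mode
d2 = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# 3. Laplace: (2*pi)^{Nx/2} * f(x_hat) * | -d^2 log f(x_hat) |^{-1/2}
approx = np.sqrt(2 * np.pi) * f(x_hat) * abs(-d2) ** -0.5
print(approx)   # ≈ 1.45, versus the true integral ≈ 1.64: reasonable for a unimodal f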

Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
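As an illustration of such a quadrature rule (not from the slides; NumPy's Gauss-Hermite nodes, which target a Gaussian weighting function), the expectation of f under N(0,1) is approximated by a short weighted sum.

import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, m=10):
    # nodes/weights for ∫ exp(-t^2) g(t) dt; change of variables x = sqrt(2) t
    t, w = hermgauss(m)
    return np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

# E[x^2] under N(0,1) is 1; already exact with a handful of quadrature points
print(gauss_hermite_expectation(lambda x: x**2, m=5))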

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(\bar{x}_j, \Sigma_j)

where the weighting coefficients c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves along time according to the state equation, with its evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n
          = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n
          = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n}
          = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1:0})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
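A minimal sketch of this recursive weight update in log-space (not from the slides; the handles log_lik, log_trans and log_prop are assumptions standing in for the model's p(y_n|x_n), p(x_n|x_{n-1}) and the chosen proposal q).

import numpy as np

def update_log_weights(log_w_prev, x_new, x_prev, y_new,
                       log_lik, log_trans, log_prop):
    # W_n ∝ W_{n-1} * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
    log_w = (log_w_prev
             + log_lik(y_new, x_new)
             + log_trans(x_new, x_prev)
             - log_prop(x_new, x_prev, y_new))
    # normalize in log-space for numerical stability
    log_w -= np.logaddexp.reduce(log_w)
    return log_w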

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
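A one-line helper for the approximate effective sample size used as the resampling trigger (not from the slides; assumes the weights w have already been normalized).

import numpy as np

def effective_sample_size(w):
    # w: normalized importance weights, summing to 1
    return 1.0 / np.sum(w**2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(effective_sample_size(w))   # ≈ 1.9 out of 5 particles: heavily skewed, so resample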

SIS particle filter with resampling

for n = 0, \cdots, T do
  for i = 1, \cdots, N do
    Draw samples x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
  end for
  for i = 1, \cdots, N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
  end for
  for i = 1, \cdots, N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from \{x_{n:0}^{(i)}\}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
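A compact sketch of the loop above (not from the slides; NumPy, with the model supplied through the hypothetical callables init_sample, propose, log_lik, log_trans and log_prop), resampling only when the effective sample size falls below the threshold.

import numpy as np

rng = np.random.default_rng(0)

def sis_filter_with_resampling(y_seq, n_particles, init_sample,
                               propose, log_lik, log_trans, log_prop,
                               threshold=None):
    N = n_particles
    threshold = N / 2 if threshold is None else threshold
    x = init_sample(N)                        # particles at time 0
    log_w = np.full(N, -np.log(N))            # uniform initial log-weights
    means = []
    for y in y_seq:
        x_prev = x
        x = propose(x_prev, y)                # draw x_n^(i) ~ q(. | x_{n-1}^(i), y_n)
        log_w = (log_w + log_lik(y, x) + log_trans(x, x_prev)
                 - log_prop(x, x_prev, y))    # recursive weight update
        log_w -= np.logaddexp.reduce(log_w)   # normalize
        w = np.exp(log_w)
        means.append(np.sum(w * x))           # filtered estimate of E[x_n | y_{n:0}]
        if 1.0 / np.sum(w**2) < threshold:    # effective sample size test
            idx = rng.choice(N, size=N, replace=True, p=w)
            x = x[idx]
            log_w = np.full(N, -np.log(N))    # reset weights to 1/N
    return np.array(means)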

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, \cdots, N do
  Sample x_0^{(i)} \sim p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, \cdots, T do
  for i = 1, \cdots, N do
    Importance sampling: x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})
    Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
  end for
  for i = 1, \cdots, N do
    Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
  end for
  for i = 1, \cdots, N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set \{x_n^{(i)}\} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
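A runnable sketch of this bootstrap filter (not from the slides; NumPy) on a standard one-dimensional benchmark-style model chosen only for illustration: x_n = 0.5 x_{n-1} + 25 x_{n-1}/(1+x_{n-1}^2) + 8 cos(1.2 n) + w_n and y_n = x_n^2/20 + v_n with Gaussian noises. Because the transition prior is the proposal, the weights are just the likelihoods.

import numpy as np

rng = np.random.default_rng(1)
T, N = 50, 1000
q_std, r_std = np.sqrt(10.0), 1.0            # process / measurement noise std (assumed)

def transition(x, n):
    return 0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * np.cos(1.2 * n)

# simulate a ground-truth trajectory and observations
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = transition(x_true[n - 1], n) + q_std * rng.standard_normal()
    y[n] = x_true[n]**2 / 20.0 + r_std * rng.standard_normal()

# bootstrap (SIR) particle filter
particles = q_std * rng.standard_normal(N)   # x_0^(i) ~ p(x_0), here N(0, 10) by assumption
est = np.zeros(T)
for n in range(1, T):
    # importance sampling from the transition prior
    particles = transition(particles, n) + q_std * rng.standard_normal(N)
    # weight update: W_n^(i) = p(y_n | x_n^(i))
    log_w = -0.5 * ((y[n] - particles**2 / 20.0) / r_std) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    est[n] = np.sum(w * particles)           # filtered mean estimate
    # resampling with replacement according to the normalized weights
    particles = particles[rng.choice(N, size=N, replace=True, p=w)]

print(np.sqrt(np.mean((est - x_true) ** 2)))  # RMSE of the filtered estimates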

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4] [Video: leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4] [Video: iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4] [Video: drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t−1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1|x_0) ··· p(x_n|x_0, ···, x_{n−1})

q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) ··· q_n(x_n|x_0, ···, x_{n−1})

The weights can be recursively calculated as

W_n(x_{n:0}) = p(x_{n:0}) / q(x_{n:0}) = W_{n−1}(x_{n−1:0}) · p(x_n|x_{n−1:0}) / q_n(x_n|x_{n−1:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
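A minimal sketch of this recursive weight update for an assumed toy setup: a Gaussian random-walk target p(x_n|x_{n−1}) = N(x_{n−1}, 1) and a deliberately broader random-walk proposal q_n(x_n|x_{n−1}) = N(x_{n−1}, 2^2), with no observations involved; it only illustrates how the weights are propagated (and how uneven they become, which is the subject of the next slides).

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N, T = 1000, 20
x = rng.normal(0.0, 1.0, N)        # x_0^(i) ~ q(x_0) = p(x_0), so W_0 = 1
W = np.ones(N)
for n in range(1, T):
    x_new = x + rng.normal(0.0, 2.0, N)                        # x_n^(i) ~ q_n(x_n | x_{n-1}^(i))
    W *= gauss_pdf(x_new, x, 1.0) / gauss_pdf(x_new, x, 2.0)   # W_n = W_{n-1} * p / q_n
    x = x_new
W_tilde = W / W.sum()
print("largest normalized weight after", T, "steps:", W_tilde.max())
```

With a mismatched proposal like this, the largest normalized weight typically ends up close to 1, i.e. almost all particles carry negligible weight.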

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^(i)) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^(i)}_{i=1}^N from q(x)
for i = 1, ···, N do
  W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, ···, N do
  Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
end for
Resample with replacement N times from the discrete set {x^(i)}_{i=1}^N, where the probability of resampling from each x^(i) is proportional to W̃(x^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
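A minimal sketch of the resampling step on its own; multinomial resampling is the assumed scheme (systematic or stratified resampling are common alternatives), and the skewed weight vector is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

def resample(particles, weights):
    """Draw N indices with replacement with probability proportional to the
    normalized weights, and reset all weights to 1/N."""
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)

particles = rng.normal(size=5)
weights = np.array([0.70, 0.15, 0.10, 0.04, 0.01])   # skewed normalized weights
new_particles, new_weights = resample(particles, weights)
print(new_particles)
print(new_weights)
```

Particles with large weights are typically duplicated several times while those with negligible weights disappear, which is exactly the behaviour described in the remarks below.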

Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only taken when regarded as necessary.

  Deterministic resampling is taken at every k-th time step (usually k = 1).
  Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^(i), W^(i)} are replaced by the new samples with equal importance weights (i.e. W^(i) = 1/N).

Resampling is important because

  if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
  when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ···, x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ∼ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ∼ p(x_i | x_{1,n}, ···, x_{i−1,n}, x_{i+1,n−1}, ···, x_{N_x,n−1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
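A minimal sketch of the alternating conditional sampling above for an assumed two-dimensional Gaussian target with correlation rho, where both one-dimensional conditionals are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, M = 0.9, 5000
x = np.zeros(2)                     # arbitrary starting state x_0
samples = np.empty((M, 2))
for n in range(M):
    # p(x_1 | x_2) = N(rho * x_2, 1 - rho^2)
    x[0] = rng.normal(rho * x[1], np.sqrt(1.0 - rho ** 2))
    # p(x_2 | x_1) = N(rho * x_1, 1 - rho^2)
    x[1] = rng.normal(rho * x[0], np.sqrt(1.0 - rho ** 2))
    samples[n] = x
print("empirical correlation ~=", np.corrcoef(samples.T)[0, 1])
```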

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n−1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the states with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | −∇∇ log f(x̂) |^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
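A minimal one-dimensional sketch of the Laplace formula above for an assumed integrand (a single Gaussian bump, so the approximation is essentially exact here); the mode is found by a crude grid search and the Hessian of log f by finite differences.

```python
import numpy as np

def f(x):
    # assumed integrand: an unnormalized Gaussian bump centered at 2 with width 0.3
    return np.exp(-0.5 * (x - 2.0) ** 2 / 0.3 ** 2)

xs = np.linspace(-5.0, 10.0, 200_001)
x_hat = xs[np.argmax(f(xs))]                     # crude mode search
h = 1e-4                                         # finite-difference second derivative of log f
d2_logf = (np.log(f(x_hat + h)) - 2.0 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2
laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * np.abs(-d2_logf) ** -0.5
exact = 0.3 * np.sqrt(2.0 * np.pi)               # closed-form integral of this particular f
print(laplace, exact)
```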


Iterative Quadrature

A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
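A minimal sketch of such a quadrature rule; Gauss-Legendre nodes on [a, b] are the assumed choice, and the weighting function p(x) is simply folded into the integrand rather than building a rule tailored to it.

```python
import numpy as np

def f(x):
    return np.sin(x) ** 2            # assumed smooth nonlinear function

def p(x):
    return np.exp(-x)                # assumed weighting function on [a, b]

a, b, m = 0.0, 2.0, 10
nodes, w = np.polynomial.legendre.leggauss(m)    # nodes and weights on [-1, 1]
x_k = 0.5 * (b - a) * nodes + 0.5 * (b + a)      # quadrature points mapped to [a, b]
c_k = 0.5 * (b - a) * w
print(np.sum(c_k * f(x_k) * p(x_k)))             # ~= integral over [a, b] of f(x) p(x) dx
```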

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N |x^(i)|^k

where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = Σ_{j=1}^m c_j N(x; μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
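A minimal sketch of evaluating a one-dimensional Gaussian-sum approximation; the three components and their weights are assumptions for the example.

```python
import numpy as np

def gaussian_sum_pdf(x, c, mu, sigma):
    """p(x) = sum_j c_j N(x; mu_j, sigma_j^2) evaluated at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    comps = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comps @ c

c = np.array([0.3, 0.5, 0.2])          # mixture weights, c_j > 0 and summing to 1
mu = np.array([-2.0, 0.0, 3.0])        # component means
sigma = np.array([0.5, 1.0, 0.8])      # component standard deviations
print(gaussian_sum_pdf([-2.0, 0.0, 3.0], c, mu, sigma))
```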

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^(i)) ≡ p̂(x_n|y_{n:0})

where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n|y_{n:0}) dx_n
          = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^(i)) dx_n
          = (1/N) Σ_{i=1}^N f(x_n^(i)) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0}|x_n) p(x_n) / q(x_n|y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) [p(x_n|y_{n:0}) / q(x_n|y_{n:0})] q(x_n|y_{n:0}) dx_n
          = ∫ f(x_n) [W_n(x_n) / p(y_{n:0})] q(x_n|y_{n:0}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n / ∫ p(y_{n:0}|x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n|y_{n:0}) dx_n
          = E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{n:0})}[W_n(x_n)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{n:0})}[W_n(x_n)]

By drawing i.i.d. samples x_n^(i) from q(x_n|y_{n:0}) we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) f(x_n^(i)) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) ]
          = Σ_{i=1}^N W̃(x_n^(i)) f(x_n^(i)) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^(i)) = W_n(x_n^(i)) / Σ_{i=1}^N W_n(x_n^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t−1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n−1:0}|y_{n−1:0}) · p(y_n|x_n) p(x_n|x_{n−1}) / p(y_n|y_{n−1:0})

We can then create a recursive rule to update the weights:

W_n^(i) = p(x_{n:0}^(i)|y_{n:0}) / q(x_{n:0}^(i)|y_{n:0})
        ∝ p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) p(x_{n−1:0}^(i)|y_{n−1:0}) / [ q(x_n^(i)|x_{n−1:0}^(i), y_{n:0}) q(x_{n−1:0}^(i)|y_{n−1:0}) ]
        = W_{n−1}^(i) · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^(i)|x_{n−1:0}^(i), y_{n:0}) = q(x_n^(i)|x_{n−1:0}^(i), y_n); then we can write

W_n^(i) = W_{n−1}^(i) · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_eff = N / E_{q(·|y_{n:0})}[(W(x_{n:0}))^2] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))^2

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
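A minimal sketch of the effective-sample-size check used to trigger resampling; the weight vector and the threshold P = N/2 are assumptions for the example.

```python
import numpy as np

def effective_sample_size(W_tilde):
    """N_eff ~= 1 / sum_i W_tilde_i^2 for normalized importance weights."""
    return 1.0 / np.sum(W_tilde ** 2)

W_tilde = np.array([0.96, 0.01, 0.01, 0.01, 0.01])    # heavily skewed normalized weights
N = len(W_tilde)
n_eff = effective_sample_size(W_tilde)
print("N_eff =", n_eff, "-> resample:", n_eff < N / 2)
```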

SIS particle filter with resampling

for n = 0, ···, T do
  for i = 1, ···, N do
    Draw samples x_n^(i) ∼ q(x_n|x_{n−1:0}^(i), y_{n:0})
    Set x_{n:0}^(i) = {x_{n−1:0}^(i), x_n^(i)}
  end for
  for i = 1, ···, N do
    Calculate weights W_n^(i) = W_{n−1}^(i) · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_n)
  end for
  for i = 1, ···, N do
    Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
  end for
  Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))^2
  if N_eff < P then
    Generate new x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W̃_n^(i)
    Reset the weights W_n^(i) = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ···, N do
  Sample x_0^(i) ∼ p(x_0)
  Compute W_0^(i) = 1/N
end for
for n = 0, ···, T do
  for i = 1, ···, N do
    Importance sampling: x̂_n^(i) ∼ p(x_n|x_{n−1}^(i))
  end for
  Set x̂_{n:0}^(i) = {x_{n−1:0}^(i), x̂_n^(i)}
  for i = 1, ···, N do
    Weight update: W_n^(i) = p(y_n|x̂_n^(i))
  end for
  for i = 1, ···, N do
    Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
  end for
  Resampling: generate N new particles x_n^(i) from the set {x̂_n^(i)} according to W̃_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
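A minimal end-to-end sketch of the bootstrap (SIR) filter above for an assumed scalar linear-Gaussian toy model x_n = 0.9 x_{n−1} + w_n, y_n = x_n + v_n with w_n ∼ N(0, 1) and v_n ∼ N(0, 0.5^2); the transition prior is the proposal, the weights are the likelihoods, and multinomial resampling is done at every step (no numerical safeguards).

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(T):
    x, xs, ys = 0.0, [], []
    for _ in range(T):
        x = 0.9 * x + rng.normal(0.0, 1.0)        # state equation
        xs.append(x)
        ys.append(x + rng.normal(0.0, 0.5))       # measurement equation
    return np.array(xs), np.array(ys)

def bootstrap_filter(ys, N=500):
    particles = rng.normal(0.0, 1.0, N)           # x_0^(i) ~ p(x_0)
    means = []
    for y in ys:
        particles = 0.9 * particles + rng.normal(0.0, 1.0, N)   # propose from p(x_n | x_{n-1})
        W = np.exp(-0.5 * ((y - particles) / 0.5) ** 2)          # W_n^(i) proportional to p(y_n | x_n^(i))
        W /= W.sum()                                             # normalize the weights
        means.append(np.sum(W * particles))                      # posterior-mean estimate before resampling
        idx = rng.choice(N, size=N, replace=True, p=W)           # resample with replacement
        particles = particles[idx]
    return np.array(means)

true_x, obs_y = simulate(50)
estimate = bootstrap_filter(obs_y)
print("RMSE:", np.sqrt(np.mean((estimate - true_x) ** 2)))
```

Replacing the unconditional resampling step with the N_eff test from the SIS filter turns this sketch into the SIS variant with adaptive resampling.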

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n−1}^α · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_n)

where 0 < α < 1 is an annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]
Popular CONDENSATION

Figure: Head tracking    Figure: Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking    Figure: Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking    Figure: Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 21: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Recursive Bayesian estimation I

Let p(xn|yn0) be the conditional pdf of xn

p(xn|yn0) =p(yn0|xn)p(xn)

p(yn0)

=p(yn ynminus10|xn)p(xn)

p(yn ynminus10)

=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)

p(yn|ynminus10)p(ynminus10)

=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)

p(yn|ynminus10)p(ynminus10)p(xn)

=p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It is a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $\mathbf{x} = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i \mid x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for
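To make the alternating conditional draws concrete, here is a minimal sketch (not from the lecture) of a Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho, where both full conditionals are Gaussian in closed form:

```python
import numpy as np

def gibbs_bivariate_gaussian(n_iter=5000, rho=0.8, rng=None):
    """Gibbs sampling from a zero-mean bivariate Gaussian with correlation rho.

    Each full conditional p(x_i | x_j) is Gaussian: mean rho*x_j, variance 1 - rho^2.
    """
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = 0.0, 0.0              # arbitrary starting state
    samples = np.empty((n_iter, 2))
    for n in range(n_iter):
        # Alternate the two conditional draws.
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho**2))
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho**2))
        samples[n] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian()
print(np.corrcoef(samples[1000:].T))   # empirical correlation, close to rho
```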

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1 \mid x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2 \mid x_{1,n})$. Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) across the subregions that divide the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find a state with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume

$$\int f(x)\,dx \approx (2\pi)^{N_x/2}\,f(\hat{x})\,\big|{-\nabla\nabla \log f(\hat{x})}\big|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

Works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
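As a small illustration (not from the lecture), the following sketch fits the Laplace approximation to a one-dimensional function using its mode and a numerical second derivative of $\log f$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Unnormalized density: an illustrative skewed choice f(x) = x^2 exp(-x) on x > 0.
f = lambda x: x**2 * np.exp(-x)
log_f = lambda x: 2.0 * np.log(x) - x

# 1) Find the mode x_hat of f(x).
res = minimize_scalar(lambda x: -log_f(x), bounds=(1e-6, 50.0), method="bounded")
x_hat = res.x

# 2) Numerical second derivative of log f at the mode (the 1-D "Hessian").
h = 1e-4
d2 = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# 3) Laplace estimate of the volume: (2*pi)^(1/2) * f(x_hat) * |-d2|^(-1/2).
laplace_volume = np.sqrt(2.0 * np.pi) * f(x_hat) * np.abs(-d2) ** -0.5
print(x_hat, laplace_volume)   # true integral of x^2 exp(-x) over (0, inf) is Gamma(3) = 2
```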

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69


Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
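For example, when the weighting function is Gaussian one can use Gauss-Hermite quadrature; the sketch below (an illustrative choice, not prescribed by the lecture) approximates $E[f(x)]$ for $x \sim \mathcal{N}(0,1)$:

```python
import numpy as np

# Gauss-Hermite quadrature: integral of f(x) * exp(-x^2) dx ~= sum_k w_k f(x_k).
# To get E[f(x)] under N(0,1), substitute x = sqrt(2)*t and divide by sqrt(pi).
nodes, weights = np.polynomial.hermite.hermgauss(20)

def gaussian_expectation(f):
    return np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

print(gaussian_expectation(lambda x: x**2))       # ~ 1.0, the variance of N(0,1)
print(gaussian_expectation(lambda x: np.cos(x)))  # ~ exp(-1/2) ~ 0.6065
```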

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n \mid y_{n:0})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
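A compact sketch of a point-mass (grid) filter update under illustrative assumptions (1-D state, Gaussian random-walk dynamics, Gaussian likelihood; none of these are specified in the lecture):

```python
import numpy as np
from scipy.stats import norm

# Rectangular grid of cell centers covering the (truncated) state space.
grid = np.linspace(-10, 10, 401)
dx = grid[1] - grid[0]
prior = norm.pdf(grid, 0.0, 2.0)          # p(x_0) evaluated on the grid
prior /= prior.sum() * dx                 # normalize so that sum * dx = 1

def grid_filter_step(density, y, q_std=0.5, r_std=1.0):
    # Prediction: apply the transition kernel p(x_n | x_{n-1}) = N(x_{n-1}, q_std^2).
    trans = norm.pdf(grid[:, None], loc=grid[None, :], scale=q_std)   # K[i, j] = p(grid_i | grid_j)
    predicted = trans @ (density * dx)
    # Update: multiply by the likelihood p(y_n | x_n) = N(x_n, r_std^2) and renormalize.
    posterior = predicted * norm.pdf(y, loc=grid, scale=r_std)
    return posterior / (posterior.sum() * dx)

post = grid_filter_step(prior, y=1.3)
print(grid[np.argmax(post)])   # grid point with the highest filtered density
```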

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_X x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^k$$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
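For instance, a quick numerical check of the sample-moment approximation (illustrative):

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000)   # samples from the "true" distribution
m2, m4 = np.mean(x**2), np.mean(x**4)
print(m2, m4)   # ~ 1 and ~ 3, the 2nd and 4th moments of N(0, 1)
```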

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

$$p(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(x;\,\mu_j, \Sigma_j)$$

where the weighting coefficients $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.
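A small sketch (with illustrative parameter choices) of evaluating such a Gaussian-sum density in one dimension:

```python
import numpy as np
from scipy.stats import norm

# A 3-component Gaussian sum: weights c_j > 0 summing to 1, means and scales chosen for illustration.
c = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([1.0, 0.5, 1.5])

def gaussian_sum_pdf(x):
    # p(x) = sum_j c_j N(x; mu_j, sigma_j^2)
    return np.sum(c * norm.pdf(x[..., None], loc=mu, scale=sigma), axis=-1)

xs = np.linspace(-6, 8, 5)
print(gaussian_sum_pdf(xs))
```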

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

$$p(x_n \mid y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n \mid y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n \mid y_{n:0})$.

By this approximation we can estimate the mean of a nonlinear function

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n \mid y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution $q(x_n \mid y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0} \mid x_n)\,p(x_n)}{q(x_n \mid y_{n:0})}$$

We can then write

$$\begin{aligned}
E[f(x_n)] &= \int f(x_n)\,\frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})}\,q(x_n \mid y_{n:0})\,dx_n \\
&= \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n \mid y_{n:0})\,dx_n \\
&= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}{\int p(y_{n:0} \mid x_n)\,p(x_n)\,dx_n} \\
&= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n} \\
&= \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}
\end{aligned}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n \mid y_{n:0})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$
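A minimal numerical sketch of this self-normalized importance-sampling estimator, with a Gaussian target and a wider Gaussian proposal chosen purely for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000

# Target p(x): standard Gaussian (in general known only up to a constant).
# Proposal q(x): a wider Gaussian, N(0, 2^2), which is easy to sample.
x = rng.normal(0.0, 2.0, size=N)
unnormalized_w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 2.0)

# Normalized weights and the self-normalized estimate of E[f(x)] under p.
W = unnormalized_w / unnormalized_w.sum()
f = lambda x: x**2
print(np.sum(W * f(x)))    # close to 1.0, the variance of the target
```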

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{n:0} \mid y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}$$

We can then create a recursive rule to update the weights

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})} \propto \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)} \mid y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}$$
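As a standalone sketch of this recursive update (the density functions, model, and toy numbers below are illustrative placeholders, not part of the lecture):

```python
import numpy as np
from scipy.stats import norm

def update_weights(W_prev, x_new, x_prev, y, likelihood, transition, proposal):
    """One step of the recursive SIS weight update (normalized at the end):
       W_n ∝ W_{n-1} * p(y_n | x_n) * p(x_n | x_{n-1}) / q(x_n | x_{n-1:0}, y_n).
    The three densities are user-supplied callables evaluated elementwise on particle arrays.
    """
    W = W_prev * likelihood(y, x_new) * transition(x_new, x_prev) / proposal(x_new, x_prev, y)
    return W / W.sum()

# Toy usage with Gaussian densities and the transition prior as proposal
# (the proposal then cancels the transition term, so W_n ∝ W_{n-1} p(y_n | x_n)).
N = 1000
x_prev = np.random.randn(N)
x_new = x_prev + 0.3 * np.random.randn(N)
W = update_weights(np.full(N, 1.0 / N), x_new, x_prev, y=0.5,
                   likelihood=lambda y, x: norm.pdf(y, x, 1.0),
                   transition=lambda xn, xp: norm.pdf(xn, xp, 0.3),
                   proposal=lambda xn, xp, y: norm.pdf(xn, xp, 0.3))
print(W.sum(), W.max())
```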

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n \mid y_{n:0})$ than in $p(x_{n:0} \mid y_{n:0})$.

Let's assume that $q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}[(W(x_{n:0}))^2]} \le N$$

In practice this cannot be computed, so we approximate it by

$$N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When $N_{\mathrm{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.
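A one-function sketch (illustrative) of this effective-sample-size diagnostic applied to normalized weights:

```python
import numpy as np

def effective_sample_size(normalized_weights):
    """N_eff ~= 1 / sum_i W_i^2 for normalized importance weights."""
    w = np.asarray(normalized_weights)
    return 1.0 / np.sum(w**2)

print(effective_sample_size(np.full(100, 0.01)))                  # even weights -> 100
print(effective_sample_size(np.r_[0.99, np.full(99, 0.01 / 99)])) # skewed weights -> close to 1
```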

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0, …, T do
  for i = 1, …, N do
    Draw samples $x_n^{(i)} \sim q(x_n \mid x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, …, N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, …, N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) \big/ \sum_{j=1}^{N} W(x^{(j)})$
  end for
  Compute $N_{\mathrm{eff}} = 1 \big/ \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2$
  if $N_{\mathrm{eff}} < P$ then
    Generate new $x_n^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$, with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, …, N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, …, T do
  for i = 1, …, N do
    Importance sampling: $x_n^{(i)} \sim p(x_n \mid x_{n-1}^{(i)})$
  end for
  Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  for i = 1, …, N do
    Weight update: $W_n^{(i)} = p(y_n \mid x_n^{(i)})$
  end for
  for i = 1, …, N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) \big/ \sum_{j=1}^{N} W(x^{(j)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
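Below is a minimal runnable sketch of this bootstrap filter for a hypothetical 1-D model (random-walk dynamics with Gaussian noise and a Gaussian observation of the state); the model, noise levels, and particle count are illustrative choices, not something specified in the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 50, 500
q_std, r_std = 0.3, 0.5            # process / measurement noise (illustrative)

# Simulate a hypothetical system: x_n = x_{n-1} + w_n,  y_n = x_n + v_n.
x_true = np.cumsum(rng.normal(0.0, q_std, T))
y = x_true + rng.normal(0.0, r_std, T)

particles = rng.normal(0.0, 1.0, N)    # samples from the prior p(x_0)
estimates = []
for n in range(T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}).
    particles = particles + rng.normal(0.0, q_std, N)
    # Weight update with the likelihood p(y_n | x_n), then normalize.
    W = np.exp(-0.5 * ((y[n] - particles) / r_std) ** 2)
    W /= W.sum()
    # Posterior mean estimate, computed before resampling.
    estimates.append(np.sum(W * particles))
    # Resampling with replacement according to the normalized weights.
    particles = rng.choice(particles, size=N, replace=True, p=W)

print(np.mean(np.abs(np.array(estimates) - x_true)))   # mean absolute tracking error
```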

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n^{(i)} = \big(W_{n-1}^{(i)}\big)^{\alpha}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Embedded video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Embedded videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Embedded videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Embedded videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

• Introduction
• Monte Carlo sampling
  • Introduction
  • Types
  • Importance Sampling
  • Rejection sampling
  • Sequential Importance Sampling
  • Sequential Importance Sampling
  • Gibbs sampling
  • Other sampling
• Other numerical approximations
  • Laplace approximation
  • Iterative Quadrature
  • Multi-grid approximation
  • Moment approximation
  • Gaussian sum approximation
• Particle filters
  • Introduction
  • SIR filter
  • Bootstrap/SIR filter
  • Condensation
• Conclusions

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
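
For instance (with hypothetical samples), the sample moments are straightforward to compute:

    import numpy as np

    x = np.random.standard_t(df=5, size=10_000)                 # samples x^{(i)} from some true distribution
    m_hat = {k: np.mean(np.abs(x) ** k) for k in (1, 2, 3, 4)}  # sample estimates of m_k = E[|x|^k]
    print(m_hat)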

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = Σ_{j=1}^m c_j N(x; μ_j, Σ_j)

where the weighting coefficients c_j > 0 and Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
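
A small sketch (assumed weights, means and variances) of a 1-D Gaussian sum, together with the single Gaussian obtained by matching the first- and second-order moments of the mixture:

    import numpy as np

    # Hypothetical 1-D Gaussian sum: weights, means, variances of m = 3 components.
    c = np.array([0.3, 0.5, 0.2])
    mu = np.array([-2.0, 0.5, 3.0])
    var = np.array([0.5, 1.0, 0.8])

    def gauss(x, m, v):
        return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

    def p(x):
        # p(x) = sum_j c_j N(x; mu_j, var_j)
        return np.sum(c * gauss(x, mu, var))

    # Moment-matched single Gaussian for the whole sum.
    mean = np.sum(c * mu)
    variance = np.sum(c * (var + (mu - mean) ** 2))
    print(p(0.0), mean, variance)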

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with its evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{n:0}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^{(i)}) ≡ p̂(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation we can estimate the mean of a nonlinear function

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{n:0}) dx_n
          = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^{(i)}) dx_n
          = (1/N) Σ_{i=1}^N f(x_n^{(i)}) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
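
A minimal numerical illustration (the "posterior" samples here are simply drawn from a stand-in Gaussian, since the point is only the sample-average approximation of E[f(x_n)]):

    import numpy as np

    samples = np.random.normal(loc=1.0, scale=0.5, size=5000)   # stand-in posterior samples x^{(i)}
    f = lambda x: np.cos(x) + x ** 2                             # some nonlinear function of the state
    f_hat = np.mean(f(samples))                                  # (1/N) sum_i f(x^{(i)}) ~ E[f(x_n) | y_{n:0}]
    print(f_hat)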


Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0} | x_n) p(x_n) / q(x_n | y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) [ p(x_n | y_{n:0}) / q(x_n | y_{n:0}) ] q(x_n | y_{n:0}) dx_n
          = ∫ f(x_n) [ W_n(x_n) / p(y_{n:0}) ] q(x_n | y_{n:0}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ p(y_{n:0} | x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n | y_{n:0}) dx_n
          = E_{q(x_n|y_{n:0})}[ W_n(x_n) f(x_n) ] / E_{q(x_n|y_{n:0})}[ W_n(x_n) ]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n|y_{n:0})}[ W_n(x_n) f(x_n) ] / E_{q(x_n|y_{n:0})}[ W_n(x_n) ]

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)}) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^{(i)}) ]
          = Σ_{i=1}^N W̃(x_n^{(i)}) f(x_n^{(i)}) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^{(i)}) = W_n(x_n^{(i)}) / Σ_{i=1}^N W_n(x_n^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
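
A hedged sketch of the self-normalized estimator above, with an assumed Gaussian proposal q and an assumed unnormalized target p(y|x)p(x); all names and numbers are illustrative.

    import numpy as np

    N = 5000
    x = np.random.normal(0.0, 2.0, size=N)                        # x^{(i)} ~ q (Gaussian proposal, std 2)
    q_pdf = np.exp(-0.5 * x ** 2 / 4.0) / np.sqrt(2 * np.pi * 4.0)
    target = np.exp(-0.5 * (x - 1.0) ** 2)                        # unnormalized p(y|x)p(x)

    W = target / q_pdf                                            # unnormalized importance weights
    W_tilde = W / W.sum()                                         # normalized weights
    f = lambda x: x ** 2
    f_hat = np.sum(W_tilde * f(x))                                # self-normalized estimate of E[f(x)]
    print(f_hat)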

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0} | y_{n:0}) = q(x_0) Π_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) p(y_n | x_n) p(x_n | x_{n-1}) / p(y_n | y_{n-1:0})

We can then create a recursive rule to update the weights

W_n^{(i)} = p(x_{n:0}^{(i)} | y_{n:0}) / q(x_{n:0}^{(i)} | y_{n:0})
          ∝ [ p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0}) ] / [ q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0}) ]
          = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_eff = N / E_{q(·|y_{n:0})}[ (W(x_{n:0}))^2 ] ≤ N

In practice this cannot be computed, so we approximate it using the normalized weights:

N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{n:0}^{(i)}))^2

When Neff is below a threshold P then resampling is performed

N_eff can also be used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
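
The approximate effective sample size is one line of code (a sketch; the weights are assumed already normalized):

    import numpy as np

    def effective_sample_size(weights):
        # weights: normalized importance weights, summing to 1
        w = np.asarray(weights)
        return 1.0 / np.sum(w ** 2)

    # Uniform weights give N_eff = N; one dominant weight gives N_eff close to 1.
    print(effective_sample_size(np.full(100, 0.01)))           # -> 100.0
    print(effective_sample_size([0.97] + [0.03 / 99] * 99))    # -> close to 1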

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
  end for
  Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}^{(i)}))^2
  if N_eff < P then
    Generate new particles x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W̃_n^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
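
The resampling step of the algorithm above can be sketched as follows (multinomial resampling with replacement; systematic or stratified resampling are common lower-variance alternatives; the names are illustrative):

    import numpy as np

    def multinomial_resample(particles, weights, rng=None):
        # particles: array of shape (N, ...); weights: normalized importance weights.
        rng = np.random.default_rng() if rng is None else rng
        N = len(particles)
        idx = rng.choice(N, size=N, replace=True, p=weights)   # draw indices with probability W̃^{(i)}
        return particles[idx], np.full(N, 1.0 / N)             # surviving particles + weights reset to 1/N

    # Usage: particles, weights = multinomial_resample(particles, weights) whenever N_eff < P.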

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in sampling-importance resampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to the weights W̃_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
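
To make the two algorithm boxes concrete, here is a self-contained sketch of a bootstrap (SIR) particle filter on an assumed 1-D nonlinear model of the kind often used as a benchmark in the particle filtering literature; the model, noise levels, sample sizes and variable names are illustrative assumptions, not taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed model:  x_n = 0.5 x_{n-1} + 25 x_{n-1} / (1 + x_{n-1}^2) + w_n,  w_n ~ N(0, 10)
    #                 y_n = x_n^2 / 20 + v_n,                                   v_n ~ N(0, 1)
    def propagate(x):
        return 0.5 * x + 25.0 * x / (1.0 + x ** 2) + rng.normal(0.0, np.sqrt(10.0), size=x.shape)

    def likelihood(y, x):
        return np.exp(-0.5 * (y - x ** 2 / 20.0) ** 2)   # Gaussian measurement noise, unit variance

    # Simulate a short trajectory.
    T = 50
    x_true = np.zeros(T)
    y = np.zeros(T)
    for n in range(1, T):
        x_true[n] = propagate(x_true[n - 1 : n])[0]
        y[n] = x_true[n] ** 2 / 20.0 + rng.normal()

    # Bootstrap (SIR) filter: proposal = transition prior, weights = likelihood, resample every step.
    N = 1000
    particles = rng.normal(0.0, 1.0, size=N)               # x_0^{(i)} ~ p(x_0)
    x_est = np.zeros(T)
    for n in range(1, T):
        particles = propagate(particles)                    # importance sampling from p(x_n | x_{n-1}^{(i)})
        w = likelihood(y[n], particles)                     # weight update W_n^{(i)} = p(y_n | x_n^{(i)})
        w /= w.sum()                                        # normalize the weights
        x_est[n] = np.sum(w * particles)                    # MMSE estimate of x_n
        idx = rng.choice(N, size=N, replace=True, p=w)      # resampling with replacement
        particles = particles[idx]

    print("mean absolute deviation:", np.mean(np.abs(x_est - x_true)))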

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^α p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)

where 0 < α < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]
[Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]
[Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]
[Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • Bootstrap/SIR filter
            • Condensation
              • Conclusions

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

\[
E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}.
\]

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

\[
E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)}
= \sum_{i=1}^{N} \tilde{W}\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big) \equiv \hat{f}(x),
\]

where the normalized weights are defined as

\[
\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^{N} W_n\big(x_n^{(j)}\big)}.
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
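A small self-contained sketch of this self-normalized importance sampling estimate; the Gaussian target and the wider Gaussian proposal below are my own illustrative choices, not quantities given in the lecture.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 20_000

    # Target p: standard normal. Proposal q: wider normal (better tail coverage).
    def p_pdf(x):
        return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

    def q_pdf(x, scale=2.0):
        return np.exp(-0.5 * (x / scale)**2) / (scale * np.sqrt(2 * np.pi))

    x = rng.normal(0.0, 2.0, size=N)      # x^{(i)} ~ q
    W = p_pdf(x) / q_pdf(x)               # unnormalized importance weights W(x^{(i)})
    W_tilde = W / W.sum()                 # normalized weights

    f_hat = np.sum(W_tilde * x**2)        # self-normalized estimate of E_p[x^2]
    print(f"IS estimate of E[x^2]: {f_hat:.3f}  (exact: 1.0)")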

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

\[
q(x_{n:0} \mid y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0}).
\]

As before, the posterior can be written as

\[
p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,
\frac{p(y_n \mid x_n)\, p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}.
\]

We can then create a recursive rule to update the weights:

\[
W_n^{(i)} = \frac{p\big(x_{n:0}^{(i)} \mid y_{n:0}\big)}{q\big(x_{n:0}^{(i)} \mid y_{n:0}\big)}
\propto \frac{p\big(y_n \mid x_n^{(i)}\big)\, p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)\, p\big(x_{n-1:0}^{(i)} \mid y_{n-1:0}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}\big)\, q\big(x_{n-1:0}^{(i)} \mid y_{n-1:0}\big)}
= W_{n-1}^{(i)}\, \frac{p\big(y_n \mid x_n^{(i)}\big)\, p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}\big)}.
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
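In code, this recursive weight update is a one-liner per time step. The sketch below is a generic helper written under the assumption that the caller supplies vectorized density functions for the likelihood p(y_n|x_n), the transition p(x_n|x_{n-1}) and the proposal q; all three argument names are placeholders introduced here for illustration.

    import numpy as np

    def update_weights(W_prev, x_new, x_prev, y_new, likelihood, transition, proposal):
        """One recursive weight update W_n ∝ W_{n-1} p(y|x) p(x|x') / q(x|x', y),
        vectorized over the N particles; returns normalized weights."""
        W = W_prev * likelihood(y_new, x_new) * transition(x_new, x_prev) \
                   / proposal(x_new, x_prev, y_new)
        return W / W.sum()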

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

\[
W_n^{(i)} = W_{n-1}^{(i)}\,
\frac{p\big(y_n \mid x_n^{(i)}\big)\, p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n\big)}.
\]

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

\[
N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N.
\]

In practice this cannot be computed, so we approximate it as

\[
N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{n:0}^{(i)})\big)^2}.
\]

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
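The approximate effective sample size is straightforward to compute from the normalized weights; a minimal sketch (the example weights and the resampling threshold are chosen arbitrarily for illustration):

    import numpy as np

    def effective_sample_size(w_norm):
        """N_eff ≈ 1 / sum_i w_i^2 for normalized weights w_i."""
        w = np.asarray(w_norm)
        return 1.0 / np.sum(w ** 2)

    w = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # skewed weights, N = 5
    print(effective_sample_size(w))                # ~1.9; resample if below a threshold P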

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weight W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weight W~_n^{(i)} = W_n^{(i)} / sum_{j=1}^{N} W_n^{(j)}
    end for
    Compute N_eff = 1 / sum_{i=1}^{N} (W~_n^{(i)})^2
    if N_eff < P then
        Generate new particles x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}},
        with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W~_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
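The pseudocode above maps fairly directly onto NumPy. The following is a sketch of my own for a one-dimensional state: the model is supplied through caller-provided samplers and densities (all argument names are placeholders), the filtered mean is taken before resampling, and resampling is triggered when N_eff falls below a fraction of N. It is not the lecture's reference implementation.

    import numpy as np

    def sis_particle_filter(ys, x0_sampler, proposal_sampler, proposal_pdf,
                            transition_pdf, likelihood_pdf, N=500,
                            resample_frac=0.5, rng=None):
        """SIS particle filter with resampling for a scalar state (sketch)."""
        rng = np.random.default_rng() if rng is None else rng
        particles = x0_sampler(N, rng)                    # x_0^{(i)}
        weights = np.full(N, 1.0 / N)
        means = []

        for y in ys:
            prev = particles
            particles = proposal_sampler(prev, y, rng)    # x_n^{(i)} ~ q(. | x_{n-1}^{(i)}, y_n)
            weights = weights * likelihood_pdf(y, particles) \
                              * transition_pdf(particles, prev) \
                              / proposal_pdf(particles, prev, y)
            weights /= weights.sum()                      # normalize

            means.append(np.sum(weights * particles))     # estimate before resampling

            if 1.0 / np.sum(weights ** 2) < resample_frac * N:   # N_eff < P
                idx = rng.choice(N, size=N, p=weights)           # multinomial resampling
                particles = particles[idx]
                weights = np.full(N, 1.0 / N)                    # reset weights to 1/N

        return np.array(means)

With the bootstrap choice q(x_n | x_{n-1:0}, y_n) = p(x_n | x_{n-1}), the weight update collapses to the likelihood alone, which is the SIR variant shown a couple of slides below.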

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set x̂_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x̂_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x̂_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weight W~_n^{(i)} = W_n^{(i)} / sum_{j=1}^{N} W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x̂_n^{(i)}} according to W~_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
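For concreteness, here is an end-to-end sketch of the bootstrap/SIR filter on a toy one-dimensional random-walk model with Gaussian observation noise; the model, noise levels and particle count are my own illustrative assumptions, not part of the lecture.

    import numpy as np

    # Toy model:  x_n = x_{n-1} + w_n,  w_n ~ N(0, 0.1^2)
    #             y_n = x_n + v_n,      v_n ~ N(0, 0.5^2)
    rng = np.random.default_rng(42)
    T, N = 100, 1000
    sig_w, sig_v = 0.1, 0.5

    x_true = np.cumsum(rng.normal(0, sig_w, T))       # simulate ground truth
    ys = x_true + rng.normal(0, sig_v, T)             # simulated observations

    particles = rng.normal(0, 1.0, N)                 # x_0^{(i)} ~ p(x_0)
    estimates = []
    for y in ys:
        particles = particles + rng.normal(0, sig_w, N)    # propagate with transition prior
        W = np.exp(-0.5 * ((y - particles) / sig_v) ** 2)   # weight = likelihood p(y_n | x_n)
        W /= W.sum()
        estimates.append(np.sum(W * particles))             # posterior-mean estimate
        idx = rng.choice(N, size=N, p=W)                    # resample every step (SIR)
        particles = particles[idx]

    print("RMSE:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))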

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can also be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

\[
W_n = \big(W_{n-1}\big)^{\alpha}\,
\frac{p\big(y_n \mid x_n^{(i)}\big)\, p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n\big)},
\]

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
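One possible reading of this annealed update in code (the helper name and the default α are illustrative assumptions):

    import numpy as np

    def annealed_weight_update(W_prev, incremental, alpha=0.5):
        """Damp the previous weights (0 < alpha < 1) before multiplying in the
        incremental weight p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n)."""
        W = (W_prev ** alpha) * incremental
        return W / W.sum()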

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions

Recursive Bayesian estimation II

The posterior density is described with three terms:

$$p(x_n | y_{0:n}) = \frac{p(y_n | x_n) \, p(x_n | y_{0:n-1})}{p(y_n | y_{0:n-1})}$$

Prior: defines the knowledge of the model,

$$p(x_n | y_{0:n-1}) = \int p(x_n | x_{n-1}) \, p(x_{n-1} | y_{0:n-1}) \, dx_{n-1}$$

Likelihood: $p(y_n | x_n)$ determines the measurement noise model.

Evidence: involves the integral

$$p(y_n | y_{0:n-1}) = \int p(y_n | x_n) \, p(x_n | y_{0:n-1}) \, dx_n$$

We need to define a criterion for optimal filtering.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
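
As a sanity check of this recursion, here is a sketch that runs the prediction and update steps on a discretized grid for an assumed scalar model; the Gaussian transition/likelihood widths and the observations are made up.

```python
import numpy as np

# Prediction / update recursion on a discretized grid for an assumed scalar model:
# Gaussian random-walk transition (std 0.5), Gaussian likelihood (std 0.7),
# N(0, 1) initial prior and made-up observations.
grid = np.linspace(-5, 5, 401)
dx = grid[1] - grid[0]

def gauss(u, s):
    return np.exp(-0.5 * (u / s) ** 2) / (np.sqrt(2 * np.pi) * s)

posterior = gauss(grid, 1.0)                              # p(x_0)
trans = gauss(grid[:, None] - grid[None, :], 0.5)         # p(x_n | x_{n-1}) on the grid
for y in [0.4, 0.9, 1.3]:                                 # hypothetical measurements
    prior = trans @ posterior * dx                        # prediction integral
    lik = gauss(y - grid, 0.7)                            # likelihood p(y_n | x_n)
    evidence = np.sum(lik * prior) * dx                   # p(y_n | y_{0:n-1})
    posterior = lik * prior / evidence                    # Bayes update
print("filtered mean:", np.sum(grid * posterior) * dx)
```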

Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error

$$E[\|x_n - \hat{x}_n\|_2^2 \, | \, y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2 \, p(x_n | y_{0:n}) \, dx_n$$

which is aimed at finding the conditional mean

$$\hat{x}_n = E[x_n | y_{0:n}] = \int x_n \, p(x_n | y_{0:n}) \, dx_n$$

Maximum a posteriori (MAP): aims at finding the mode of the posterior probability $p(x_n | y_{0:n})$.

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: aims at finding the median of the posterior $p(x_n | y_{0:n})$.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
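
A small sketch of how the three point estimates differ on a skewed distribution, using samples from an assumed Gamma-shaped posterior; the distribution and the kernel-density step used to locate the mode are illustrative choices.

```python
import numpy as np
from scipy import stats

# Mean (MMSE), mode (MAP) and median (minimax) of a skewed, unimodal density,
# approximated from samples of an assumed Gamma(2, 1.5) distribution.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.5, size=20_000)

mean = samples.mean()                         # MMSE estimate
median = np.median(samples)                   # minimax estimate
kde = stats.gaussian_kde(samples)             # density estimate to locate the mode
grid = np.linspace(0, samples.max(), 1000)
mode = grid[np.argmax(kde(grid))]             # MAP estimate
print(f"mean={mean:.2f} median={median:.2f} mode={mode:.2f}")
```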

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

$$E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)} \, dx \, dy$$

Minimum conditional KL divergence:

$$KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y) \, p(x)} \, dx \, dy$$

where the KL is a measure of divergence between distributions such that $KL(p\|\hat{p}) \ge 0$, with equality only when the distributions are the same.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: it is a lower bound of the maximum log-likelihood, which is aimed to minimize

$$\mathcal{F}(Q; P) \equiv E_{Q(x)}[-\log P(x|y)] = E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)] = KL(Q\|P) - H(Q)$$

This minimization can be done using the expectation-maximization (EM) algorithm:

$$Q(x_{n+1}) \leftarrow \arg\max_Q \mathcal{F}(Q, P), \qquad x_{n+1} \leftarrow \arg\max_x \mathcal{F}(Q, P)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML-oriented; they are very related.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

$$E[\|x_n - \hat{x}_n\|_2^2 \, | \, y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2 \, p(x_n | y_{0:n}) \, dx_n$$

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$

When the dynamic system is linear Gaussian, this reduces to

$$x_{n+1} = F_{n+1,n} x_n + w_n$$
$$y_n = G_n x_n + v_n$$

with $F_{n+1,n}$ the transition matrix and $G_n$ the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

$$\alpha(z_n) = \mathcal{N}(z_n | \mu_n, V_n)$$

Using the HMM recursion formulas for continuous variables we have

$$c_n \alpha(z_n) = p(x_n | z_n) \int \alpha(z_{n-1}) \, p(z_n | z_{n-1}) \, dz_{n-1}$$

Substituting the conditionals we have

$$c_n \mathcal{N}(z_n | \mu_n, V_n) = \mathcal{N}(x_n | C z_n, \Sigma) \int \mathcal{N}(z_{n-1} | \mu_{n-1}, V_{n-1}) \, \mathcal{N}(z_n | A z_{n-1}, \Gamma) \, dz_{n-1} = \mathcal{N}(x_n | C z_n, \Sigma) \, \mathcal{N}(z_n | A \mu_{n-1}, P_{n-1})$$

Here we assume that $\mu_{n-1}$ and $V_{n-1}$ are known, and we have defined

$$P_{n-1} = A V_{n-1} A^T + \Gamma$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of $\mu_{n-1}$, $V_{n-1}$ and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$ having mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$:

$$\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})$$
$$V_n = (I - K_n C) P_{n-1}$$
$$c_n = \mathcal{N}(x_n | C A \mu_{n-1}, \, C P_{n-1} C^T + \Sigma)$$

where the Kalman gain matrix is defined as

$$K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}$$

The initial conditions are given by

$$\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0), \qquad V_1 = (I - K_1 C) V_0$$
$$c_1 = \mathcal{N}(x_1 | C \mu_0, \, C V_0 C^T + \Sigma), \qquad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}$$

The interpretation is making predictions and doing corrections with $K_n$.

The likelihood can be computed as $p(X) = \prod_{n=1}^N c_n$.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
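
A compact sketch of this forward pass in the notation above, with made-up one-dimensional matrices; it only shows where the prediction, gain and correction enter, and does not reproduce any particular system from the lecture.

```python
import numpy as np

# Forward (filtering) pass for a linear-Gaussian model:
# z_n = A z_{n-1} + w, w ~ N(0, Gamma);  x_n = C z_n + v, v ~ N(0, Sigma).
A, Gamma = np.array([[1.0]]), np.array([[0.1]])
C, Sigma = np.array([[1.0]]), np.array([[0.5]])
mu0, V0 = np.array([0.0]), np.array([[1.0]])
X = [np.array([0.3]), np.array([0.5]), np.array([1.1])]   # observations x_1..x_N

mu, V = mu0, V0
for n, x in enumerate(X):
    if n == 0:
        P, pred_mean = V0, mu0                 # first step uses the prior directly
    else:
        P = A @ V @ A.T + Gamma                # P_{n-1} = A V_{n-1} A^T + Gamma
        pred_mean = A @ mu
    S = C @ P @ C.T + Sigma
    K = P @ C.T @ np.linalg.inv(S)             # Kalman gain K_n
    mu = pred_mean + K @ (x - C @ pred_mean)   # correction of the prediction
    V = (np.eye(len(mu)) - K @ C) @ P
    print(f"n={n + 1}: mean={mu}, var={V.ravel()}")
```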

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding $p(x | y_{0:n})$.

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$

These equations are linearized in the EKF:

$$F_{n+1,n} = \left.\frac{d f(x)}{dx}\right|_{x = \hat{x}_n}, \qquad G_{n+1,n} = \left.\frac{d g(x)}{dx}\right|_{x = \hat{x}_{n|n-1}}$$

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior $p(x_n | y_{0:n})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
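
A sketch of one EKF predict/update step on an assumed scalar model with analytic Jacobians; the functions f and g below are a common toy choice made purely for illustration, not a model given in the lecture.

```python
import numpy as np

# One EKF predict/update step for an assumed scalar model
# f(x) = 0.5 x + 25 x / (1 + x^2),  g(x) = x^2 / 20, with analytic Jacobians.
def f(x):     return 0.5 * x + 25 * x / (1 + x ** 2)
def g(x):     return x ** 2 / 20.0
def F_jac(x): return 0.5 + 25 * (1 - x ** 2) / (1 + x ** 2) ** 2
def G_jac(x): return x / 10.0

Q, R = 1.0, 1.0                    # assumed process / measurement noise variances
x_est, P = 0.1, 1.0                # previous posterior mean and variance
y = 2.0                            # new (made-up) observation

# Predict, linearizing f around the current estimate
x_pred = f(x_est)
F = F_jac(x_est)
P_pred = F * P * F + Q
# Update, linearizing g around the prediction
G = G_jac(x_pred)
S = G * P_pred * G + R
K = P_pred * G / S
x_est = x_pred + K * (y - g(x_pred))
P = (1 - K * G) * P_pred
print(x_est, P)
```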

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It is a brute-force technique: provided that one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, then

$$\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$

for which $E[\hat{f}_N] = E[f]$ and $Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}$.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ with high probability.

The convergence rate is assessed by the Central Limit Theorem:

$$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
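
A tiny numerical illustration of the $O(N^{-1/2})$ behaviour, estimating $E[x^2]$ under an assumed standard normal (true value 1) for increasing sample sizes.

```python
import numpy as np

# Monte Carlo estimate of E[f(x)] for f(x) = x^2 with x ~ N(0, 1);
# the absolute error shrinks roughly like 1 / sqrt(N).
rng = np.random.default_rng(0)
for N in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(N)
    est = np.mean(x ** 2)
    print(f"N={N:>9d}  estimate={est:.4f}  abs error={abs(est - 1):.4f}")
```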

Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

$$\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$

There are two fundamental problems:

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. $E[f(x)] = \int f(x) \, dP(x)$?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to the gross measurement errors and the uncertainties of the model.

Minimal variance.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$$\int f(x) \, p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx$$

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses $N$ independent samples drawn from $q(x)$ to approximate

$$\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) \, f(x^{(i)})$$

where $W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)})$ are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized,

$$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) \, f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}$$

The variance of the estimate is given by

$$Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \frac{(f(x) p(x))^2}{q(x)} \, dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when $q(x)$ is chosen to:

match the shape of $p(x)$ so as to approximate the true variance;

match the shape of $|f(x)| p(x)$ so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
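
A sketch of self-normalized importance sampling for an assumed standard-normal target with a heavier-tailed Student-t proposal; both distributions and the test function are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Self-normalized importance sampling estimate of E_p[f(x)], target p = N(0, 1),
# proposal q = Student-t with 3 degrees of freedom (heavier tails than the target).
rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_t(df=3, size=N)                 # draws from the proposal q
log_p = stats.norm.logpdf(x)                     # target density (unnormalized would also work)
log_q = stats.t.logpdf(x, df=3)
w = np.exp(log_p - log_q)                        # importance weights W(x) = p(x) / q(x)
w_tilde = w / w.sum()                            # normalized weights
f = x ** 2                                       # test function, true E_p[f] = 1
print("IS estimate of E[x^2]:", np.sum(w_tilde * f))
```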

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution $q(x)$ should have a heavy tail so as to be insensitive to outliers.

If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant $C < \infty$ such that $p(x) < C q(x)$ for every $x \in X$. The sampling procedure is:

for $n = 1$ to $N$ do
  Sample $u \sim U(0, 1)$
  Sample $x \sim q(x)$
  if $u > \frac{p(x)}{C q(x)}$ then
    Repeat sampling
  end if
end for

Figure Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant $C$.

The choice of $C$ is critical:

if $C$ is too small, the samples are not reliable because of the low rejection rate;

if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior $p(x)$ is used as $q(x)$, the likelihood $p(y|x) \le C$, and $C$ is known, then

$$p(x|y) = \frac{p(y|x) \, p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)$$

and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}$.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
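
A small sketch of the acceptance loop for an assumed Beta(2, 5) target with a uniform proposal; the bound C is chosen just above the maximum of that particular density.

```python
import numpy as np
from scipy import stats

# Rejection sampling from an assumed Beta(2, 5) target with proposal q = U(0, 1);
# C bounds p(x) / q(x) (the Beta(2, 5) density peaks at about 2.46).
rng = np.random.default_rng(0)
target = stats.beta(2, 5)
C = 2.46

samples, tried = [], 0
while len(samples) < 5_000:
    x = rng.uniform()                  # sample x ~ q(x)
    u = rng.uniform()                  # sample u ~ U(0, 1)
    tried += 1
    if u <= target.pdf(x) / C:         # accept with probability p(x) / (C q(x))
        samples.append(x)
print("acceptance rate:", len(samples) / tried, "(roughly 1/C =", round(1 / C, 3), ")")
```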

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant $C$, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio $p(x) / (C q(x))$ is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$$q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})$$

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

$$p(x_{0:n}) = p(x_0) \, p(x_1 | x_0) \cdots p(x_n | x_0, \cdots, x_{n-1})$$
$$q(x_{0:n}) = q_0(x_0) \, q_1(x_1 | x_0) \cdots q_n(x_n | x_0, \cdots, x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1}) \, \frac{p(x_n | x_{0:n-1})}{q_n(x_n | x_{0:n-1})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested to be used after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed to eliminate the samples with small importance weights and duplicate the samples with big weights.

Sample $N$ random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for $i = 1, \cdots, N$ do
  $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for $i = 1, \cdots, N$ do
  Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})$
end for
Resample with replacement $N$ times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
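
A sketch of the sampling-importance-resampling recipe for an assumed bimodal target and a wide Gaussian proposal; both densities are made up for illustration.

```python
import numpy as np

# Sampling-importance-resampling: weight draws from a proposal q by p/q,
# normalize, then resample with replacement according to those weights.
rng = np.random.default_rng(0)
N = 50_000

def p(x):   # unnormalized bimodal target
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

q_std = 4.0
x = rng.normal(0.0, q_std, size=N)                          # draws from q = N(0, 4^2)
q_pdf = np.exp(-0.5 * (x / q_std) ** 2) / (np.sqrt(2 * np.pi) * q_std)
w = p(x) / q_pdf                                            # W^(i) proportional to p/q
w /= w.sum()                                                # normalize
resampled = x[rng.choice(N, size=N, p=w)]                   # resampling step
print("weighted mean:", np.sum(w * x), " resampled mean:", resampled.mean())
```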

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary.

Deterministic resampling is taken at every $k$-th time step (usually $k = 1$). Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e. $W^{(i)} = 1/N$).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;

when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for $n = 1$ to $M$ do
  for $i = 1$ to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
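
A minimal Gibbs sampler for an assumed bivariate Gaussian with correlation rho, alternating between the two full conditionals; the correlation value and chain length are arbitrary.

```python
import numpy as np

# Gibbs sampling for a standard bivariate Gaussian with correlation rho,
# alternating between the full conditionals p(x1 | x2) and p(x2 | x1).
rng = np.random.default_rng(0)
rho, M = 0.8, 20_000
x1, x2 = 0.0, 0.0                      # arbitrary starting state
chain = np.empty((M, 2))
for n in range(M):
    # p(x1 | x2) = N(rho * x2, 1 - rho^2)
    x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    # p(x2 | x1) = N(rho * x1, 1 - rho^2)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    chain[n] = (x1, x2)
print("sample correlation:", np.corrcoef(chain[5000:].T)[0, 1])   # close to rho
```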

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1 | x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2 | x_{1,n})$. Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be taken with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x) \, dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:

$$\int f(x) \, dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, \left| -\nabla\nabla \log f(\hat{x}) \right|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
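
A sketch of the Laplace approximation of a one-dimensional integral for an assumed integrand f(x) = exp(-x^2/2 - x^4/4), compared against numerical quadrature; the integrand is an arbitrary unimodal choice.

```python
import numpy as np
from scipy import optimize, integrate

# Laplace approximation of Z = integral of f(x) dx: fit a Gaussian at the
# maximum x_hat of f and use Z ~ sqrt(2 pi) f(x_hat) |(-log f)''(x_hat)|^(-1/2).
f = lambda x: np.exp(-x ** 2 / 2 - x ** 4 / 4)
neg_log_f = lambda x: x ** 2 / 2 + x ** 4 / 4
curvature = lambda x: 1 + 3 * x ** 2            # second derivative of -log f

x_hat = optimize.minimize_scalar(neg_log_f).x   # maximum of f (about 0)
laplace = np.sqrt(2 * np.pi) * f(x_hat) / np.sqrt(curvature(x_hat))
exact, _ = integrate.quad(f, -np.inf, np.inf)
print(f"Laplace ~ {laplace:.4f}, quadrature ~ {exact:.4f}")
```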

Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

$$\int_a^b f(x) \, p(x) \, dx \approx \sum_{k=1}^m c_k \, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n | y_{0:n})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of the grid-based method is that it requires that the state space cannot be partitioned unevenly to give a greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_X x^k \, p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are the samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^m c_j \, \mathcal{N}(\hat{x}_j, \Sigma_j)$$

where the weighting coefficients $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along the time according to the state equation, with evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$$p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n | y_{0:n})$.

By this approximation we can estimate the mean of a nonlinear function

$$E[f(x_n)] \approx \int f(x_n) \, \hat{p}(x_n | y_{0:n}) \, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \, \delta(x_n - x_n^{(i)}) \, dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution $q(x_n | y_{0:n})$. Let's define

$$W_n(x_n) = \frac{p(y_{0:n} | x_n) \, p(x_n)}{q(x_n | y_{0:n})}$$

We can then write

$$E[f(x_n)] = \int f(x_n) \, \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} \, q(x_n | y_{0:n}) \, dx_n = \int f(x_n) \, \frac{W_n(x_n)}{p(y_{0:n})} \, q(x_n | y_{0:n}) \, dx_n$$

$$= \frac{\int f(x_n) \, W_n(x_n) \, q(x_n | y_{0:n}) \, dx_n}{\int p(y_{0:n} | x_n) \, p(x_n) \, dx_n} = \frac{\int f(x_n) \, W_n(x_n) \, q(x_n | y_{0:n}) \, dx_n}{\int W_n(x_n) \, q(x_n | y_{0:n}) \, dx_n} = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n | y_{0:n})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) \, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) \, f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})$$

As before, the posterior can be written as

$$p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \, \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})} \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) \, q(x_{0:n-1}^{(i)} | y_{0:n-1})} = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n | y_{0:n})$ than in $p(x_{0:n} | y_{0:n})$.

Let's assume that $q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 26: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Recursive Bayesian estimation II

The posterior densisty is described with three terms

p(xn|yn0) =p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

Prior defines the knowledge of the model

p(xn|ynminus10) =

intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1

Likelihood p(yn|xn) determines the measurement noise model

Evidence which involves

p(yn|ynminus10) =

intp(yn|xn)p(xn|ynminus10)dxn

We need to define a criteria for optimal filtering

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

$x_{n+1} = f(x_n, w_n)$
$y_n = g(x_n, v_n)$

When the dynamic system is linear Gaussian, this reduces to
$x_{n+1} = F_{n+1,n} x_n + w_n$
$y_n = G_n x_n + v_n$
with $F_{n+1,n}$ the transition matrix and $G_n$ the measurement matrix

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

$\alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n)$

Using the HMM recursion formulas for continuous variables we have
$c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1}) \, p(z_n|z_{n-1}) \, dz_{n-1}$

Substituting the conditionals we have
$c_n \mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|C z_n, \Sigma) \int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1}) \, \mathcal{N}(z_n|A z_{n-1}, \Gamma) \, dz_{n-1}$
$= \mathcal{N}(x_n|C z_n, \Sigma) \, \mathcal{N}(z_n|A\mu_{n-1}, P_{n-1})$

Here we assume that $\mu_{n-1}$ and $V_{n-1}$ are known, and we have defined
$P_{n-1} = A V_{n-1} A^T + \Gamma$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of $\mu_{n-1}$, $V_{n-1}$ and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$ having mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$
$\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})$
$V_n = (I - K_n C) P_{n-1}$
$c_n = \mathcal{N}(x_n|CA\mu_{n-1}, \, CP_{n-1}C^T + \Sigma)$
where the Kalman gain matrix is defined as
$K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}$

The initial conditions are given by
$\mu_1 = \mu_0 + K_1(x_1 - C\mu_0) \qquad V_1 = (I - K_1 C) V_0$
$c_1 = \mathcal{N}(x_1|C\mu_0, \, CV_0C^T + \Sigma) \qquad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}$

Interpretation: we are making predictions and doing corrections with $K_n$

The likelihood can be computed as $p(X) = \prod_{n=1}^N c_n$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
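A minimal sketch of this forward pass, assuming the same notation as above (A, C for the transition and emission matrices, Γ, Σ for the process and measurement noise covariances); the function and the toy model at the bottom are my own illustrative choices, not part of the lecture.

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    # X: (T, d_obs) observations x_1..x_T; returns filtered means, covariances and log-likelihood
    T, d = X.shape[0], A.shape[0]
    mus, Vs, loglik = [], [], 0.0
    for n in range(T):
        if n == 0:
            pred_mu, P = mu0, V0                                  # initial conditions use mu_0, V_0
        else:
            pred_mu, P = A @ mus[-1], A @ Vs[-1] @ A.T + Gamma    # P_{n-1} = A V_{n-1} A^T + Gamma
        S = C @ P @ C.T + Sigma                                   # innovation covariance C P C^T + Sigma
        K = P @ C.T @ np.linalg.inv(S)                            # Kalman gain K_n
        resid = X[n] - C @ pred_mu
        mus.append(pred_mu + K @ resid)                           # prediction corrected with K_n
        Vs.append((np.eye(d) - K @ C) @ P)
        loglik += -0.5 * (resid @ np.linalg.solve(S, resid)
                          + np.linalg.slogdet(2 * np.pi * S)[1])  # accumulate log c_n
    return np.array(mus), np.array(Vs), loglik

# toy usage on a 2-D constant-velocity model with 1-D observations
A = np.array([[1.0, 1.0], [0.0, 1.0]]); C = np.array([[1.0, 0.0]])
Gamma, Sigma = 0.01 * np.eye(2), np.array([[0.5]])
X = np.cumsum(np.random.randn(50, 1), axis=0)
mus, Vs, ll = kalman_forward(X, A, C, Gamma, Sigma, np.zeros(2), np.eye(2))
```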

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world

Nonlinear filtering consists in finding $p(x|y_{n:0})$

The number of variables is infinite, but not all of them are of equal importance

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

$x_{n+1} = f(x_n, w_n)$
$y_n = g(x_n, v_n)$

These equations are linearized in the EKF:
$F_{n+1,n} = \left.\frac{df(x)}{dx}\right|_{x=\hat{x}_n} \qquad G_{n+1,n} = \left.\frac{dg(x)}{dx}\right|_{x=\hat{x}_{n|n-1}}$

Then the conventional Kalman filter can be employed

Because the EKF always approximates the posterior $p(x_n|y_{n:0})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
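A small sketch of one EKF cycle in this spirit; the finite-difference Jacobian helper and the function names are my own additions (an analytic Jacobian would normally be preferred).

```python
import numpy as np

def num_jacobian(func, x, eps=1e-6):
    # finite-difference Jacobian of func at x
    fx = np.atleast_1d(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - fx) / eps
    return J

def ekf_step(mu, V, y, f, g, Gamma, Sigma):
    # predict: propagate the mean through f, linearizing f around the previous estimate
    F = num_jacobian(f, mu)
    mu_pred = f(mu)
    P = F @ V @ F.T + Gamma
    # update: linearize the measurement function g around the prediction, then do a Kalman update
    G = num_jacobian(g, mu_pred)
    S = G @ P @ G.T + Sigma
    K = P @ G.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (np.atleast_1d(y) - np.atleast_1d(g(mu_pred)))
    V_new = (np.eye(mu.size) - K @ G) @ P
    return mu_new, V_new
```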

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It's a brute-force technique: provided that one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$,
$\int_X f(x) \, dP(x) \approx \frac{1}{N}\sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$
for which $E[\hat{f}_N] = E[f]$ and $Var[\hat{f}_N] = \frac{1}{N}Var[f] = \frac{\sigma^2}{N}$

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ with high probability

The convergence rate is assessed by the Central Limit Theorem
$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$
where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$

An important property is that the estimation accuracy is independent of the dimensionality of the state space

The variance of the estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
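A small sketch of the plain Monte Carlo estimate and its $O(N^{-1/2})$ error behaviour; the integrand and the sampling distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x) ** 2                 # arbitrary test function
for N in (10**2, 10**4, 10**6):
    x = rng.normal(size=N)                   # i.i.d. samples from P(x) = N(0, 1)
    est = f(x).mean()                        # f_N = (1/N) sum_i f(x^(i))
    se = f(x).std(ddof=1) / np.sqrt(N)       # CLT standard error, roughly sigma / sqrt(N)
    print(f"N={N:>8}  estimate={est:.5f}  std.err~{se:.5f}")
```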

Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate
$\int_X f(x) \, dP(x) \approx \frac{1}{N}\sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$

There are two fundamental problems

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. $E[f(x)] = \int f(x) \, dP(x)$?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators

Robustness: an estimator is robust if it is insensitive to the gross measurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis–Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:
$\int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx$

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses $N$ independent samples drawn from $q(x)$ to approximate
$\hat{f} = \frac{1}{N}\sum_{i=1}^N W(x^{(i)}) f(x^{(i)})$
where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
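A minimal sketch of (self-normalized) importance sampling for an expectation under a target density known only up to a constant; the particular target, proposal and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
log_p_unnorm = lambda x: -0.5 * ((x - 3.0) / 0.7) ** 2   # unnormalized target: proportional to N(3, 0.7^2)
f = lambda x: x                                          # we estimate E_p[x] (should be ~3)

N = 50_000
x = rng.normal(0.0, 3.0, size=N)                         # proposal q = N(0, 3^2), heavier than the target
log_q = -0.5 * (x / 3.0) ** 2 - np.log(3.0)              # log q up to a constant (constants cancel below)
log_w = log_p_unnorm(x) - log_q                          # log importance weights W = p/q (up to a constant)
w = np.exp(log_w - log_w.max())                          # subtract the max for numerical stability
w_tilde = w / w.sum()                                    # normalized importance weights
print("IS estimate of E_p[x]:", float(np.sum(w_tilde * f(x))))
```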

Importance Sampling II

If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant

To ensure that they sum to one, the importance weights are normalized:
$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}$

The variance of the estimate is given by
$Var[\hat{f}] = \frac{1}{N} Var[f(x)W(x)] = \frac{1}{N} Var\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \left(\frac{f(x)p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}$

The variance can be reduced when $q(x)$ is chosen to
match the shape of $p(x)$, so as to approximate the true variance
match the shape of $|f(x)|p(x)$, so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance)

It can be used when it is difficult to sample from the true probability distribution directly

The proposal distribution $q(x)$ should have a heavy tail, so as to be insensitive to outliers

If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights

The importance sampler can be mixed with Gibbs sampling or the Metropolis–Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density

Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in X$; the sampling proceeds as

for n = 1 to N do
  Sample $u \sim U(0, 1)$
  Sample $x \sim q(x)$
  if $u > \frac{p(x)}{Cq(x)}$ then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant $C$

The choice of $C$ is critical:
if $C$ is too small, the samples are not reliable because of the low rejection rate;
if $C$ is too large, sampling is inefficient since the acceptance rate will be low

If the prior $p(x)$ is used as $q(x)$, the likelihood $p(y|x) \le C$, and $C$ is known, then
$p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$
and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
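A small sketch of this acceptance test, vectorized over many draws; the mixture target, the Gaussian proposal and the bound C = 4 are my own illustrative choices (C must satisfy p(x) < Cq(x) everywhere).

```python
import numpy as np

rng = np.random.default_rng(3)

def p(x):   # target: a two-component Gaussian mixture
    return (0.6 * np.exp(-0.5 * (x + 1.5) ** 2) / np.sqrt(2 * np.pi)
            + 0.4 * np.exp(-0.5 * ((x - 2.0) / 0.8) ** 2) / (0.8 * np.sqrt(2 * np.pi)))

def q(x):   # proposal: a wide Gaussian N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

C = 4.0                                        # assumed bound with p(x) < C q(x)
x = rng.normal(0.0, 3.0, size=200_000)         # sample x ~ q(x)
u = rng.uniform(size=x.size)                   # sample u ~ U(0, 1)
accepted = x[u <= p(x) / (C * q(x))]           # keep x only when u <= p(x) / (C q(x))
print("acceptance rate ~", accepted.size / x.size)   # roughly 1/C for normalized p and q
```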

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge of the constant $C$, which is sometimes unavailable

It usually takes a long time to get the samples when the ratio $p(x)/(Cq(x))$ is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling

If the proposal distribution is chosen in a factorized form
$q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$
then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

$p(x_{n:0}) = p(x_0)p(x_1|x_0)\cdots p(x_n|x_0, \cdots, x_{n-1})$
$q(x_{n:0}) = q_0(x_0)q_1(x_1|x_0)\cdots q_n(x_n|x_0, \cdots, x_{n-1})$

The weights can be recursively calculated as
$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero

We will see now that, in order to cope with this situation, a resampling step is used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf

The resampling step is aimed to eliminate the samples with small importance weights and duplicate the samples with big weights

Sample $N$ random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for i = 1, ..., N do
  $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
  Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}$
end for
Resample with replacement $N$ times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
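A minimal sketch of the resample-with-replacement step (multinomial resampling); the function name and the array layout (particles stacked along the first axis) are my own conventions.

```python
import numpy as np

def multinomial_resample(particles, weights, rng=None):
    # particles: array of shape (N, ...); weights: length-N, proportional to the importance weights
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                            # normalize
    idx = rng.choice(len(w), size=len(w), replace=True, p=w)   # indices drawn with prob. proportional to w
    return particles[idx], np.full(len(w), 1.0 / len(w))       # resampled particles, equal weights 1/N
```

Systematic or stratified resampling are common alternatives that introduce less extra Monte Carlo variation.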

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary

Deterministic resampling is taken at every $k$-th time step (usually $k = 1$). Dynamic resampling is taken only when the variance of the importance weights is over a threshold

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e. $W^{(i)} = 1/N$)

Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
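A small sketch of this alternating conditional sampling for a two-dimensional zero-mean Gaussian with correlation ρ, where both conditionals are available in closed form; the value of ρ and the chain length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, M = 0.8, 20_000
chain = np.zeros((M, 2))
x1, x2 = 0.0, 0.0                                      # initial state x_0
for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # draw x1 ~ p(x1 | x2) = N(rho*x2, 1-rho^2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # draw x2 ~ p(x2 | x1) = N(rho*x1, 1-rho^2)
    chain[n] = x1, x2
print("sample correlation ~", np.corrcoef(chain[M // 2:].T)[0, 1])   # close to rho after burn-in
```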

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space

Stratified sampling works very well and is efficient in a not-too-high-dimensional space

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior

This is good since the gradient direction might indicate the way to find states with higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method

The Laplace approximation method approximates the integral of a function $\int f(x) \, dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume
$\int f(x) \, dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
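A minimal one-dimensional sketch of this Laplace approximation, with the mode found on a grid and the second derivative of log f taken numerically; the test function is an unnormalized Gaussian so the exact integral is known.

```python
import numpy as np

f = lambda x: np.exp(-0.5 * ((x - 1.0) / 2.0) ** 2)    # unnormalized Gaussian; exact integral = 2*sqrt(2*pi)

grid = np.linspace(-20.0, 20.0, 400_001)               # crude 1-D mode search
x_hat = grid[np.argmax(f(grid))]

h = 1e-4                                               # numerical second derivative of log f at the mode
d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2

laplace = np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** -0.5   # (2*pi)^{Nx/2} f(x_hat) |-∇∇ log f(x_hat)|^{-1/2}
print(laplace, 2.0 * np.sqrt(2.0 * np.pi))                # approximation vs. exact value (they agree here)
```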

Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula
$\int_a^b f(x) p(x) \, dx \approx \sum_{k=1}^m c_k f(x_k)$
where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$

This method can produce a good approximation if the nonlinear function is smooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
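A small sketch of such a rule: Gauss–Hermite nodes and weights (available in numpy) approximate an expectation under a Gaussian weighting function p(x) = N(0, 1); the integrand is an arbitrary smooth test function.

```python
import numpy as np

f = lambda x: np.cos(x)                                   # arbitrary smooth integrand
nodes, weights = np.polynomial.hermite.hermgauss(20)      # 20-point rule for ∫ g(t) exp(-t^2) dt

# the change of variables x = sqrt(2) t gives E_{N(0,1)}[f(x)] ≈ (1/sqrt(pi)) Σ_k w_k f(sqrt(2) t_k)
approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
print(approx, np.exp(-0.5))                               # exact value of E[cos(x)] under N(0,1) is exp(-1/2)
```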

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; then a grid-based method can be further used to approximate the posterior density

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly to give a greater resolution to the states with high density

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments

We can empirically use the sample moments to approximate the true moments, namely
$m_k = E[x^k] = \int_X x^k p(x) \, dx \approx \frac{1}{N}\sum_{i=1}^N |x^{(i)}|^k$
where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)
$p(x) = \sum_{j=1}^m c_j \, \mathcal{N}(\hat{x}_j, \hat{\Sigma}_j)$
where the weighting coefficients $c_j > 0$ and $\sum_{j=1}^m c_j = 1$

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated

The particle system evolves over time according to the state equation, with the evolving pdf determined by the Fokker–Planck–Kolmogorov (FPK) equation

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution
$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$
where the $x_n^{(i)}$ are assumed to be i.i.d. draws from $p(x_n|y_{n:0})$

By this approximation, we can estimate the mean of a nonlinear function
$E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{n:0}) \, dx_n = \frac{1}{N}\sum_{i=1}^N \int f(x_n)\delta(x_n - x_n^{(i)}) \, dx_n = \frac{1}{N}\sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution $q(x_n|y_{n:0})$. Let's define
$W_n(x_n) = \frac{p(y_{n:0}|x_n)p(x_n)}{q(x_n|y_{n:0})}$

We can then write
$E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) \, dx_n$
$= \int f(x_n)\frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) \, dx_n$
$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0}) \, dx_n}{\int p(y_{n:0}|x_n)p(x_n) \, dx_n}$
$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0}) \, dx_n}{\int W_n(x_n)q(x_n|y_{n:0}) \, dx_n}$
$= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written
$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$, we can approximate
$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)})f(x_n^{(i)}) \equiv \hat{f}(x)$
where the normalized weights are defined as
$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes
$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0})$

As before, the posterior can be written as
$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$

We can then create a recursive rule to update the weights
$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write
$W_n^{(i)} = W_{n-1}^{(i)}\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size
$N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N$

In practice this cannot be computed, so we approximate
$\hat{N}_{eff} \approx \frac{1}{\sum_{i=1}^N (W(x_{n:0}^{(i)}))^2}$

When $\hat{N}_{eff}$ is below a threshold $P$, then resampling is performed

$N_{eff}$ can also be used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
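A short sketch of the approximation, assuming the weights passed in are already normalized.

```python
import numpy as np

def effective_sample_size(normalized_weights):
    # N_eff ≈ 1 / Σ_i W_i^2: equals N for uniform weights, and 1 when a single weight dominates
    w = np.asarray(normalized_weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))          # 100.0 (no degeneracy)
print(effective_sample_size(np.r_[1.0, np.zeros(99)]))    # 1.0   (fully degenerate)
```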

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}$
  end for
  Compute $\hat{N}_{eff} = \frac{1}{\sum_{i=1}^N (\tilde{W}_n^{(i)})^2}$
  if $\hat{N}_{eff} < P$ then
    Generate new $x_n^{(j)}$ by resampling with replacement $N$ times from $\{x_{n:0}^{(i)}\}$, with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}$
    Reset the weights $W_n^{(i)} = \frac{1}{N}$
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $\hat{x}_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $\hat{x}_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}$
  end for
  Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{\hat{x}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
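Putting the pieces together, here is a minimal sketch of the bootstrap filter above for a generic state-space model given by a prior sampler, a transition sampler and an observation likelihood; the function names and the 1-D nonlinear toy model at the bottom are my own illustrative assumptions, not part of the lecture.

```python
import numpy as np

def bootstrap_filter(Y, sample_x0, sample_transition, obs_likelihood, N=1000, rng=None):
    # Y: sequence of observations; returns the filtered posterior mean at every time step
    rng = np.random.default_rng() if rng is None else rng
    x = sample_x0(N, rng)                                # x_0^(i) ~ p(x_0), W_0^(i) = 1/N
    means = []
    for y in Y:
        x = sample_transition(x, rng)                    # importance sampling from the transition prior
        w = obs_likelihood(y, x)                         # weight update W_n^(i) = p(y_n | x_n^(i))
        w = w / w.sum()                                  # normalize the weights
        means.append((w[:, None] * x).sum(axis=0))       # posterior mean estimate (before resampling)
        x = x[rng.choice(N, size=N, replace=True, p=w)]  # resample N particles according to the weights
    return np.array(means)

# illustrative 1-D nonlinear / non-Gaussian model (assumed for the demo):
#   x_n = 0.5 x_{n-1} + 25 x_{n-1} / (1 + x_{n-1}^2) + w_n,    y_n = x_n^2 / 20 + v_n
sample_x0 = lambda N, rng: rng.normal(0.0, 2.0, size=(N, 1))
sample_transition = lambda x, rng: 0.5 * x + 25.0 * x / (1.0 + x ** 2) + rng.normal(size=x.shape)
obs_likelihood = lambda y, x: np.exp(-0.5 * (y - x[:, 0] ** 2 / 20.0) ** 2)   # unit observation noise

rng = np.random.default_rng(5)
true_x, Y = [rng.normal(0.0, 2.0)], []
for _ in range(50):
    true_x.append(0.5 * true_x[-1] + 25.0 * true_x[-1] / (1.0 + true_x[-1] ** 2) + rng.normal())
    Y.append(true_x[-1] ** 2 / 20.0 + rng.normal())
means = bootstrap_filter(np.array(Y), sample_x0, sample_transition, obs_likelihood, N=2000, rng=rng)
```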

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is less computationally expensive

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
$W_n = W_{n-1}^{\alpha}\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more, look at the additional material

Otherwise, do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 27: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69


Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

f̂ = (1/N) ∑_{i=1}^{N} W(x^(i)) f(x^(i))

where W(x^(i)) = p(x^(i))/q(x^(i)) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights sum to one, they are normalized:

f̂ = ∑_{i=1}^{N} W̃(x^(i)) f(x^(i)),  with  W̃(x^(i)) = W(x^(i)) / ∑_{j=1}^{N} W(x^(j))

The variance of the estimate is given by

Var[f̂] = (1/N) Var[f(x) W(x)] = (1/N) Var[f(x) p(x)/q(x)]
        = (1/N) ∫ ( f(x) p(x) / q(x) )² q(x) dx − (E[f(x)])² / N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variance
match the shape of |f(x)| p(x) so as to further reduce the true variance

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
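A small self-normalized importance sampling sketch in Python; the target p = N(0,1), the heavier-tailed proposal q = N(0,2²) and the test function f(x) = x² are illustrative assumptions, not from the lecture:

import numpy as np

rng = np.random.default_rng(0)
N = 50_000
f = lambda x: x**2

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=N)                      # x^(i) ~ q (heavy-tailed proposal)
W = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance weights W = p/q
W_tilde = W / W.sum()                                 # normalized weights
print(np.sum(W_tilde * f(x)))                         # close to E_p[x^2] = 1

The heavy-tailed proposal keeps the weights bounded, in line with the remark below that q(x) should have a heavier tail than p(x).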

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to the outliers.

If q(·) is not close to p(·), the weights are very uneven, and thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < Cq(x) for every x ∈ X. The sampling proceeds as:

for n = 1 to N do
  Sample u ∼ U(0, 1)
  Sample x ∼ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because the bound p(x) < Cq(x) may be violated (low rejection rate);
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

p(x|y) = p(y|x) p(x) / p(y) ≤ C q(x) / p(y) ≡ C′ q(x)

and the acceptance rate for a sample x is p(x|y) / (C′ q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
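A minimal rejection sampling sketch in Python; the triangular target p(x) = 2x on [0, 1] and the uniform proposal are illustrative assumptions, chosen so that C = 2 clearly satisfies p(x) ≤ C q(x):

import numpy as np

rng = np.random.default_rng(0)
p = lambda x: 2.0 * x          # target density on [0, 1], maximum value 2
q = lambda x: 1.0              # proposal density U(0, 1)
C = 2.0                        # p(x) <= C q(x) for all x in [0, 1]

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0, 1)                  # x ~ q
        u = rng.uniform(0, 1)                  # u ~ U(0, 1)
        if u <= p(x) / (C * q(x)):             # accept with probability p/(Cq)
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(10_000)
print(xs.mean())   # close to E_p[x] = 2/3; about half of the draws are rejected since 1/C = 0.5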

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) ∏_{t=1}^{n} q(x_t | x_{t−1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1|x_0) · · · p(x_n|x_0, · · · , x_{n−1})
q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) · · · q_n(x_n|x_0, · · · , x_{n−1})

The weights can be recursively calculated as

W_n(x_{n:0}) = p(x_{n:0}) / q(x_{n:0}) = W_{n−1}(x_{n−1:0}) · p(x_n|x_{n−1:0}) / q_n(x_n|x_{n−1:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^(i)) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^(i)}_{i=1}^{N} from q(x)
for i = 1, · · · , N do
  W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, · · · , N do
  Normalize the weights W̃(x^(i)) = W(x^(i)) / ∑_{j=1}^{N} W(x^(j))
end for
Resample with replacement N times from the discrete set {x^(i)}_{i=1}^{N}, where the probability of resampling each x^(i) is proportional to W̃(x^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
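A sketch of the resampling step in Python (multinomial resampling; as noted in the remarks that follow, this is only one of many possible resampling schemes, and the dummy particles and weights at the bottom are illustrative):

import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    # Multinomial resampling: draw N indices with probability proportional to the weights
    N = len(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize W -> W_tilde
    idx = rng.choice(N, size=N, p=w)              # resample with replacement
    return particles[idx], np.full(N, 1.0 / N)    # equal weights after resampling

# usage sketch with dummy particles and skewed weights
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
W = np.exp(-0.5 * x**2)                           # some unnormalized weights
x_new, W_new = resample(x, W, rng)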

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary.

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^(i), W^(i)} are replaced by the new samples with equal importance weights (i.e. W^(i) = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, · · · , x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ∼ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ∼ p(x_i | x_{1,n}, · · · , x_{i−1,n}, x_{i+1,n−1}, · · · , x_{N_x,n−1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n−1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
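A minimal Gibbs sampler sketch in Python for a two-dimensional Gaussian, alternating draws from the two conditionals exactly as in the loop and figure above; the correlation value rho is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the target 2-D Gaussian (illustrative)
M = 5000
x = np.zeros((M + 1, 2))       # x[n] = (x1_n, x2_n), started from (0, 0)

for n in range(1, M + 1):
    # p(x1 | x2) = N(rho * x2, 1 - rho^2) for a standard bivariate Gaussian
    x[n, 0] = rng.normal(rho * x[n - 1, 1], np.sqrt(1 - rho**2))
    # p(x2 | x1) = N(rho * x1, 1 - rho^2), conditioning on the freshly drawn x1
    x[n, 1] = rng.normal(rho * x[n, 0], np.sqrt(1 - rho**2))

print(np.corrcoef(x[1000:, 0], x[1000:, 1])[0, 1])   # approximately rho after burn-in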

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be taken with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | −∇∇ log f(x̂) |^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
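A one-dimensional Laplace approximation sketch in Python; the integrand f(x) = x³ e^{−x} on (0, ∞) is an illustrative assumption whose exact integral is 3! = 6, and the curvature of log f at its maximum is obtained by a simple finite difference:

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: x**3 * np.exp(-x)          # illustrative positive integrand, exact integral = 6
logf = lambda x: 3 * np.log(x) - x

# find the maximum x_hat of f (equivalently of log f)
res = minimize_scalar(lambda x: -logf(x), bounds=(1e-6, 50), method="bounded")
x_hat = res.x                            # close to 3

# second derivative of log f at x_hat by central differences (the 1-D "Hessian")
h = 1e-4
d2 = (logf(x_hat + h) - 2 * logf(x_hat) + logf(x_hat - h)) / h**2

# Laplace approximation with N_x = 1: (2*pi)^{1/2} f(x_hat) |d2|^{-1/2}
approx = np.sqrt(2 * np.pi) * f(x_hat) * np.abs(d2) ** -0.5
print(approx)                            # about 5.84, versus the exact value 6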


Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ ∑_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give a greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) ∑_{i=1}^{N} (x^(i))^k

where m_k denotes the k-th order moment and the x^(i) are the samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = ∑_{j=1}^{m} c_j N(x; μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and ∑_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with the evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) ≈ (1/N) ∑_{i=1}^{N} δ(x_n − x_n^(i)) ≡ p̂(x_n|y_{n:0})

where the x_n^(i) are assumed to be i.i.d. drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n|y_{n:0}) dx_n
          = (1/N) ∑_{i=1}^{N} ∫ f(x_n) δ(x_n − x_n^(i)) dx_n
          = (1/N) ∑_{i=1}^{N} f(x_n^(i)) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0}|x_n) p(x_n) / q(x_n|y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) [ p(x_n|y_{n:0}) / q(x_n|y_{n:0}) ] q(x_n|y_{n:0}) dx_n
          = ∫ f(x_n) [ W_n(x_n) / p(y_{n:0}) ] q(x_n|y_{n:0}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n / ∫ p(y_{n:0}|x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n|y_{n:0}) dx_n
          = E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{n:0})}[W_n(x_n)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{n:0})}[W_n(x_n)]

By drawing i.i.d. samples x_n^(i) from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] ≈ [ (1/N) ∑_{i=1}^{N} W_n(x_n^(i)) f(x_n^(i)) ] / [ (1/N) ∑_{i=1}^{N} W_n(x_n^(i)) ]
          = ∑_{i=1}^{N} W̃(x_n^(i)) f(x_n^(i)) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^(i)) = W_n(x_n^(i)) / ∑_{j=1}^{N} W_n(x_n^(j))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0) ∏_{t=1}^{n} q(x_t | x_{t−1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n−1:0}|y_{n−1:0}) · p(y_n|x_n) p(x_n|x_{n−1}) / p(y_n|y_{n−1:0})

We can then create a recursive rule to update the weights:

W_n^(i) = p(x_{n:0}^(i)|y_{n:0}) / q(x_{n:0}^(i)|y_{n:0})
        ∝ p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) p(x_{n−1:0}^(i)|y_{n−1:0}) / [ q(x_n^(i)|x_{n−1:0}^(i), y_{n:0}) q(x_{n−1:0}^(i)|y_{n−1:0}) ]
        = W_{n−1}^(i) · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
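A sketch of this recursive weight update in Python; the likelihood, transition and proposal densities are passed in as callables whose concrete forms are assumptions supplied by the user of the sketch:

import numpy as np

def update_weights(W_prev, x_new, x_prev, y, likelihood, transition, proposal):
    # W_n^(i) = W_{n-1}^(i) * p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1}^(i), y_n)
    W = W_prev * likelihood(y, x_new) * transition(x_new, x_prev) / proposal(x_new, x_prev, y)
    return W / W.sum()          # return the normalized weights W_tilde

# usage note: with the bootstrap choice q = p(x_n|x_{n-1}) the transition and proposal terms
# cancel, so the update reduces to multiplying the old weights by the likelihood p(y_n|x_n^(i)).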

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^(i)|x_{n−1:0}^(i), y_{n:0}) = q(x_n^(i)|x_{n−1:0}^(i), y_n); then we can write

W_n^(i) = W_{n−1}^(i) · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_eff = N / E_{q(·|y_{n:0})}[(W(x_{n:0}))²] ≤ N

In practice this cannot be computed, so we approximate it as

N̂_eff ≈ 1 / ∑_{i=1}^{N} (W̃(x_{n:0}^(i)))²

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
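The effective sample size estimate is essentially one line of code; a Python sketch (the threshold P = N/2 is an illustrative assumption):

import numpy as np

def effective_sample_size(W_tilde):
    # N_eff ~= 1 / sum_i W_tilde_i^2, computed from the normalized weights
    W_tilde = np.asarray(W_tilde)
    return 1.0 / np.sum(W_tilde**2)

W_tilde = np.array([0.7, 0.1, 0.1, 0.05, 0.05])    # skewed weights
print(effective_sample_size(W_tilde))              # about 1.9 out of N = 5
resample_now = effective_sample_size(W_tilde) < 0.5 * len(W_tilde)   # threshold P = N/2 (assumed)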

SIS particle filter with resampling

for n = 0, · · · , T do
  for i = 1, · · · , N do
    Draw sample x_n^(i) ∼ q(x_n | x_{n−1:0}^(i), y_{n:0})
    Set x_{n:0}^(i) = {x_{n−1:0}^(i), x_n^(i)}
  end for
  for i = 1, · · · , N do
    Calculate weights W_n^(i) = W_{n−1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_n)
  end for
  for i = 1, · · · , N do
    Normalize the weights W̃(x^(i)) = W(x^(i)) / ∑_{j=1}^{N} W(x^(j))
  end for
  Compute N̂_eff = 1 / ∑_{i=1}^{N} (W̃(x_{n:0}^(i)))²
  if N̂_eff < P then
    Generate new x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W̃_{n:0}^(i)
    Reset the weights W_n^(i) = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
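A direct Python transcription of this loop, as a sketch for a scalar state; the model densities and the proposal sampler are passed in as callables, so their concrete forms are assumptions left to the user:

import numpy as np

def sis_filter(y_seq, x0_particles, sample_q, q_pdf, trans_pdf, lik_pdf,
               threshold_frac=0.5, rng=np.random.default_rng()):
    x = np.array(x0_particles, dtype=float)          # particles x_0^(i)
    N = len(x)
    W = np.full(N, 1.0 / N)
    means = []
    for y in y_seq:
        x_prev = x.copy()
        x = sample_q(x_prev, y, rng)                 # draw x_n^(i) ~ q(.|x_{n-1}^(i), y_n)
        W = W * lik_pdf(y, x) * trans_pdf(x, x_prev) / q_pdf(x, x_prev, y)
        W = W / W.sum()                              # normalize the weights
        if 1.0 / np.sum(W**2) < threshold_frac * N:  # N_eff below the threshold P
            idx = rng.choice(N, size=N, p=W)         # resample with replacement
            x, W = x[idx], np.full(N, 1.0 / N)       # reset weights to 1/N
        means.append(np.sum(W * x))                  # MMSE estimate of x_n
    return np.array(means)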

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, · · · , N do
  Sample x_0^(i) ∼ p(x_0)
  Compute W_0^(i) = 1/N
end for
for n = 0, · · · , T do
  for i = 1, · · · , N do
    Importance sampling: x_n^(i) ∼ p(x_n | x_{n−1}^(i))
    Set x_{n:0}^(i) = {x_{n−1:0}^(i), x_n^(i)}
  end for
  for i = 1, · · · , N do
    Weight update: W_n^(i) = p(y_n | x_n^(i))
  end for
  for i = 1, · · · , N do
    Normalize the weights W̃(x^(i)) = W(x^(i)) / ∑_{j=1}^{N} W(x^(j))
  end for
  Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W̃_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
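A concrete bootstrap (SIR) filter sketch in Python on a toy scalar model; the model x_{n+1} = 0.9 x_n + w_n, y_n = x_n + v_n and all noise levels are illustrative assumptions, and because the transition prior is the proposal, the weight update reduces to the likelihood:

import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 1000
q_std, r_std = 1.0, 0.5                       # process / measurement noise std (assumed)

# simulate a toy trajectory and observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + q_std * rng.standard_normal()
y = x_true + r_std * rng.standard_normal(T)

# bootstrap particle filter
x = rng.normal(0, 1, size=N)                  # x_0^(i) ~ p(x_0)
est = np.zeros(T)
for n in range(T):
    x = 0.9 * x + q_std * rng.standard_normal(N)       # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = np.exp(-0.5 * ((y[n] - x) / r_std) ** 2)        # W_n^(i) = p(y_n | x_n^(i)) (unnormalized)
    W /= W.sum()
    est[n] = np.sum(W * x)                              # MMSE estimate, computed before resampling
    idx = rng.choice(N, size=N, p=W)                    # resample at every step
    x = x[idx]

print(np.mean((est - x_true) ** 2))           # filtering MSE on this toy run

Computing the estimate before resampling follows the remark a few slides below that the posterior statistics should be calculated before the resampling step.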

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in the SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n−1}^α · p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{n−1:0}^(i), y_n)

where 0 < α < 1 is an annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Media file: anim.mp4 (video/mp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Media files: dancemv.mp4, leafmv.mp4 (video/mp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Media files: hand.mp4, iccv98.mp4 (video/mp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

Media files: imp.mp4, drawing.mp4 (video/mp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions

Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of prediction or filtering error

E[ ‖x_n − x̂_n‖²₂ | y_{n:0} ] = ∫ ‖x_n − x̂_n‖²₂ p(x_n|y_{n:0}) dx_n

which is aimed to find the conditional mean

x̂_n = E[x_n|y_{n:0}] = ∫ x_n p(x_n|y_{n:0}) dx_n

Maximum a posteriori (MAP): it is aimed to find the mode of the posterior probability p(x_n|y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n|y_{n:0}).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69


Criteria for optimal filtering II

MMSE finds the mean.

MAP finds the mode.

Minimax finds the median.

Figure (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
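Given samples from a posterior, the three point estimates are easy to compare; a Python sketch in which the skewed Gamma(2, 1) "posterior" is an illustrative stand-in and the mode is read off a crude histogram:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # skewed posterior samples (assumed)

mmse = samples.mean()                          # conditional mean (MMSE estimate)
minimax = np.median(samples)                   # posterior median (minimax estimate)
counts, edges = np.histogram(samples, bins=200)
map_est = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])   # crude MAP (mode)

print(mmse, minimax, map_est)   # roughly 2.0, 1.68 and 1.0 for Gamma(2, 1)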

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

E_{p(x,y)}[ − log p̂(x|y) ] = ∫ p(x, y) log [ 1 / p̂(x|y) ] dx dy

Minimum conditional KL divergence:

KL(p‖p̂) = ∫ p(x, y) log [ p(x, y) / ( p̂(x|y) p(x) ) ] dx dy

where the KL is a measure of divergence between distributions such that KL(p‖p̂) ≥ 0. The KL is 0 only when the distributions are the same.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69


Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: it is a lower bound of the maximum log-likelihood, which is aimed to minimize

F(Q; P̂) ≡ E_{Q(x)}[ − log P̂(x|y) ]
        = E_{Q(x)}[ log Q(x) / P̂(x|y) ] − E_{Q(x)}[ log Q(x) ]
        = KL(Q‖P̂) − H(Q)

This minimization can be done using the (EM) algorithm:

Q(x_{n+1}) ← argmax_Q F(Q, P̂)
x_{n+1} ← argmax_x F(Q, P̂)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented. They are very related.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

E[ ‖x_n − x̂_n‖²₂ | y_{n:0} ] = ∫ ‖x_n − x̂_n‖²₂ p(x_n|y_{n:0}) dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general, we can only seek a suboptimal or locally optimal solution.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(z_n) = N(z_n | μ_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

c_n α(z_n) = p(x_n|z_n) ∫ α(z_{n−1}) p(z_n|z_{n−1}) dz_{n−1}

Substituting the conditionals, we have

c_n N(z_n | μ_n, V_n) = N(x_n | C z_n, Σ) ∫ N(z_{n−1} | μ_{n−1}, V_{n−1}) N(z_n | A z_{n−1}, Γ) dz_{n−1}
                      = N(x_n | C z_n, Σ) N(z_n | A μ_{n−1}, P_{n−1})

Here we assume that μ_{n−1} and V_{n−1} are known, and we have defined

P_{n−1} = A V_{n−1} Aᵀ + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of μ_{n−1}, V_{n−1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean μ_n and covariance V_n, as well as the normalization coefficient c_n:

μ_n = A μ_{n−1} + K_n (x_n − C A μ_{n−1})
V_n = (I − K_n C) P_{n−1}
c_n = N(x_n | C A μ_{n−1}, C P_{n−1} Cᵀ + Σ)

where the Kalman gain matrix is defined as

K_n = P_{n−1} Cᵀ (C P_{n−1} Cᵀ + Σ)^{−1}

The initial conditions are given by

μ_1 = μ_0 + K_1 (x_1 − C μ_0),  V_1 = (I − K_1 C) V_0
c_1 = N(x_1 | C μ_0, C V_0 Cᵀ + Σ),  K_1 = V_0 Cᵀ (C V_0 Cᵀ + Σ)^{−1}

The interpretation is making a prediction and doing corrections with K_n.

The likelihood can be computed as p(X) = ∏_{n=1}^{N} c_n.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
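These recursions translate directly into code; a minimal NumPy sketch of one forward (predict plus correct) step using the same A, Γ, C, Σ notation, where the toy matrices in the usage lines are illustrative assumptions:

import numpy as np

def kalman_step(mu_prev, V_prev, x_obs, A, Gamma, C, Sigma):
    # One forward recursion: returns mu_n, V_n and the normalization constant c_n
    P = A @ V_prev @ A.T + Gamma                       # P_{n-1}
    S = C @ P @ C.T + Sigma                            # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)                     # Kalman gain K_n
    pred = C @ A @ mu_prev                             # predicted observation
    mu = A @ mu_prev + K @ (x_obs - pred)              # corrected mean
    V = (np.eye(len(mu_prev)) - K @ C) @ P             # corrected covariance
    resid = x_obs - pred
    c = np.exp(-0.5 * resid @ np.linalg.solve(S, resid)) / np.sqrt((2 * np.pi) ** len(x_obs) * np.linalg.det(S))
    return mu, V, c

# usage on a 1-D toy model (illustrative numbers)
A = np.array([[1.0]]); Gamma = np.array([[0.1]])
C = np.array([[1.0]]); Sigma = np.array([[0.5]])
mu, V = np.array([0.0]), np.array([[1.0]])
for x_obs in [0.2, 0.4, 0.3]:
    mu, V, c = kalman_step(mu, V, np.array([x_obs]), A, Gamma, C, Sigma)
print(mu, V)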

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximation (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = ∂f(x)/∂x |_{x = x̂_n},   G_n = ∂g(x)/∂x |_{x = x̂_{n|n−1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
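A sketch of one EKF step in Python: the Jacobians of f and g are evaluated at the current estimate and then plugged into the standard Kalman update; the scalar model at the bottom is an illustrative assumption:

import numpy as np

def ekf_step(x_est, P, y_obs, f, g, F_jac, G_jac, Q, R):
    # One EKF predict/correct step with user-supplied f, g and their Jacobians
    x_pred = f(x_est)
    F = F_jac(x_est)                           # F_{n+1,n} = df/dx at x_est
    P_pred = F @ P @ F.T + Q
    G = G_jac(x_pred)                          # G_n = dg/dx at the predicted state
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y_obs - g(x_pred))
    P_new = (np.eye(len(x_new)) - K @ G) @ P_pred
    return x_new, P_new

# illustrative scalar model: f(x) = x + 0.1 sin(x), g(x) = x^2
f = lambda x: x + 0.1 * np.sin(x)
g = lambda x: x**2
F_jac = lambda x: np.array([[1 + 0.1 * np.cos(x[0])]])
G_jac = lambda x: np.array([[2 * x[0]]])
x_est, P = np.array([1.0]), np.array([[1.0]])
x_est, P = ekf_step(x_est, P, np.array([1.1]), f, g, F_jac, G_jac,
                    Q=np.array([[0.01]]), R=np.array([[0.1]]))
print(x_est, P)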

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample from $x_0 \sim p(x_0)$
for n = 1 to M do
    for i = 1 to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
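A minimal sketch of the alternating conditional sampling above, for a toy case where both conditionals are available in closed form: a zero-mean bivariate Gaussian with correlation rho. The model and its parameters are my own example, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 10000                    # correlation and number of Gibbs sweeps
x = np.zeros(2)                        # initial state x_0
samples = np.empty((M, 2))

for n in range(M):
    # draw x_1 from p(x_1 | x_2) and then x_2 from p(x_2 | x_1), always
    # conditioning on the most recently sampled value of the other component
    x[0] = rng.normal(rho * x[1], np.sqrt(1.0 - rho ** 2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1.0 - rho ** 2))
    samples[n] = x

print(np.corrcoef(samples[1000:].T))   # correlation should be close to rho after burn-in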

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$ and further computing the volume

$\int f(x)\,dx \approx (2\pi)^{N_x/2}\,f(\hat{x})\,|-\nabla\nabla \log f(\hat{x})|^{-1/2}$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
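As a sanity check of the formula above, here is a one-dimensional sketch ($N_x = 1$) that fits the Gaussian at the mode numerically; the integrand f and the finite-difference Hessian are my own choices for illustration.

import numpy as np

# example unnormalized, bell-shaped integrand (an assumption for this sketch)
f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2 / 0.3)
log_f = lambda x: np.log(f(x))

# locate the maximum x_hat of f on a dense grid
grid = np.linspace(-5.0, 5.0, 20001)
x_hat = grid[np.argmax(f(grid))]

# second derivative of log f at x_hat by central finite differences
h = 1e-4
d2 = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2

# Laplace estimate: (2*pi)^(1/2) * f(x_hat) * | -d^2/dx^2 log f(x_hat) |^(-1/2)
estimate = np.sqrt(2.0 * np.pi) * f(x_hat) * np.abs(-d2) ** -0.5
print(estimate)   # close to the exact integral sqrt(2*pi*0.3), since this f is Gaussian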

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula

$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\,f(x_k)$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
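For a Gaussian weighting function p(x) the quadrature points and coefficients are given by the Gauss-Hermite rule; the sketch below approximates E[f(x)] for x ~ N(mu, sigma^2) with NumPy's hermgauss. The smooth test function and the parameter values are arbitrary choices for illustration.

import numpy as np
from numpy.polynomial.hermite import hermgauss

f = np.tanh                      # example smooth nonlinear function
mu, sigma, m = 0.5, 1.2, 10      # Gaussian weighting N(mu, sigma^2), m quadrature points

nodes, coeffs = hermgauss(m)                 # quadrature points x_k and coefficients c_k
x_k = mu + np.sqrt(2.0) * sigma * nodes      # change of variables for N(mu, sigma^2)
approx = np.sum(coeffs * f(x_k)) / np.sqrt(np.pi)
print(approx)                                # compare against a brute-force Monte Carlo estimate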

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

$m_k = E[x^k] = \int_X x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N |x^{(i)}|^k$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
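The sample-moment formula above is straightforward in code; a small sketch follows, where the sampling distribution is an arbitrary choice of mine.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100000)            # samples x^(i) from the "true" distribution

for k in (1, 2, 3, 4):                     # k-th order sample moments (1/N) sum |x^(i)|^k
    print(k, np.mean(np.abs(x) ** k))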

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

$p(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(\hat{x}_j, \Sigma_j)$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
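To illustrate the mixture form above, here is a short sketch that evaluates a three-component Gaussian sum on a grid and checks that it integrates to one; the weights, means and variances are my own example values.

import numpy as np

def gaussian_sum_pdf(x, c, mu, sigma):
    # p(x) = sum_j c_j N(x; mu_j, sigma_j^2), evaluated pointwise (1-D case)
    comps = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comps @ c

c = np.array([0.5, 0.3, 0.2])              # c_j > 0 and they sum to 1
mu = np.array([-1.0, 0.5, 2.0])
sigma = np.array([0.6, 0.8, 1.2])

x = np.linspace(-6.0, 7.0, 2000)
p = gaussian_sum_pdf(x, c, mu, sigma)
print(np.trapz(p, x))                      # approximately 1.0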

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.

By this approximation, we can estimate the mean of a nonlinear function

$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^N \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^N f(x_n^{(i)}) \equiv \bar{f}_N(x)$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{n:0})$. Let's define

$W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}$

We can then write

$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$

$\qquad = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$

$\qquad = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n}$

$\qquad = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}$

$\qquad = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$ we can approximate

$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$

where the normalized weights are defined as

$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
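In code, the ratio above is the familiar self-normalized importance sampling estimate. A minimal sketch with a toy unnormalized posterior, proposal and test function (all three are assumptions made for this example, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)

unnorm_post = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) * (1.0 + 0.5 * np.cos(x)) ** 2
q_pdf = lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
f = lambda x: x ** 2

N = 20000
x = rng.normal(0.0, 2.0, size=N)           # x^(i) drawn i.i.d. from q
W = unnorm_post(x) / q_pdf(x)              # unnormalized weights W_n(x^(i))
W_tilde = W / W.sum()                      # normalized weights
print(np.sum(W_tilde * f(x)))              # estimate of E[f(x_n)]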

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$

As before, the posterior can be written as

$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$

We can then create a recursive rule to update the weights

$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N$

In practice this cannot be computed, so we approximate it as

$N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}$

When $N_{eff}$ falls below a threshold P, resampling is performed.

$N_{eff}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
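A small sketch of the approximation above and of the resampling test; the example weights and the threshold P = N/2 are arbitrary illustrative choices.

import numpy as np

def effective_sample_size(W_tilde):
    # N_eff approximated as 1 / sum_i W_tilde_i^2 for normalized weights
    return 1.0 / np.sum(W_tilde ** 2)

W_tilde = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # skewed normalized weights (example)
N = len(W_tilde)
n_eff = effective_sample_size(W_tilde)
print(n_eff)                                         # much smaller than N = 5
if n_eff < N / 2.0:                                  # threshold P (an assumption)
    print("resample")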

SIS particle filter with resampling

for n = 0, \cdots, T do
    for i = 1, \cdots, N do
        Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
        Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, \cdots, N do
        Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
    end for
    for i = 1, \cdots, N do
        Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})$
    end for
    Compute $N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2$
    if $N_{eff} < P$ then
        Generate new $x_{n:0}^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$, with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})$
        Reset the weights $W_n^{(i)} = 1/N$
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
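The loop above maps almost line by line onto NumPy. The sketch below runs it on a toy scalar linear-Gaussian model, using a deliberately broadened transition density as the proposal so that all three densities in the weight update appear explicitly; the model, the proposal and the threshold P = N/2 are my own choices for illustration, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# toy state-space model (an assumption):  x_n = a x_{n-1} + w_n,  y_n = x_n + v_n
a, q_std, r_std = 0.9, 0.5, 1.0
T, N = 50, 500
P = N / 2.0                                 # resampling threshold on N_eff

x_true, y = np.zeros(T), np.zeros(T)        # simulate a trajectory and observations
for n in range(1, T):
    x_true[n] = a * x_true[n - 1] + q_std * rng.standard_normal()
    y[n] = x_true[n] + r_std * rng.standard_normal()

prop_std = 1.0                              # proposal q: broadened transition (assumption)
particles = rng.normal(0.0, 1.0, size=N)    # x_0^(i)
W = np.full(N, 1.0 / N)
estimates = np.zeros(T)

for n in range(1, T):
    prop_mean = a * particles
    new_particles = rng.normal(prop_mean, prop_std)               # x_n^(i) ~ q(. | x_{n-1}^(i), y_n)
    W = W * (normal_pdf(y[n], new_particles, r_std)               # p(y_n | x_n^(i))
             * normal_pdf(new_particles, prop_mean, q_std)        # p(x_n^(i) | x_{n-1}^(i))
             / normal_pdf(new_particles, prop_mean, prop_std))    # q(x_n^(i) | ...)
    W = W / W.sum()                                               # normalize the weights
    particles = new_particles
    estimates[n] = np.sum(W * particles)                          # posterior mean, before resampling
    if 1.0 / np.sum(W ** 2) < P:                                  # N_eff below the threshold
        idx = rng.choice(N, size=N, replace=True, p=W)
        particles = particles[idx]
        W = np.full(N, 1.0 / N)                                   # reset weights to 1/N

print(np.mean((estimates[1:] - x_true[1:]) ** 2))                 # rough tracking error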

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, \cdots, N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = 1/N$
end for
for n = 0, \cdots, T do
    for i = 1, \cdots, N do
        Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
    end for
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    for i = 1, \cdots, N do
        Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
    end for
    for i = 1, \cdots, N do
        Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})$
    end for
    Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
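Because the proposal is the transition prior, the weight update above collapses to the likelihood. The sketch below applies the algorithm, resampling at every step, to a commonly used nonlinear benchmark model; the model and all parameter values are assumptions made for this illustration, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(1)

# benchmark model (assumption):  x_n = x_{n-1}/2 + 25 x_{n-1}/(1 + x_{n-1}^2) + 8 cos(1.2 n) + w_n
#                                y_n = x_n^2 / 20 + v_n,   w_n ~ N(0, 10),  v_n ~ N(0, 1)
T, N = 50, 1000
x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = (0.5 * x_true[n - 1] + 25.0 * x_true[n - 1] / (1.0 + x_true[n - 1] ** 2)
                 + 8.0 * np.cos(1.2 * n) + np.sqrt(10.0) * rng.standard_normal())
    y[n] = x_true[n] ** 2 / 20.0 + rng.standard_normal()

particles = rng.normal(0.0, 2.0, size=N)        # x_0^(i) ~ p(x_0), weights W_0^(i) = 1/N
estimates = np.zeros(T)
for n in range(1, T):
    # importance sampling from the transition prior p(x_n | x_{n-1}^(i))
    particles = (0.5 * particles + 25.0 * particles / (1.0 + particles ** 2)
                 + 8.0 * np.cos(1.2 * n) + np.sqrt(10.0) * rng.standard_normal(N))
    # weight update reduces to the likelihood W_n^(i) = p(y_n | x_n^(i))
    W = np.exp(-0.5 * (y[n] - particles ** 2 / 20.0) ** 2) + 1e-300   # guard against underflow
    W = W / W.sum()                             # normalize the weights
    estimates[n] = np.sum(W * particles)        # estimate computed before resampling
    idx = rng.choice(N, size=N, replace=True, p=W)                    # resample at every step
    particles = particles[idx]

print(np.mean(np.abs(estimates[1:] - x_true[1:])))   # rough error (sign is ambiguous under y = x^2/20)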

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Popular CONDENSATION

Figure Head tracking. Figure Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Popular CONDENSATION

Figure Hand tracking. Figure Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Popular CONDENSATION

Figure Hand tracking. Figure Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sampling Importance Resampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
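In code, this estimate is just a sample mean over the posterior draws; a tiny sketch with hypothetical samples (the Gaussian stand-in posterior and the choice of f are mine):

    import numpy as np

    rng = np.random.default_rng(1)
    samples = rng.normal(0.0, 1.0, size=5_000)   # stand-in for x_n^(i) ~ p(x_n | y_{n:0})

    def f(x):
        return np.sin(x) ** 2                    # any nonlinear function of the state

    f_hat = np.mean(f(samples))                  # (1/N) * sum_i f(x_n^(i))
    print(f_hat)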

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
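A short sketch of this self-normalized importance sampling estimate, with an illustrative Gaussian target and a wider Gaussian proposal (both my own choices, standing in for p and q):

    import numpy as np

    rng = np.random.default_rng(2)
    N = 20_000

    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    x = rng.normal(0.0, 3.0, size=N)                       # draws from the proposal q = N(0, 3^2)
    w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 3.0)    # unnormalized weights W(x^(i)) = p / q
    w_tilde = w / w.sum()                                  # normalized weights

    f = lambda v: v ** 2
    print(np.sum(w_tilde * f(x)))                          # approximates E_p[x^2] = 1 + 1 = 2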

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
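The recursion lends itself to a one-line update per particle; a generic sketch (the function name and the callable interface are my own, not from the slides):

    def update_weight(w_prev, x_new, x_prev, y_new, lik, trans, prop):
        # W_n = W_{n-1} * p(y_n | x_n) * p(x_n | x_{n-1}) / q(x_n | x_{n-1:0}, y_n)
        return w_prev * lik(y_new, x_new) * trans(x_new, x_prev) / prop(x_new, x_prev, y_new)

Note that if the transition prior is used as the proposal (the bootstrap choice later on), trans and prop cancel and the update reduces to multiplying by the likelihood.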

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
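A sketch of the effective-sample-size test (the skewed weight vector and the threshold P = N/2 are illustrative choices of mine):

    import numpy as np

    def effective_sample_size(w_tilde):
        # N_eff ~ 1 / sum_i w_tilde_i^2, for normalized weights w_tilde
        return 1.0 / np.sum(np.asarray(w_tilde) ** 2)

    w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])     # skewed normalized weights
    N = len(w)
    print(effective_sample_size(w))               # about 1.94, well below N = 5
    if effective_sample_size(w) < N / 2:          # threshold P = N / 2
        print("resample")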

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
    Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^(i) = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights ~W(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
  end for
  Compute N_eff = 1 / sum_{i=1}^N (~W(x_{n:0}^(i)))^2
  if N_eff < P then
    Generate new x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)} with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = ~W_{n:0}^(i)
    Reset the weights W_n^(i) = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
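The resampling step in the algorithm above can be implemented with a simple multinomial draw; a minimal sketch (the particle values and weights below are placeholders):

    import numpy as np

    def multinomial_resample(particles, w_tilde, rng):
        # draw N indices with replacement, P(pick i) = w_tilde_i, then reset weights to 1/N
        N = len(particles)
        idx = rng.choice(N, size=N, replace=True, p=w_tilde)
        return particles[idx], np.full(N, 1.0 / N)

    rng = np.random.default_rng(3)
    parts = np.array([-1.0, 0.0, 0.5, 2.0])
    w = np.array([0.1, 0.2, 0.6, 0.1])
    print(multinomial_resample(parts, w, rng))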

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^(i) ~ p(x_0)
  Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^(i) ~ p(x_n | x_{n-1}^(i))
  end for
  Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
  for i = 1, ..., N do
    Weight update: W_n^(i) = p(y_n | x_n^(i))
  end for
  for i = 1, ..., N do
    Normalize the weights ~W(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
  end for
  Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to ~W_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
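Putting the steps above together, here is a compact sketch of the bootstrap (SIR) filter on an assumed toy model x_n = 0.9 x_{n-1} + w_n, y_n = x_n + v_n with Gaussian noises; the model and all numbers are my own illustrative choices, not from the slides:

    import numpy as np

    rng = np.random.default_rng(4)
    T, N = 50, 500
    q_std, r_std = 1.0, 0.5                       # process and measurement noise std

    # simulate a trajectory and its observations from the toy model
    x_true = np.zeros(T)
    for n in range(1, T):
        x_true[n] = 0.9 * x_true[n - 1] + q_std * rng.standard_normal()
    y = x_true + r_std * rng.standard_normal(T)

    particles = rng.normal(0.0, 1.0, size=N)      # x_0^(i) ~ p(x_0)
    estimates = np.zeros(T)
    for n in range(T):
        # importance sampling from the transition prior p(x_n | x_{n-1}^(i))
        particles = 0.9 * particles + q_std * rng.standard_normal(N)
        # weight update W_n^(i) = p(y_n | x_n^(i)), then normalization
        w = np.exp(-0.5 * ((y[n] - particles) / r_std) ** 2)
        w /= w.sum()
        estimates[n] = np.sum(w * particles)      # posterior mean, taken before resampling
        # resampling with replacement according to the normalized weights
        particles = particles[rng.choice(N, size=N, replace=True, p=w)]

    print("RMSE:", np.sqrt(np.mean((estimates - x_true) ** 2)))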

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
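The annealed update is a one-line change to the weight recursion; a hedged sketch (the function name, argument order and default alpha are mine):

    def annealed_weight(w_prev, lik, trans, prop, alpha=0.5):
        # W_n = (W_{n-1})^alpha * p(y_n | x_n) * p(x_n | x_{n-1}) / q(x_n | x_{n-1:0}, y_n), 0 < alpha < 1
        return (w_prev ** alpha) * lik * trans / prop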

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]
[Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]
[Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]
[Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Page 30: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Criteria for optimal filtering I

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

which is aimed to find the conditional mean

xn = E [xn|yn0] =

intxnp(xn|yn0)dxn

Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)

Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected

Minimax which is to find the median of posterior p(xn|yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69

Criteria for optimal filtering II

MMSE finds the mean

MAP finds the mode

Minimax finds the median

Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially; resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

$$W_n = W_{n-1}^{\alpha} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
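As a quick sketch, the annealed update for a single particle could look like this; all numbers are placeholders.

    alpha = 0.7                                      # annealing factor, 0 < alpha < 1
    w_prev, lik, trans, prop = 0.02, 0.8, 0.5, 0.6   # illustrative values
    w_new = (w_prev ** alpha) * lik * trans / prop   # W_n = W_{n-1}^alpha * p(y|x) p(x|x') / q
    print(w_new)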

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Embedded video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Embedded videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Embedded videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Embedded videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • Bootstrap/SIR filter
            • Condensation
              • Conclusions
Page 32: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering III

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum conditional inaccuracy defined as

Ep(xy)[minus log p(x|y)] =

intp(x y) log

1

p(x|y)dxdy

Minimum conditional KL divergence

KL(p||p) =

intp(x y) log

p(x y)

p(x|y)p(x)dxdy

where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69

Criteria for optimal filtering IV

An optimal filter is rdquooptimalrdquo under a particular criteria

Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize

F(Q P) equiv EQ(x)[minus log P(x|y)]

= EQ(x)[logQ(x)

P(x|y)]minus EQ(x)[log Q(x)]

= KL(Q||P)minus H(Q)

This minimization can be done using (EM) algorithm

Q(xn+1) larr argmaxQ

F(Q P)

xn+1 larr argmaxx

F(Q P)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69

Which criteria to choose

All these criteria are valid for state and parameter estimation

MMSE requires the computation of the prior likelihood and evidence

MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive

MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass

A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small

Hence the width of the mode is more important than its height in thehigh-dimensional case

The last three criteria are all ML oriented They are very related

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f\big(x^{(i)}\big) = \bar{f}_N$$

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. $E[f(x)] = \int f(x)\,dP(x)$?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: An estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: An estimator is unbiased if its expected value is equal to the true value.

Efficiency: An estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: An estimator is robust if it is insensitive to the gross measurement errors and the uncertainties of the model.

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

$$\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx$$

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

$$\hat{f} = \frac{1}{N}\sum_{i=1}^{N} W\big(x^{(i)}\big)\, f\big(x^{(i)}\big)$$

where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that they sum to one, the importance weights are normalized:

$$\hat{f} = \sum_{i=1}^{N} \tilde{W}\big(x^{(i)}\big)\, f\big(x^{(i)}\big) \quad \text{with} \quad \tilde{W}\big(x^{(i)}\big) = \frac{W\big(x^{(i)}\big)}{\sum_{j=1}^{N} W\big(x^{(j)}\big)}$$

The variance of the estimate is given by

$$\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\Big[f(x)\frac{p(x)}{q(x)}\Big] = \frac{1}{N}\int \Big(\frac{f(x)p(x)}{q(x)}\Big)^2 q(x)\,dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variance;
match the shape of |f(x)|p(x) so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
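As an illustration, here is a minimal sketch of self-normalized importance sampling in Python/NumPy (my own toy setup, not from the slides): the target p(x) is a standard Gaussian known only up to its normalizing constant, the proposal q(x) is a heavier-tailed Gaussian, and f(x) = x^2, whose true expectation under p is 1.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Unnormalized target p(x) ∝ exp(-x^2/2); proposal q = N(0, 2^2) with heavier tails.
p_unnorm = lambda x: np.exp(-0.5 * x**2)
sigma_q = 2.0
q_pdf = lambda x: np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))
f = lambda x: x**2

x = rng.normal(0.0, sigma_q, size=N)      # samples from the proposal q
W = p_unnorm(x) / q_pdf(x)                # unnormalized importance weights
W_tilde = W / W.sum()                     # normalized weights
print("E[f] estimate:", np.sum(W_tilde * f(x)))   # true value is 1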

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to the outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < Cq(x) for every x ∈ X. The sampling proceeds as:

for n = 1 to N do
  Sample u ∼ U(0, 1)
  Sample x ∼ q(x)
  if u > p(x) / (Cq(x)) then
    Repeat sampling
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling


Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate; if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood p(y|x) ≤ C with C known, then

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C' q(x)$$

and the acceptance rate for sample x is $\frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}$.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(Cq(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
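A hedged sketch of the rejection sampling loop above (Python/NumPy; the Beta(2,2) target, uniform proposal, and bound C = 1.5 are my own illustrative choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Target: Beta(2,2) density p(x) = 6 x (1-x) on [0,1]; proposal: q = U(0,1).
# p(x) <= C q(x) with C = 1.5, since max_x p(x) = p(0.5) = 1.5.
p = lambda x: 6.0 * x * (1.0 - x)
C = 1.5

def rejection_sample(n_samples):
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform()                 # x ~ q(x)
        u = rng.uniform()                 # u ~ U(0,1)
        if u <= p(x) / C:                 # accept with probability p(x) / (C q(x))
            samples.append(x)
    return np.array(samples)

s = rejection_sample(50_000)
print("sample mean:", s.mean(), "(exact 0.5)")
print("acceptance rate is 1/C =", 1.0 / C)   # p and q are normalized here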

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

$$p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0,\cdots,x_{n-1})$$
$$q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0,\cdots,x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested to be used after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed to eliminate the samples with small importance weights and duplicate the samples with big weights.

Sample N random samples $\{x^{(i)}\}_{i=1}^{N}$ from q(x)
for i = 1, · · · , N do
  $W^{(i)} \propto p(x^{(i)})/q(x^{(i)})$
end for
for i = 1, · · · , N do
  Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)})\big/\sum_{j=1}^{N} W(x^{(j)})$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^{N}$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
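A minimal sketch of the resampling step above (multinomial resampling, one of several possible schemes; the toy particles and weights are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

def multinomial_resample(particles, weights):
    # Resample with replacement; each particle is drawn with probability
    # proportional to its normalized importance weight.
    N = len(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize the weights
    idx = rng.choice(N, size=N, replace=True, p=w)
    return particles[idx], np.full(N, 1.0 / N)    # equal weights after resampling

# Toy usage: a few particles with very skewed weights.
particles = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
weights = np.array([0.01, 0.02, 0.02, 0.05, 0.90])
new_particles, new_weights = multinomial_resample(particles, weights)
print(new_particles)   # dominated by replicates of the high-weight particle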

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k-th time step (usually k = 1); dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e. $W^{(i)} = 1/N$).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i\,|\,x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: Starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: A sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: Four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
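To make the two-dimensional illustration concrete, here is a small sketch of a Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho (my own toy target, not from the slides); each draw uses the standard closed-form Gaussian full conditionals.

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # correlation of the bivariate Gaussian target (illustrative choice)
M = 5000           # number of Gibbs iterations

x1, x2 = 0.0, 0.0  # arbitrary starting state
samples = np.empty((M, 2))
for n in range(M):
    # Full conditionals of a zero-mean, unit-variance bivariate Gaussian:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2)  and  x2 | x1 ~ N(rho * x1, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho**2))
    samples[n] = (x1, x2)

print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho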

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:

$$\int f(x)\,dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\,\big|-\nabla\nabla \log f(\hat{x})\big|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
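A hedged one-dimensional sketch of the Laplace approximation (my own example, not from the slides): the integrand f(x) = x^(a-1) e^(-x) is the unnormalized Gamma(a,1) density, so the exact integral is Gamma(a); here the mode and the Hessian of log f are available in closed form rather than computed numerically.

from math import gamma, pi, exp

# Laplace approximation of ∫ f(x) dx with f(x) = x^(a-1) e^(-x), exact value Γ(a).
#   mode: x_hat = a - 1;   d²/dx² log f(x_hat) = -(a-1)/x_hat² = -1/(a-1)
a = 6.0
x_hat = a - 1.0
f_hat = x_hat**(a - 1.0) * exp(-x_hat)
hess = (a - 1.0) / x_hat**2                       # |−∇∇ log f(x̂)| (a positive scalar in 1D)

laplace = (2.0 * pi)**0.5 * f_hat * hess**(-0.5)  # (2π)^{Nx/2} f(x̂) |−∇∇ log f(x̂)|^{-1/2}
print("Laplace approximation:", laplace)
print("Exact Gamma(a):       ", gamma(a))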

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

$$\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$$

where p(x) is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
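As a concrete instance (my own example, using the Gauss-Hermite rule provided by NumPy, not something prescribed by the slides): for a Gaussian weighting function, E[f(x)] with x ~ N(mu, sigma^2) is approximated by (1/sqrt(pi)) * sum_k w_k f(mu + sqrt(2) sigma x_k).

import numpy as np

# Gauss-Hermite quadrature:  ∫ f(x) N(x; mu, sigma^2) dx ≈ (1/√π) Σ_k w_k f(mu + √2 σ x_k)
mu, sigma = 1.0, 0.5
xk, wk = np.polynomial.hermite.hermgauss(10)      # 10 quadrature points and weights

f = lambda x: np.sin(x)                           # a smooth nonlinear test function
approx = np.sum(wk * f(mu + np.sqrt(2.0) * sigma * xk)) / np.sqrt(np.pi)

# Exact value: E[sin(X)] = sin(mu) * exp(-sigma^2 / 2) for X ~ N(mu, sigma^2)
exact = np.sin(mu) * np.exp(-0.5 * sigma**2)
print(approx, exact)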

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big|x^{(i)}\big|^k$$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are the samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(\hat{x}_f, \Sigma_f)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
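A small sketch of a two-component Gaussian sum used to represent a bimodal, non-Gaussian density (the weights, means, and variances are hand-picked for illustration); the overall mean and variance follow from the component moments.

import numpy as np

# Two-component Gaussian sum p(x) = Σ_j c_j N(x; m_j, s_j^2), with c_j > 0 and Σ_j c_j = 1.
c = np.array([0.3, 0.7])          # mixture weights
m = np.array([-2.0, 1.5])         # component means
s = np.array([0.5, 1.0])          # component standard deviations

def mixture_pdf(x):
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
    return comp @ c

# Overall mean and variance from the component moments:
mean = np.sum(c * m)
var = np.sum(c * (s**2 + m**2)) - mean**2
print("p(0) =", mixture_pdf(0.0)[0], " mean =", mean, " var =", var)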

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves along time according to the state equation, with its evolving pdf determined by the FPK equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

$$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.

By this approximation, we can estimate the mean of a nonlinear function

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \bar{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution $q(x_n|y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$$

$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it by

$$\hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0})\big)^2}$$

When $\hat{N}_{\text{eff}}$ is below a threshold P, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
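A tiny sketch of the effective-sample-size check that triggers resampling (the threshold P = N/2 is an arbitrary illustrative choice, not from the slides):

import numpy as np

def effective_sample_size(weights):
    # N_eff ≈ 1 / Σ_i w_i^2 for normalized importance weights w_i.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w**2)

w = np.array([0.70, 0.15, 0.10, 0.04, 0.01])
N_eff = effective_sample_size(w)
print(N_eff, "resample" if N_eff < 0.5 * len(w) else "keep")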

SIS particle filter with resampling

for n = 0, · · · , T do
  for i = 1, · · · , N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, · · · , N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, · · · , N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)})\big/\sum_{j=1}^{N} W(x^{(j)})$
  end for
  Compute $\hat{N}_{\text{eff}} = 1\big/\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0})\big)^2$
  if $\hat{N}_{\text{eff}} < P$ then
    Generate new $x_{n:0}^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$ with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}$
    Reset the weights $W_n^{(i)} = \frac{1}{N}$
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, · · · , N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 0, · · · , T do
  for i = 1, · · · , N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  for i = 1, · · · , N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, · · · , N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)})\big/\sum_{j=1}^{N} W(x^{(j)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
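To tie the pieces together, here is a hedged end-to-end sketch of the bootstrap/SIR filter on a toy one-dimensional nonlinear benchmark model (the model, noise levels, particle count, and seed are my own choices, not from the slides): particles are propagated through the transition prior, weighted by the likelihood, and then resampled.

import numpy as np

rng = np.random.default_rng(0)

# Toy nonlinear model (used here only as an example):
#   x_n = 0.5 x_{n-1} + 25 x_{n-1}/(1 + x_{n-1}^2) + 8 cos(1.2 n) + w_n,  w_n ~ N(0, 10)
#   y_n = x_n^2 / 20 + v_n,                                              v_n ~ N(0, 1)
T, N = 50, 1000
q_w, r_v = 10.0, 1.0

def f_trans(x, n):
    return 0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * np.cos(1.2 * n)

# Simulate a ground-truth trajectory and observations.
x_true = np.zeros(T); y = np.zeros(T)
x = 0.0
for n in range(T):
    x = f_trans(x, n) + rng.normal(0.0, np.sqrt(q_w))
    x_true[n] = x
    y[n] = x**2 / 20.0 + rng.normal(0.0, np.sqrt(r_v))

# Bootstrap (SIR) particle filter: proposal = transition prior, weights = likelihood.
particles = rng.normal(0.0, 2.0, size=N)   # initial particle cloud
estimates = np.zeros(T)
for n in range(T):
    particles = f_trans(particles, n) + rng.normal(0.0, np.sqrt(q_w), size=N)
    W = np.exp(-0.5 * (y[n] - particles**2 / 20.0)**2 / r_v) + 1e-300  # p(y_n|x_n^(i)), up to a constant
    W /= W.sum()
    estimates[n] = np.sum(W * particles)                  # posterior-mean estimate
    idx = rng.choice(N, size=N, replace=True, p=W)        # multinomial resampling
    particles = particles[idx]

print("RMSE:", np.sqrt(np.mean((estimates - x_true)**2)))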

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69



Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
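To make the two-dimensional illustration concrete, here is a small Python sketch of a Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho (the target density and the number of iterations are illustrative assumptions):

    import numpy as np

    def gibbs_bivariate_gaussian(rho, n_iter=1000, rng=None):
        # Full conditionals of a standard bivariate Gaussian with correlation rho:
        # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
        rng = np.random.default_rng() if rng is None else rng
        x1, x2 = 0.0, 0.0
        cond_std = np.sqrt(1.0 - rho ** 2)
        samples = np.empty((n_iter, 2))
        for n in range(n_iter):
            x1 = rng.normal(rho * x2, cond_std)   # sample x1 from p(x1 | x2)
            x2 = rng.normal(rho * x1, cond_std)   # sample x2 from p(x2 | x1)
            samples[n] = (x1, x2)
        return samples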

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions that divide the whole space.

Stratified sampling works very well and is efficient in a not-too-high dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
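A numerical sketch of this approximation in Python; the mode is found with scipy.optimize.minimize and the Hessian of log f is taken by central finite differences (both are illustrative choices):

    import numpy as np
    from scipy.optimize import minimize

    def laplace_integral(log_f, x0):
        # Approximate \int f(x) dx by (2 pi)^(d/2) f(x_hat) |-Hessian(log f)(x_hat)|^(-1/2)
        x_hat = minimize(lambda x: -log_f(x), x0).x   # mode of f
        d = len(x_hat)
        eps = 1e-4
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                ei = np.eye(d)[i] * eps
                ej = np.eye(d)[j] * eps
                H[i, j] = (log_f(x_hat + ei + ej) - log_f(x_hat + ei - ej)
                           - log_f(x_hat - ei + ej) + log_f(x_hat - ei - ej)) / (4 * eps ** 2)
        return (2 * np.pi) ** (d / 2) * np.exp(log_f(x_hat)) * np.linalg.det(-H) ** -0.5

For a Gaussian-shaped integrand the result is exact; for a multimodal f it only captures the mass around the chosen mode, which is the limitation noted above.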


Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
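For example, when the weighting function is Gaussian, the quadrature points and coefficients can be taken from a Gauss-Hermite rule. A brief Python sketch (the integrand and the order m are arbitrary illustrative choices):

    import numpy as np

    def gauss_hermite_expectation(f, mean, std, m=10):
        # Approximate E[f(x)] for x ~ N(mean, std^2) with an m-point Gauss-Hermite rule
        nodes, weights = np.polynomial.hermite_e.hermegauss(m)  # probabilists' Hermite rule
        x_k = mean + std * nodes
        return np.sum(weights * f(x_k)) / np.sqrt(2.0 * np.pi)

    # e.g. E[sin(x)] for x ~ N(0, 1) is approximately 0
    approx = gauss_hermite_expectation(np.sin, 0.0, 1.0)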

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly, so as to give a greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
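One grid-based (point-mass) filter update can be written compactly; in this Python sketch the cell transition matrix and the per-cell likelihood are assumed to be supplied by the caller:

    import numpy as np

    def grid_filter_update(prior, transition, likelihood):
        # prior:      (Nz,) filtered probabilities over the grid cells at time n-1
        # transition: (Nz, Nz) matrix with transition[j, i] = p(x_n = cell j | x_{n-1} = cell i)
        # likelihood: (Nz,) values of p(y_n | x_n = cell j)
        predicted = transition @ prior        # prediction step
        posterior = likelihood * predicted    # correction step
        return posterior / posterior.sum()    # renormalize over the grid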

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance, and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are the samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
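A one-line illustration of sample moments in Python (the Gamma test distribution is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=2.0, scale=1.0, size=100_000)       # samples from a "true" distribution
    sample_moments = [np.mean(x ** k) for k in (1, 2, 3)]   # estimates of E[x^k]
    # For Gamma(2, 1) the exact moments are 2, 6 and 24, so the estimates should be close.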

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^m c_j \mathcal{N}(x; \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
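Evaluating such a mixture density is straightforward; a short Python sketch (the particular weights, means, and covariances are made up for illustration):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_sum_pdf(x, weights, means, covs):
        # p(x) = sum_j c_j N(x; mu_j, Sigma_j), with c_j > 0 and sum_j c_j = 1
        return sum(c * multivariate_normal.pdf(x, mean=m, cov=S)
                   for c, m, S in zip(weights, means, covs))

    # Example: a bimodal 1-D density approximated by two Gaussian components
    p_at_half = gaussian_sum_pdf(0.5, weights=[0.3, 0.7], means=[-1.0, 2.0], covs=[0.5, 1.0])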

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing the i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})}
          = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
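A minimal Python sketch of this self-normalized estimator; the proposal sampler, the unnormalized target p(y_{n:0}|x_n)p(x_n), and the test function f are placeholders supplied by the caller:

    import numpy as np

    def self_normalized_estimate(f, log_unnorm_target, sample_proposal, log_proposal, n=10_000):
        rng = np.random.default_rng()
        x = sample_proposal(rng, n)                      # x^(i) ~ q(x_n | y_{n:0})
        log_w = log_unnorm_target(x) - log_proposal(x)   # unnormalized weights W_n(x^(i))
        w = np.exp(log_w - log_w.max())                  # subtract the max for numerical stability
        w_tilde = w / w.sum()                            # normalized weights
        return np.sum(w_tilde * f(x))                    # sum_i W~(x^(i)) f(x^(i))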

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it with the normalized weights

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
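In code the approximation is a single line; a Python sketch (the 0.5 N threshold is an arbitrary example value for P):

    import numpy as np

    def effective_sample_size(normalized_weights):
        # N_eff ~ 1 / sum_i W~_i^2: equals N for uniform weights, 1 when a single particle dominates
        return 1.0 / np.sum(normalized_weights ** 2)

    # resample only when degeneracy is detected, e.g.
    # if effective_sample_size(w) < 0.5 * len(w): resample(...)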

SIS particle filter with resampling

for n = 0, \cdots, T do
    for i = 1, \cdots, N do
        Draw samples x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
    end for
    for i = 1, \cdots, N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
    end for
    for i = 1, \cdots, N do
        Normalize the weights \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}
    end for
    Compute N_{eff} = \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}
    if N_{eff} < P then
        Generate new x_n^{(j)} by resampling with replacement N times from \{x_{n:0}^{(i)}\}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
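The same loop in compact Python, written generically over caller-supplied model functions (the signatures sample_q, log_lik, log_trans, log_q and the (N, d) particle layout are assumptions of this sketch, not part of the slides):

    import numpy as np

    def sis_particle_filter(ys, x0_particles, sample_q, log_lik, log_trans, log_q, threshold=0.5):
        rng = np.random.default_rng()
        x = np.asarray(x0_particles, dtype=float)             # shape (N, d)
        n_particles = len(x)
        w = np.full(n_particles, 1.0 / n_particles)
        means = []
        for y in ys:
            x_new = sample_q(rng, x, y)                       # x_n^(i) ~ q(x_n | x_{n-1}^(i), y_n)
            log_w = (np.log(w) + log_lik(y, x_new)            # p(y_n | x_n^(i))
                     + log_trans(x_new, x)                    # p(x_n^(i) | x_{n-1}^(i))
                     - log_q(x_new, x, y))                    # q(x_n^(i) | x_{n-1}^(i), y_n)
            w = np.exp(log_w - log_w.max())
            w /= w.sum()                                      # normalized weights
            means.append(np.sum(w[:, None] * x_new, axis=0))  # posterior mean estimate
            if 1.0 / np.sum(w ** 2) < threshold * n_particles:   # N_eff below the threshold
                idx = rng.choice(n_particles, size=n_particles, replace=True, p=w)
                x_new = x_new[idx]
                w = np.full(n_particles, 1.0 / n_particles)
            x = x_new
        return np.array(means)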

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, \cdots, N do
    Sample x_0^{(i)} \sim p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, \cdots, T do
    for i = 1, \cdots, N do
        Importance sampling: x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
    end for
    for i = 1, \cdots, N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, \cdots, N do
        Normalize the weights \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set \{x_n^{(i)}\} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
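As a concrete instance, here is a Python sketch of this bootstrap filter for the scalar model x_n = 0.5 x_{n-1} + w_n, y_n = x_n + v_n with Gaussian noises; the model, noise scales, and particle count are arbitrary illustrative choices:

    import numpy as np
    from scipy.stats import norm

    def bootstrap_filter(ys, n_particles=500, q_std=1.0, r_std=0.5, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.normal(0.0, 1.0, n_particles)                    # x_0^(i) ~ p(x_0)
        means = []
        for y in ys:
            x = 0.5 * x + rng.normal(0.0, q_std, n_particles)    # propagate through the transition prior
            w = norm.pdf(y, loc=x, scale=r_std)                  # weight by the likelihood p(y_n | x_n^(i))
            w /= w.sum()
            means.append(np.sum(w * x))                          # posterior estimate taken before resampling
            idx = rng.choice(n_particles, size=n_particles, replace=True, p=w)
            x = x[idx]                                           # resample; weights implicitly reset to 1/N
        return np.array(means)

Note that the filtered mean is computed before the resampling step, in line with the remark below that the posterior estimate should normally be calculated before resampling.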

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) \, p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) \, dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) \, dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) \, dx_n}{\int p(y_{n:0} | x_n) p(x_n) \, dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) \, dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) \, dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing the i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
          = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
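
As a concrete illustration of the self-normalized estimator above, here is a small Python sketch; the unnormalized Gaussian target and the wide Gaussian proposal are assumptions chosen only so the example is checkable, not part of the lecture.

import numpy as np

rng = np.random.default_rng(1)
N = 10000

# Proposal q(x): a wide Gaussian that is easy to sample from.
x = rng.normal(0.0, 3.0, size=N)
q_pdf = np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

# Unnormalized target p(y_{n:0} | x) p(x): here an unnormalized N(1, 0.5^2) density.
target_unnorm = np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)

W = target_unnorm / q_pdf      # importance weights W_n(x^(i))
W_tilde = W / W.sum()          # normalized weights

estimate = np.sum(W_tilde * x ** 2)   # self-normalized estimate of E[x^2]
print(estimate)                       # close to 1.0^2 + 0.5^2 = 1.25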

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) \, p(x_n | y_{n-1:0})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
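
To connect the weight recursion to code, here is a sketch of a single update step for a proposal that depends only on the previous state and the current measurement (the simplification used on the next slides); lik, trans and prop are hypothetical callables standing in for p(y_n | x_n), p(x_n | x_{n-1}) and q(x_n | x_{n-1}, y_n), and all concrete values below are illustrative assumptions.

import numpy as np

def update_weights(w_prev, x_new, x_prev, y_n, lik, trans, prop):
    # W_n^(i) = W_{n-1}^(i) * p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1}^(i), y_n)
    w = w_prev * lik(y_n, x_new) * trans(x_new, x_prev) / prop(x_new, x_prev, y_n)
    return w / w.sum()   # return normalized weights

# Toy usage with Gaussian densities (illustrative values only):
gauss = lambda z, m, s: np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
rng = np.random.default_rng(2)
x_prev = rng.normal(0.0, 1.0, 100)
x_new = x_prev + rng.normal(0.0, 0.3, 100)      # proposal q = transition prior
w = update_weights(
    np.full(100, 1.0 / 100), x_new, x_prev, y_n=0.2,
    lik=lambda y, x: gauss(y, x, 0.5),
    trans=lambda xn, xp: gauss(xn, xp, 0.3),
    prop=lambda xn, xp, y: gauss(xn, xp, 0.3),  # cancels with trans in this case
)
print(w.sum())   # 1.0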

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
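
A quick sketch of the degeneracy check in Python (the helper name and the N/2 threshold are assumptions for illustration): given normalized weights, compute N_eff and compare it against the threshold P.

import numpy as np

def effective_sample_size(w_tilde):
    # w_tilde: normalized importance weights (they sum to 1)
    return 1.0 / np.sum(w_tilde ** 2)

w_tilde = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # toy normalized weights
P = 0.5 * len(w_tilde)                            # e.g. resample if N_eff < N/2
n_eff = effective_sample_size(w_tilde)
print(n_eff, n_eff < P)                           # ~1.94, True -> resample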

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
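
Below is a minimal runnable Python sketch of this SIS-with-resampling loop on a toy 1-D model; the random-walk state, the Gaussian observation model, the use of the transition prior as the proposal q, and all parameter values are assumptions made for illustration, not part of the lecture.

import numpy as np

rng = np.random.default_rng(42)

# Toy state-space model (assumed): x_n = x_{n-1} + w_n,  y_n = x_n + v_n
q_var, r_var = 0.1, 0.5
T, N = 50, 500
P_thresh = N / 2                        # resampling threshold on N_eff

x_true = np.cumsum(rng.normal(0.0, np.sqrt(q_var), T))
y = x_true + rng.normal(0.0, np.sqrt(r_var), T)

particles = rng.normal(0.0, 1.0, N)     # samples from the prior p(x_0)
weights = np.full(N, 1.0 / N)
estimates = []

for n in range(T):
    # Proposal = transition prior, so W_n = W_{n-1} * p(y_n | x_n^(i))
    particles = particles + rng.normal(0.0, np.sqrt(q_var), N)
    weights = weights * np.exp(-0.5 * (y[n] - particles) ** 2 / r_var)
    weights = weights / weights.sum()               # normalize

    estimates.append(np.sum(weights * particles))   # posterior mean (before resampling)

    n_eff = 1.0 / np.sum(weights ** 2)              # effective sample size
    if n_eff < P_thresh:                            # resample only when needed
        idx = rng.choice(N, size=N, replace=True, p=weights)
        particles = particles[idx]
        weights = np.full(N, 1.0 / N)               # reset weights to 1/N

print(np.mean((np.array(estimates) - x_true) ** 2))  # rough tracking error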

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set \hat{x}_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
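
For contrast with the SIS loop above, here is a compact Python sketch of one time step of the bootstrap/SIR variant, where resampling happens at every step and the weights reduce to the normalized likelihoods; transition_sample and likelihood are hypothetical callables standing in for p(x_n | x_{n-1}) and p(y_n | x_n), and the numbers below are illustrative only.

import numpy as np

def sir_step(particles, y_n, transition_sample, likelihood, rng):
    # One bootstrap-filter step: propagate, weight by likelihood, resample.
    N = len(particles)
    proposed = transition_sample(particles, rng)     # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    w = likelihood(y_n, proposed)                    # W_n^(i) = p(y_n | x_n^(i))
    w = w / w.sum()                                  # normalize
    idx = rng.choice(N, size=N, replace=True, p=w)   # resample every step
    return proposed[idx]                             # equally weighted particles

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, 1000)
particles = sir_step(
    particles, y_n=0.3,
    transition_sample=lambda x, r: x + r.normal(0.0, 0.3, x.shape),
    likelihood=lambda y, x: np.exp(-0.5 * (y - x) ** 2 / 0.5),
    rng=rng,
)
print(particles.mean())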

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 36: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE

E [||xn minus xn||22|yn0] =

int||xn minus xn||22p(xn|yn0)dxn

Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities

As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)

In general we can only seek a suboptimal or locally optimal solution

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x_n | y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)

y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = df(x)/dx evaluated at x = x̂_n

G_{n+1,n} = dg(x)/dx evaluated at x = x̂_{n|n−1}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n | y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
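A sketch of a single EKF step under the linearization above, for the common additive-noise simplification x_{n+1} = f(x_n) + w_n, y_n = g(x_n) + v_n; the Jacobians are taken by finite differences, and the names and the additive-noise assumption are choices of this example rather than part of the lecture:

```python
import numpy as np

def jacobian(func, x, eps=1e-6):
    """Numerical Jacobian of func at x (forward finite differences)."""
    fx = np.atleast_1d(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - fx) / eps
    return J

def ekf_step(f, g, Q, R, mu, V, y):
    """One EKF predict/correct step (additive process noise Q and measurement noise R assumed)."""
    F = jacobian(f, mu)                  # F_{n+1,n}: Jacobian of the dynamics at the current estimate
    mu_pred = f(mu)
    P = F @ V @ F.T + Q                  # predicted covariance
    G = jacobian(g, mu_pred)             # G: Jacobian of the measurement model at the prediction
    S = G @ P @ G.T + R
    K = P @ G.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + K @ (y - g(mu_pred))
    V_new = (np.eye(mu.size) - K @ G) @ P
    return mu_new, V_new
```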

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

It's a brute-force technique: provided that one can draw i.i.d. samples x^(1), · · · , x^(N) from the probability distribution P(x), then

∫_X f(x) dP(x) ≈ (1/N) Σ_{i=1}^N f(x^(i)) = f̄_N

for which E[f̄_N] = E[f] and Var[f̄_N] = (1/N) Var[f] = σ²/N.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), f̄_N converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

√N ( f̄_N − E[f] ) ~ N(0, σ²)

where σ² is the variance of f(x). The error rate is of order O(N^{−1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
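A toy numpy illustration of the O(N^{−1/2}) behaviour above; the integrand cos²(x) and the standard normal P(x) are arbitrary choices for this example (the exact value is (1 + e^{−2})/2 ≈ 0.568):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo estimate of E[f(x)] from n i.i.d. samples."""
    x = sampler(n)
    return np.mean(f(x))

# Example: E[cos(x)^2] under x ~ N(0, 1); exact value is (1 + e^{-2}) / 2.
f = lambda x: np.cos(x) ** 2
for n in (10**2, 10**4, 10**6):
    est = mc_estimate(f, lambda size: rng.standard_normal(size), n)
    print(n, est)   # the error shrinks roughly like O(n^{-1/2})
```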


Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

∫_X f(x) dP(x) ≈ (1/N) Σ_{i=1}^N f(x^(i)) = f̄_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = ∫ f(x) dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

∫ f(x) p(x) dx = ∫ f(x) [p(x) / q(x)] q(x) dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

f̄ = (1/N) Σ_{i=1}^N W(x^(i)) f(x^(i))

where W(x^(i)) = p(x^(i)) / q(x^(i)) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
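A self-normalized importance-sampling sketch in numpy/scipy; the standard normal target and the heavy-tailed Student-t proposal are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm, t as student_t

rng = np.random.default_rng(1)

def importance_estimate(f, n=100_000):
    """Estimate E_p[f(x)] for p = N(0,1) using a heavy-tailed Student-t proposal q."""
    x = student_t.rvs(df=3, size=n, random_state=rng)   # samples from q
    w = norm.pdf(x) / student_t.pdf(x, df=3)             # unnormalized weights W = p/q
    w_norm = w / w.sum()                                  # normalized weights
    return np.sum(w_norm * f(x))

print(importance_estimate(lambda x: x**2))   # ≈ 1.0, the variance of N(0, 1)
```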

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights sum to one, they are normalized:

f̄ = Σ_{i=1}^N W̃(x^(i)) f(x^(i))   with   W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))

The variance of the estimate is given by

Var[f̄] = (1/N) Var[f(x) W(x)] = (1/N) Var[f(x) p(x)/q(x)]

= (1/N) ∫ ( f(x) p(x) / q(x) )² q(x) dx − (E[f(x)])² / N

The variance can be reduced when q(x) is chosen to

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure is:

for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood satisfies p(y|x) ≤ C with C known, then

p(x|y) = p(y|x) p(x) / p(y) ≤ C q(x) / p(y) ≡ C' q(x)

and the acceptance rate for a sample x is p(x|y) / (C' q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
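A sketch of the accept/reject loop, using a standard normal target p and a Laplace proposal q; the bound C = 1.5 does satisfy p(x) < C q(x) for this particular pair, but all of these choices are illustrative:

```python
import numpy as np
from scipy.stats import norm, laplace

rng = np.random.default_rng(2)

def rejection_sample(n, C=1.5):
    """Draw n exact samples from p = N(0,1) with a Laplace(0,1) proposal q and bound p < C q."""
    out = []
    while len(out) < n:
        x = laplace.rvs(random_state=rng)
        u = rng.uniform()
        if u <= norm.pdf(x) / (C * laplace.pdf(x)):   # accept with probability p(x) / (C q(x))
            out.append(x)
    return np.array(out)

samples = rejection_sample(1000)   # exact draws from N(0, 1)
```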

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t−1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1 | x_0) · · · p(x_n | x_0, · · · , x_{n−1})

q(x_{n:0}) = q_0(x_0) q_1(x_1 | x_0) · · · q_n(x_n | x_0, · · · , x_{n−1})

The weights can be recursively calculated as

W_n(x_{n:0}) = p(x_{n:0}) / q(x_{n:0}) = W_{n−1}(x_{n−1:0}) · p(x_n | x_{n−1:0}) / q_n(x_n | x_{n−1:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^(i)) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with large weights.

Sample N random samples {x^(i)}_{i=1}^N from q(x)
for i = 1, · · · , N do
  W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, · · · , N do
  Normalize the weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
end for
Resample with replacement N times from the discrete set {x^(i)}_{i=1}^N, where the probability of resampling from each x^(i) is proportional to W̃(x^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
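Two common implementations of the resampling step are sketched below: multinomial resampling, as in the pseudocode above, and systematic resampling, a frequently used lower-variance variant (function names are illustrative):

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Resample with replacement according to the normalized weights; returns equally weighted particles."""
    n = len(particles)
    idx = rng.choice(n, size=n, replace=True, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

def systematic_resample(particles, weights, rng):
    """Systematic resampling: a single uniform draw placed on a stratified grid (lower Monte Carlo variance)."""
    n = len(particles)
    positions = (rng.uniform() + np.arange(n)) / n
    cs = np.cumsum(weights)
    cs[-1] = 1.0                      # guard against floating-point round-off
    idx = np.searchsorted(cs, positions)
    return particles[idx], np.full(n, 1.0 / n)
```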

Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary.

Deterministic resampling is taken at every k-th time step (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^(i), W^(i)} are replaced by the new samples with equal importance weights (i.e. W^(i) = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, · · · , x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, · · · , x_{i−1,n}, x_{i+1,n−1}, · · · , x_{N_x,n−1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
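A sketch for a bivariate Gaussian target with correlation ρ, where both conditionals are known in closed form (the target and its parameters are an assumption of this toy example):

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.8, m=5000, rng=None):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = rng or np.random.default_rng(3)
    x1, x2 = 0.0, 0.0                       # arbitrary initialization
    samples = np.empty((m, 2))
    s = np.sqrt(1.0 - rho**2)               # conditional standard deviation
    for n in range(m):
        x1 = rng.normal(rho * x2, s)        # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, s)        # x2 | x1 ~ N(rho * x1, 1 - rho^2)
        samples[n] = (x1, x2)
    return samples

print(np.corrcoef(gibbs_bivariate_gaussian().T)[0, 1])   # ≈ 0.8 once the chain has mixed
```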

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n−1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good, since the gradient direction might indicate the way to find states with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | −∇∇ log f(x̂) |^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
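A one-dimensional sketch of the Laplace approximation to ∫ f(x) dx, with the mode found numerically and the second derivative of log f taken by finite differences; scipy is assumed, and the example integrand is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_integral_1d(f, bounds=(-10, 10)):
    """Approximate ∫ f(x) dx by fitting a Gaussian at the mode of f (f > 0 and unimodal assumed)."""
    res = minimize_scalar(lambda x: -np.log(f(x)), bounds=bounds, method="bounded")
    x_hat = res.x
    h = 1e-4
    # second derivative of log f at the mode (negative there)
    d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h**2
    return np.sqrt(2 * np.pi / -d2) * f(x_hat)   # (2*pi)^{1/2} f(x_hat) |d2|^{-1/2}

# Example: f(x) = exp(-x^2 / 2) integrates exactly to sqrt(2*pi) ≈ 2.5066
print(laplace_integral_1d(lambda x: np.exp(-x**2 / 2)))
```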


Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
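For a Gaussian weighting function, the quadrature points and weights above are given by the Gauss-Hermite rule; a small numpy sketch follows, where the change of variables handles a general N(m, s²) weight (the example expectation is arbitrary):

```python
import numpy as np

def gauss_hermite_expectation(f, m, s, order=10):
    """Approximate E[f(x)] for x ~ N(m, s^2) with an 'order'-point Gauss-Hermite rule."""
    nodes, weights = np.polynomial.hermite.hermgauss(order)   # rule for weight exp(-t^2)
    x = m + np.sqrt(2.0) * s * nodes                          # change of variables t -> x
    return np.sum(weights * f(x)) / np.sqrt(np.pi)

# Example: E[x^2] for x ~ N(0, 1) is exactly 1
print(gauss_hermite_expectation(lambda x: x**2, m=0.0, s=1.0))
```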

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states, and then a grid-based method can be further used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly, so as to give a greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N |x^(i)|^k

where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = Σ_{j=1}^m c_j N(x̂_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{n:0}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^(i)) ≡ p̂(x_n | y_{n:0})

where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{n:0}) dx_n

= (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^(i)) dx_n

= (1/N) Σ_{i=1}^N f(x_n^(i)) ≡ f̄_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0} | x_n) p(x_n) / q(x_n | y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) [p(x_n | y_{n:0}) / q(x_n | y_{n:0})] q(x_n | y_{n:0}) dx_n

= ∫ f(x_n) [W_n(x_n) / p(y_{n:0})] q(x_n | y_{n:0}) dx_n

= ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ p(y_{n:0} | x_n) p(x_n) dx_n

= ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n | y_{n:0}) dx_n

= E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n | y_{n:0})}[W_n(x_n)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n | y_{n:0})}[W_n(x_n)]

By drawing i.i.d. samples x_n^(i) from q(x_n | y_{n:0}), we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) f(x_n^(i)) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) ]

= Σ_{i=1}^N W̃(x_n^(i)) f(x_n^(i)) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^(i)) = W_n(x_n^(i)) / Σ_{i=1}^N W_n(x_n^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t−1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n−1:0} | y_{n−1:0}) · p(y_n | x_n) p(x_n | x_{n−1}) / p(y_n | y_{n−1:0})

We can then create a recursive rule to update the weights:

W_n^(i) = p(x_{n:0}^(i) | y_{n:0}) / q(x_{n:0}^(i) | y_{n:0})

∝ p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) p(x_{n−1:0}^(i) | y_{n−1:0}) / [ q(x_n^(i) | x_{n−1:0}^(i), y_{n:0}) q(x_{n−1:0}^(i) | y_{n−1:0}) ]

= W_{n−1}^(i) · p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{n−1:0}^(i), y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^(i) | x_{n−1:0}^(i), y_{n:0}) = q(x_n^(i) | x_{n−1:0}^(i), y_n); then we can write

W_n^(i) = W_{n−1}^(i) · p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{n−1:0}^(i), y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

N_eff = N / E_{q(·|y_{n:0})}[(W(x_{n:0}))²] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))²

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
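In code the degeneracy check is a one-liner on the normalized weights; the threshold N/2 used here is a common but arbitrary choice, not prescribed by the slides:

```python
import numpy as np

def effective_sample_size(w_norm):
    """N_eff ≈ 1 / sum_i w_i^2 for normalized weights w_norm."""
    return 1.0 / np.sum(w_norm ** 2)

def needs_resampling(w_norm, threshold=None):
    """Trigger resampling when N_eff drops below the threshold (default N/2)."""
    threshold = threshold if threshold is not None else 0.5 * len(w_norm)
    return effective_sample_size(w_norm) < threshold
```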

SIS particle filter with resampling

for n = 0, · · · , T do
  for i = 1, · · · , N do
    Draw samples x_n^(i) ~ q(x_n | x_{n−1:0}^(i), y_{n:0})
    Set x_{n:0}^(i) = {x_{n−1:0}^(i), x_n^(i)}
  end for
  for i = 1, · · · , N do
    Calculate the weights W_n^(i) = W_{n−1}^(i) · p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{n−1:0}^(i), y_n)
  end for
  for i = 1, · · · , N do
    Normalize the weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
  end for
  Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))²
  if N_eff < P then
    Generate new x_n^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W̃_{n:0}^(i)
    Reset the weights W_n^(i) = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
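A generic sketch of this loop for a scalar state, where the user supplies the proposal sampler and the weight-update ratio; the interface and names are illustrative, not from the lecture:

```python
import numpy as np

def sis_particle_filter(ys, n, sample_prior, sample_q, weight_update, rng=None, P=None):
    """Generic SIS particle filter (1-D state) with resampling triggered when N_eff < P.

    sample_prior(n, rng)            -> initial particles x_0^(i)
    sample_q(x_prev, y, rng)        -> proposed particles x_n^(i) ~ q(. | x_prev, y)
    weight_update(x_new, x_prev, y) -> p(y|x_new) * p(x_new|x_prev) / q(x_new|x_prev, y)
    """
    rng = rng or np.random.default_rng(0)
    P = P or n / 2
    x = sample_prior(n, rng)
    w = np.full(n, 1.0 / n)
    means = []
    for y in ys:
        x_new = sample_q(x, y, rng)               # draw from the proposal
        w = w * weight_update(x_new, x, y)        # recursive importance weights
        w /= w.sum()                              # normalize
        means.append(np.sum(w * x_new))           # filtered posterior-mean estimate
        if 1.0 / np.sum(w ** 2) < P:              # effective sample size check
            idx = rng.choice(n, size=n, replace=True, p=w)
            x_new, w = x_new[idx], np.full(n, 1.0 / n)
        x = x_new
    return np.array(means)
```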

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, · · · , N do
  Sample x_0^(i) ~ p(x_0)
  Compute W_0^(i) = 1/N
end for
for n = 0, · · · , T do
  for i = 1, · · · , N do
    Importance sampling: x̂_n^(i) ~ p(x_n | x_{n−1}^(i))
  end for
  Set x̂_{n:0}^(i) = {x_{n−1:0}^(i), x̂_n^(i)}
  for i = 1, · · · , N do
    Weight update: W_n^(i) = p(y_n | x̂_n^(i))
  end for
  for i = 1, · · · , N do
    Normalize the weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
  end for
  Resampling: generate N new particles x_n^(i) from the set {x̂_n^(i)} according to W̃_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
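Specializing the proposal to the transition prior gives the bootstrap filter; below is a sketch on a toy scalar model x_{n+1} = 0.5 x_n + w_n, y_n = x_n + v_n with Gaussian noises (the model and noise scales are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def bootstrap_filter(ys, n=500):
    """Bootstrap (SIR) filter: proposal = transition prior, weights = likelihood, resample every step."""
    x = rng.standard_normal(n)                      # x_0^(i) ~ p(x_0) = N(0, 1)
    means = []
    for y in ys:
        x = 0.5 * x + 0.3 * rng.standard_normal(n)  # propagate through the transition prior
        w = norm.pdf(y, loc=x, scale=0.5)           # weight by the likelihood p(y_n | x_n)
        w /= w.sum()
        means.append(np.sum(w * x))                 # filtered mean computed before resampling
        x = x[rng.choice(n, size=n, replace=True, p=w)]   # resample at every step
    return np.array(means)

# Simulate data from the same toy model and run the filter
xs_true = [0.0]
for _ in range(50):
    xs_true.append(0.5 * xs_true[-1] + 0.3 * rng.standard_normal())
ys = np.array(xs_true[1:]) + 0.5 * rng.standard_normal(50)
print(bootstrap_filter(ys)[:5])
```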

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n−1}^α · p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{n−1:0}^(i), y_n)

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking    Figure: Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking    Figure: Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking    Figure: Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Page 37: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Kalman filter revisited

In practice we are interested in the discrete simplified case

xn+1 = f(xnwn)

yn = g(xn vn)

When the dynamic system is linear Gaussian this reduces to

xn+1 = Fn+1nxn + wn

yn = Gnxn + vn

with Fn+1n the transition matrix and Gn the measurement matrix

This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically

In the general case it is not tractable and we will rely on approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69

Kalman filter Forward equations I

We start by defining the messages

α(zn) = N (zn|micronVn)

Using the HMM recursion formulas for continuous variables we have

cnα(zn) = p(xn|zn)

intα(znminus1)p(zn|znminus1)dznminus1

Substituting the conditionals we have

cnN (zn|micron Vn) = N (xn|Czn Σ)

ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1

= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)

Here we assume that micronminus1 and Vnminus1 are known and we have defined

Pnminus1 = AVnminus1AT + Γ

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69

Kalman filter Forward equations II

Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn

micron = Amicronminus1 + Kn(xn minus CAmicronminus1)

Vn = (IminusKnC)Pnminus1

cn = N (xn|CAmicronminus1CPnminus1CT + Σ)

where the Kalman gain matrix is defined as

Kn = Pnminus1CT (CPnminus1CT + Σ)minus1

The initial conditions are given by

micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0

c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1

Interpretation is making prediction and doing corrections with Kn

The likelihood can be computed as p(X) =prodN

n=1 cn

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, \dots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} \sim p(x_i | x_{1,n}, \dots, x_{i-1,n}, x_{i+1,n-1}, \dots, x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
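As a small illustration of alternating conditional sampling, the sketch below runs a Gibbs sampler on a bivariate Gaussian with correlation rho, whose exact conditionals are known in closed form; the target and rho = 0.8 are assumptions made only for this example.

```python
# A minimal sketch of Gibbs sampling for a bivariate Gaussian with zero mean,
# unit variances and correlation rho; the exact conditionals
# p(x1 | x2) = N(rho*x2, 1-rho^2) and p(x2 | x1) = N(rho*x1, 1-rho^2)
# play the role of the low-dimensional conditional densities on the slide.
import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5000
x1, x2 = 0.0, 0.0                     # initial state (any starting point works)
samples = np.empty((M, 2))

for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # x2 ~ p(x2 | x1)
    samples[n] = (x1, x2)

print("sample correlation:", np.corrcoef(samples[1000:].T)[0, 1])  # close to rho
```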

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior is Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\, dx by fitting a Gaussian at the maximum \hat{x} of f(x) and further computing the volume

\int f(x)\, dx \approx (2\pi)^{N_x/2} f(\hat{x})\, | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
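A minimal numerical sketch of the Laplace approximation in one dimension (N_x = 1): the mode and the second derivative of log f are found numerically, and the resulting Gaussian volume is compared against a brute-force sum. The test function f is an arbitrary unimodal choice for illustration.

```python
# A minimal sketch of the Laplace approximation in 1D: fit a Gaussian at the
# maximum of f and use (2*pi)^{1/2} f(x_hat) |d^2/dx^2 log f(x_hat)|^{-1/2}
# as an estimate of the integral of f.  The test function is an arbitrary
# unnormalized, unimodal density chosen for illustration.
import numpy as np

def f(x):
    return np.exp(-(x - 1.0) ** 2 - 0.1 * x ** 4)

# locate the mode on a fine grid (a Newton iteration would also do)
grid = np.linspace(-5.0, 5.0, 200001)
x_hat = grid[np.argmax(f(grid))]

# second derivative of log f at the mode, by central finite differences
h = 1e-4
d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2

laplace = np.sqrt(2 * np.pi) * f(x_hat) / np.sqrt(abs(d2))
reference = f(grid).sum() * (grid[1] - grid[0])      # brute-force Riemann sum
print(f"Laplace: {laplace:.4f}   numerical: {reference:.4f}")
```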

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) on the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
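The sketch below illustrates the weighted-sum idea with generic Gauss-Legendre nodes from NumPy, simply absorbing the weighting function p(x) into the integrand rather than constructing a rule tailored to p(x); the choices of f, p, and the interval are illustrative assumptions.

```python
# A minimal sketch of quadrature: approximate the finite integral of
# f(x) * p(x) over [a, b] by a weighted sum c_k f(x_k).  Gauss-Legendre
# nodes/weights from NumPy are mapped onto [a, b]; p is a weighting
# function chosen only for illustration.
import numpy as np

f = np.cos                                   # smooth integrand
p = lambda x: np.exp(-x)                     # weighting function
a, b, m = 0.0, 2.0, 10                       # interval and number of nodes

nodes, weights = np.polynomial.legendre.leggauss(m)   # nodes/weights on [-1, 1]
x_k = 0.5 * (b - a) * nodes + 0.5 * (b + a)           # map nodes to [a, b]
c_k = 0.5 * (b - a) * weights

approx = np.sum(c_k * f(x_k) * p(x_k))

grid = np.linspace(a, b, 100001)                      # brute-force reference
reference = np.sum(f(grid) * p(grid)) * (grid[1] - grid[0])
print(f"quadrature: {approx:.6f}   reference: {reference:.6f}")
```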

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
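A minimal sketch of the sample-moment idea, assuming we can draw from the true distribution (here a standard Gaussian, chosen only for illustration, whose raw moments are 0, 1, 0, 3):

```python
# A minimal sketch of moment approximation: k-th order sample moments of
# draws from N(0, 1) compared against the known true moments.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)       # samples x^(i) from the true density

for k in range(1, 5):
    m_k = np.mean(x ** k)                    # (1/N) * sum_i (x^(i))^k
    print(f"k={k}: sample moment {m_k:+.3f}")
```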

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x; \mu_j, \Sigma_j)

where the weighting coefficients c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
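A small sketch of a one-dimensional Gaussian sum: three components with hand-picked weights, means, and variances (illustrative values, not fitted to anything) define a valid density whose mass integrates to one.

```python
# A minimal sketch of a Gaussian sum (mixture) approximation in 1D: a skewed
# density is approximated by m weighted Gaussians; components and weights are
# illustrative, not fitted.
import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

c     = np.array([0.5, 0.3, 0.2])        # c_j > 0, sum_j c_j = 1
mu    = np.array([0.0, 1.5, 3.0])
sigma = np.array([0.7, 1.0, 1.5])

def gaussian_sum(x):
    return sum(cj * gauss_pdf(x, mj, sj) for cj, mj, sj in zip(c, mu, sigma))

grid = np.linspace(-4.0, 8.0, 1201)
mass = np.sum(gaussian_sum(grid)) * (grid[1] - grid[0])
print("total probability mass:", round(mass, 4))   # close to 1
```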

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{0:n}).

By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n | y_{0:n})\, dx_n
= \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n
= \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
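A minimal sketch of this point-mass approximation: if i.i.d. samples from the posterior were available (here a Gaussian stands in for it, purely for illustration), E[f(x_n)] is just the sample average of f.

```python
# A minimal sketch of the point-mass (Monte Carlo) approximation: if we could
# draw x^(i) i.i.d. from the posterior (here stood in for by N(2, 0.5^2)),
# the expectation of a nonlinear f is just the sample average of f(x^(i)).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) + x ** 2             # arbitrary nonlinear function

x = rng.normal(2.0, 0.5, size=50_000)        # x^(i) ~ "posterior" (assumed)
estimate = np.mean(f(x))                     # (1/N) * sum_i f(x^(i))
print("Monte Carlo estimate of E[f(x)]:", round(estimate, 3))
```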

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior; it is common to sample instead from the so-called proposal distribution q(x_n | y_{0:n}). Let's define

W_n(x_n) = \frac{p(y_{0:n} | x_n)\, p(x_n)}{q(x_n | y_{0:n})}

We can then write

E[f(x_n)] = \int f(x_n)\, \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})}\, q(x_n | y_{0:n})\, dx_n
= \int f(x_n)\, \frac{W_n(x_n)}{p(y_{0:n})}\, q(x_n | y_{0:n})\, dx_n
= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n | y_{0:n})\, dx_n}{\int p(y_{0:n} | x_n)\, p(x_n)\, dx_n}
= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n | y_{0:n})\, dx_n}{\int W_n(x_n)\, q(x_n | y_{0:n})\, dx_n}
= \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
= \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
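A minimal sketch of this self-normalized estimator with a toy Gaussian target and a broader Gaussian proposal (both illustrative assumptions); the true value of E[x^2] under N(1, 1) is 2.

```python
# A minimal sketch of the self-normalized importance sampling estimator of
# E[f(x_n)]: draw from a proposal q, form unnormalized weights W_n ~ p/q,
# normalize them, and take the weighted average of f.  Target and proposal
# are toy 1D Gaussians chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
N = 10_000
x = rng.normal(0.0, 3.0, size=N)                       # x^(i) ~ q
W = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 3.0)    # unnormalized W_n(x^(i))
W_tilde = W / W.sum()                                  # normalized weights

estimate = np.sum(W_tilde * f(x))                      # sum_i W~(x^(i)) f(x^(i))
print("self-normalized IS estimate of E[x^2]:", round(estimate, 3))   # close to 2
```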

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})

As before the posterior can be written as

p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1})\, \frac{p(y_n | x_n)\, p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})}
\propto \frac{p(y_n | x_n^{(i)})\, p(x_n^{(i)} | x_{n-1}^{(i)})\, p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})\, q(x_{0:n-1}^{(i)} | y_{0:n-1})}
= W_{n-1}^{(i)}\, \frac{p(y_n | x_n^{(i)})\, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n}).

Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n | x_n^{(i)})\, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
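The effective sample size is straightforward to compute from the normalized weights; the sketch below uses synthetic, deliberately skewed weights and a threshold P = N/2, both of which are assumptions for illustration only.

```python
# A minimal sketch of the effective-sample-size diagnostic: N_eff is computed
# from normalized weights and compared with a threshold P to decide whether
# to resample.  The weights here are synthetic, just to show the computation.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
W = rng.exponential(size=N) ** 3      # deliberately skewed unnormalized weights
W_tilde = W / W.sum()

N_eff = 1.0 / np.sum(W_tilde ** 2)    # N_eff ~ 1 / sum_i (W~^(i))^2
P = N / 2                             # a common threshold choice (assumed here)
print(f"N_eff = {N_eff:.1f} out of {N}; resample: {N_eff < P}")
```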

SIS particle filter with resampling

for n = 0, \dots, T do
    for i = 1, \dots, N do
        Draw samples x_n^{(i)} \sim q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}
    end for
    for i = 1, \dots, N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n | x_n^{(i)})\, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}
    end for
    for i = 1, \dots, N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{0:n}^{(j)} by resampling with replacement N times from \{x_{0:n}^{(i)}\}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, \dots, N do
    Sample x_0^{(i)} \sim p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, \dots, T do
    for i = 1, \dots, N do
        Importance sampling: \hat{x}_n^{(i)} \sim p(x_n | x_{n-1}^{(i)})
    end for
    Set \hat{x}_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}\}
    for i = 1, \dots, N do
        Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
    end for
    for i = 1, \dots, N do
        Normalize the weights \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set \{\hat{x}_n^{(i)}\} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
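Putting the algorithm together, here is a minimal bootstrap/SIR particle filter for a toy linear-Gaussian model x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n (the model, noise levels, and particle count are illustrative assumptions). Because the transition prior is the proposal, the weight update is just the likelihood p(y_n | x_n^{(i)}), as in the algorithm above.

```python
# A minimal sketch of the bootstrap/SIR particle filter for a toy 1D model
#   x_n = 0.9 x_{n-1} + process noise,   y_n = x_n + measurement noise.
# The transition prior is used as the proposal, so the weight update reduces
# to the likelihood p(y_n | x_n).  Model and noise levels are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# simulate a trajectory and noisy observations
T, q_std, r_std = 50, 0.5, 0.8
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, q_std)
    y[n] = x_true[n] + rng.normal(0.0, r_std)

# bootstrap particle filter
N = 1000
particles = rng.normal(0.0, 1.0, size=N)                          # x_0^(i) ~ p(x_0)
estimates = np.zeros(T)
for n in range(1, T):
    particles = 0.9 * particles + rng.normal(0.0, q_std, size=N)  # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = gauss_pdf(y[n], particles, r_std)                         # W_n^(i) = p(y_n | x_n^(i))
    W /= W.sum()
    estimates[n] = np.sum(W * particles)                          # posterior-mean estimate
    idx = rng.choice(N, size=N, replace=True, p=W)                # resampling step
    particles = particles[idx]

rmse = np.sqrt(np.mean((estimates - x_true) ** 2))
print("filter RMSE:", round(rmse, 3))
```

On this linear-Gaussian toy problem a Kalman filter would of course be exact; the point is only to show the predict-weight-resample loop.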

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha}\, \frac{p(y_n | x_n^{(i)})\, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}\big[W_n(x_n)\,f(x_n)\big]}{E_{q(x_n \mid y_{n:0})}\big[W_n(x_n)\big]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n \mid y_{n:0})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^{N} \tilde{W}\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big) \;\equiv\; \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^{N} W_n\big(x_n^{(j)}\big)}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
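
A minimal sketch of this self-normalized importance sampling estimate, assuming a standard normal target, an N(0, 2^2) proposal, and that SciPy is available for the density evaluations:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Assumed toy setup: target p = N(0, 1), proposal q = N(0, 2^2).
    N = 10_000
    x = rng.normal(0.0, 2.0, size=N)                      # samples from the proposal q
    w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 2.0)     # unnormalized weights p/q

    w_tilde = w / w.sum()                                 # normalized weights
    f = lambda x: x**2
    f_hat = np.sum(w_tilde * f(x))                        # self-normalized IS estimate of E_p[f]

    print(f_hat)                                          # should be close to 1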

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{n:0} \mid y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{t-1:0},\,y_{t:0})$$

As before, the posterior can be written recursively as

$$p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}$$

We can then create a recursive rule to update the weights:

$$
W_n^{(i)} = \frac{p\big(x_{n:0}^{(i)} \mid y_{n:0}\big)}{q\big(x_{n:0}^{(i)} \mid y_{n:0}\big)}
\;\propto\; \frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)\,p\big(x_{n-1:0}^{(i)} \mid y_{n-1:0}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)},\,y_{n:0}\big)\,q\big(x_{n-1:0}^{(i)} \mid y_{n-1:0}\big)}
= W_{n-1}^{(i)}\,\frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)},\,y_{n:0}\big)}
$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
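
A sketch of this recursive weight update, assuming the likelihood, transition, and proposal densities are supplied as vectorized callables (the function and argument names below are illustrative, not from the slides):

    import numpy as np

    def update_weights(W_prev, x_new, x_prev, y_new,
                       likelihood, transition, proposal):
        """One step of the recursive importance-weight update.

        W_prev     : (N,) weights W_{n-1}^(i)
        x_new      : (N, d) proposed particles x_n^(i)
        x_prev     : (N, d) previous particles x_{n-1}^(i)
        likelihood : callable returning p(y_n | x_n) per particle
        transition : callable returning p(x_n | x_{n-1}) per particle
        proposal   : callable returning q(x_n | x_{n-1:0}, y_n) per particle
        """
        W = W_prev * likelihood(y_new, x_new) * transition(x_new, x_prev) \
            / proposal(x_new, x_prev, y_new)
        return W / W.sum()   # return normalized weights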

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n \mid y_{n:0})$ than in $p(x_{n:0} \mid y_{n:0})$.

Let's assume that $q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)},\,y_{n:0}\big) = q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)},\,y_n\big)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)},\,y_n\big)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it by

$$\hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When $\hat{N}_{\text{eff}}$ is below a threshold P, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
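
The approximate effective sample size is a one-liner on the normalized weights; a small sketch, assuming w already sums to one:

    import numpy as np

    def effective_sample_size(w):
        """Approximate N_eff from normalized importance weights w, shape (N,)."""
        return 1.0 / np.sum(w ** 2)

    # Uniform weights give N_eff = N; a single dominant weight gives N_eff close to 1.
    w = np.full(100, 1.0 / 100)
    print(effective_sample_size(w))   # 100.0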

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weight W_n^(i) = W_{n-1}^(i) * p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W~_n^(i) = W_n^(i) / sum_{j=1}^N W_n^(j)
    end for
    Compute N_eff = 1 / sum_{i=1}^N (W~_n^(i))^2
    if N_eff < P then
        Generate new particles x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W~_n^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
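
A compact sketch of this loop, assuming user-supplied callables propose_sample, propose_pdf, likelihood and transition (illustrative names, not from the slides) that operate on whole particle arrays:

    import numpy as np

    def sis_filter_with_resampling(y, x0, w0, propose_sample, propose_pdf,
                                   likelihood, transition, threshold, rng=None):
        """SIS particle filter with N_eff-triggered resampling.

        y  : sequence of observations y_0..y_T
        x0 : (N, d) initial particles; w0 : (N,) initial normalized weights
        threshold : the resampling threshold P on the effective sample size
        Yields the current (filtered) particles and weights at each step.
        """
        rng = rng or np.random.default_rng()
        N = x0.shape[0]
        x, w = x0.copy(), w0.copy()
        for y_n in y:
            x_prev = x
            x = propose_sample(x_prev, y_n)                      # draw x_n^(i) ~ q
            w = w * likelihood(y_n, x) * transition(x, x_prev) \
                  / propose_pdf(x, x_prev, y_n)                  # weight update
            w = w / w.sum()                                      # normalize
            n_eff = 1.0 / np.sum(w ** 2)
            if n_eff < threshold:                                # resample if degenerate
                idx = rng.choice(N, size=N, p=w)
                x = x[idx]
                w = np.full(N, 1.0 / N)
            yield x, w

The filtered mean at each step is then just the weight-averaged particle set yielded by the generator.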

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
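
The resampling step itself is just weighted sampling with replacement; a minimal sketch using multinomial resampling (one of several possible schemes):

    import numpy as np

    def multinomial_resample(particles, weights, rng=None):
        """Resample particles with replacement according to their normalized weights."""
        rng = rng or np.random.default_rng()
        N = len(weights)
        idx = rng.choice(N, size=N, p=weights)        # duplicate heavy particles, drop light ones
        return particles[idx], np.full(N, 1.0 / N)    # weights reset to 1/N

Lower-variance alternatives such as systematic or stratified resampling are common drop-in replacements.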

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: draw x_n^(i) ~ p(x_n | x_{n-1}^(i))
    end for
    Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W~_n^(i) = W_n^(i) / sum_{j=1}^N W_n^(j)
    end for
    Resampling: generate N new particles x_n^(i) from the current set according to W~_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
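
Putting the steps together, a sketch of the bootstrap filter for an assumed toy model (a 1-D linear-Gaussian state with Gaussian observation noise, chosen only to keep the example self-contained):

    import numpy as np

    rng = np.random.default_rng(1)

    # Assumed toy state-space model: x_n = 0.9 x_{n-1} + w_n,  y_n = x_n + v_n
    a, q_std, r_std, T, N = 0.9, 1.0, 0.5, 50, 500

    # Simulate a trajectory and observations
    x_true = np.zeros(T)
    for n in range(1, T):
        x_true[n] = a * x_true[n - 1] + q_std * rng.standard_normal()
    y = x_true + r_std * rng.standard_normal(T)

    # Bootstrap filter: proposal = transition prior, weight = likelihood
    particles = rng.normal(0.0, 1.0, size=N)
    estimates = []
    for n in range(T):
        particles = a * particles + q_std * rng.standard_normal(N)   # propagate
        logw = -0.5 * ((y[n] - particles) / r_std) ** 2               # log-likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()                                                  # normalize
        estimates.append(np.sum(w * particles))                       # filtered mean
        idx = rng.choice(N, size=N, p=w)                              # resample every step
        particles = particles[idx]

    print(estimates[-1], x_true[-1])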

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n^{(i)} = \big(W_{n-1}^{(i)}\big)^{\alpha}\,\frac{p\big(y_n \mid x_n^{(i)}\big)\,p\big(x_n^{(i)} \mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{n-1:0}^{(i)},\,y_n\big)}$$

where 0 < α < 1 is the annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
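
A small sketch of the annealed weight update, with α treated as a tuning constant (the value below is arbitrary):

    import numpy as np

    def annealed_weight_update(W_prev, lik, trans, prop, alpha=0.5):
        """Weight update with annealing factor 0 < alpha < 1 damping old weights.

        lik, trans, prop are per-particle values of p(y_n|x_n), p(x_n|x_{n-1})
        and q(x_n|x_{n-1:0}, y_n) (illustrative names).
        """
        W = (W_prev ** alpha) * lik * trans / prop
        return W / W.sum()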

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 40: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world

The nonlinear filtering consists in finding p(x|yn0)

The number of variables is infinite but not all of them are of equalimportance

Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation

Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69

Extended Kalman filter (EKF)

Recall the equations of motion

xn+1 = f(xnwn)

yn = g(xn vn)

These equations are linearized in the EKF

Fn+1n =df(x)

dx

∣∣∣∣x=xn

Gn+1n =dg(x)

dx

∣∣∣∣x=xn|nminus1

Then the conventional Kalman filter can be employed

Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)

A more general solution is to rely on numerical approximations

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^{N} W(x^{(i)})\, f(x^{(i)})

where W(x^(i)) = p(x^(i)) / q(x^(i)) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
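A minimal sketch of this estimator (my own toy example, not from the lecture): the target p is a standard Gaussian, the proposal q is a deliberately wider Gaussian, and the weights are W(x^(i)) = p(x^(i)) / q(x^(i)).

import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p(x) = N(0, 1); proposal q(x) = N(0, 2^2), heavier-tailed than p.
x = rng.normal(0.0, 2.0, size=N)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance weights W(x^(i))

f = x ** 2                            # estimate E_p[x^2], whose exact value is 1
print(np.mean(w * f))                 # plain importance sampling estimate
print(np.sum(w * f) / np.sum(w))      # self-normalized estimate (p known only up to a constant)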

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized,

\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)})\, f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}

The variance of the estimate is given by

\mathrm{Var}[\hat{f}] = \frac{1}{N}\,\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\,\mathrm{Var}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int\left(\frac{f(x)\,p(x)}{q(x)}\right)^2 dx \;-\; \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variance;
match the shape of |f(x)| p(x) so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure is:

for n = 1 to N do
    Sample u ~ U(0, 1)
    Sample x ~ q(x)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
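A small sketch of the accept/reject loop above (my own toy target and bound, not from the lecture): the target is p(x) = 3x² on [0, 1], the proposal is q = U(0, 1), and C = 3 bounds p(x)/q(x).

import numpy as np

rng = np.random.default_rng(2)

def rejection_sample(n_samples, p, q_sample, q_pdf, C):
    """Draw exact samples from p using a proposal q with p(x) <= C * q(x)."""
    accepted = []
    while len(accepted) < n_samples:
        x = q_sample()
        u = rng.uniform()
        if u <= p(x) / (C * q_pdf(x)):   # accept with probability p(x) / (C q(x))
            accepted.append(x)
    return np.array(accepted)

# Toy target p(x) = 3 x^2 on [0, 1]; proposal q = U(0, 1); C = 3 since p(x) <= 3.
p = lambda x: 3.0 * x ** 2
samples = rejection_sample(5_000, p,
                           q_sample=lambda: rng.uniform(),
                           q_pdf=lambda x: 1.0,
                           C=3.0)
print(samples.mean())   # E[x] under p is 3/4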

Rejection sampling

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood p(y|x) ≤ C with C known, then

p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is p(x|y) / (C' q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0},\, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0,\ldots,x_{n-1})

q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0,\ldots,x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^(i)) will be nonzero.

We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights.

Sample N random samples {x^(i)}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, ..., N do
    Normalize the weights: W~(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
end for
Resample with replacement N times from the discrete set {x^(i)}_{i=1}^N, where the probability of resampling from each x^(i) is proportional to W~(x^(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
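A minimal sketch of the resampling step (multinomial resampling; the particle values and weights below are made-up numbers for illustration):

import numpy as np

rng = np.random.default_rng(3)

def resample(particles, weights):
    """Multinomial resampling: draw N indices with probability proportional to the weights,
    then reset all weights to 1/N."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalized weights W~(x^(i))
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=w)
    new_particles = particles[idx]
    new_weights = np.full(len(particles), 1.0 / len(particles))
    return new_particles, new_weights

particles = np.array([0.1, 0.5, 0.9, 1.3])
weights = np.array([0.01, 0.02, 0.90, 0.07])          # heavily skewed weights
print(resample(particles, weights))                   # mostly replicates of the third particle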

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary.

Deterministic resampling is taken at every k-th time step (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^(i), W^(i)} are replaced by the new samples with equal importance weights (i.e. W^(i) = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
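A small sketch of this sweep for a two-dimensional case (my own toy model, not from the lecture): a zero-mean bivariate Gaussian with correlation rho, whose full conditionals are themselves Gaussian.

import numpy as np

rng = np.random.default_rng(4)

def gibbs_bivariate_gaussian(rho, M, x_init=(0.0, 0.0)):
    """Gibbs sampler for a zero-mean bivariate Gaussian with unit variances and correlation rho.
    Each conditional is p(x_i | x_j) = N(rho * x_j, 1 - rho^2)."""
    x1, x2 = x_init
    samples = np.empty((M, 2))
    for n in range(M):
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))   # draw x1 given the latest x2
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))   # draw x2 given the latest x1
        samples[n] = (x1, x2)
    return samples

s = gibbs_bivariate_gaussian(rho=0.8, M=20_000)
print(np.corrcoef(s[5_000:].T))   # empirical correlation approaches 0.8 after burn-in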

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be taken with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

\int f(x)\, dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, |{-\nabla\nabla \log f(\hat{x})}|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
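A one-dimensional sketch of this formula (my own toy integrand with its mode at 0; the mode location and the finite-difference Hessian are the assumed ingredients here):

import numpy as np

def laplace_integral_1d(f, x_hat, h=1e-4):
    """1-D Laplace approximation: int f(x) dx ~ sqrt(2*pi) * f(x_hat) * |d^2 log f / dx^2 (x_hat)|^(-1/2),
    with the second derivative of log f taken by central differences at the mode x_hat."""
    logf = lambda x: np.log(f(x))
    d2 = (logf(x_hat + h) - 2.0 * logf(x_hat) + logf(x_hat - h)) / h ** 2
    return np.sqrt(2.0 * np.pi) * f(x_hat) * abs(d2) ** -0.5

# Slightly non-Gaussian unimodal integrand with its mode at x_hat = 0.
f = lambda x: np.exp(-x ** 2 - 0.3 * x ** 4)

x = np.linspace(-6.0, 6.0, 200_001)
print(np.sum(f(x)) * (x[1] - x[0]))   # brute-force reference value of the integral
print(laplace_integral_1d(f, 0.0))    # Laplace approximation (close, but not exact)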

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^{m} c_k\, f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
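A concrete instance of such a quadrature rule (chosen here as an illustration; the lecture does not prescribe a specific rule) is Gauss-Hermite quadrature, which handles a Gaussian weighting function and so can evaluate Gaussian expectations with a handful of points:

import numpy as np

# Gauss-Hermite rule: int f(x) exp(-x^2) dx ~ sum_k c_k f(x_k).
# With the change of variables x = sqrt(2) t it gives E[f(X)] for X ~ N(0, 1).
nodes, c = np.polynomial.hermite.hermgauss(20)   # 20 quadrature points x_k and weights c_k

def gauss_expectation(f):
    return np.sum(c * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

print(gauss_expectation(lambda x: x ** 2))      # exact value is 1
print(gauss_expectation(lambda x: np.cos(x)))   # exact value is exp(-1/2) ~ 0.6065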

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly, so as to give greater resolution to the states with higher density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
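A tiny sketch of the sample-moment estimate (the Gaussian data below is my own toy example):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=100_000)   # samples x^(i), here drawn from N(2, 1)

def sample_moment(x, k):
    """k-th order raw sample moment: m_k ~ (1/N) * sum_i (x^(i))^k."""
    return np.mean(x ** k)

print(sample_moment(x, 1))   # ~ 2   (mean)
print(sample_moment(x, 2))   # ~ 5   (E[x^2] = mu^2 + sigma^2)
print(sample_moment(x, 3))   # ~ 14  (third raw moment of N(2, 1))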

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
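A one-dimensional sketch of such a mixture approximation (the component weights, means, and variances below are made-up numbers for illustration):

import numpy as np

def gaussian_mixture_pdf(x, c, mu, sigma):
    """p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1 (1-D case)."""
    x = np.atleast_1d(x)[:, None]
    components = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return components @ c

# A bimodal, clearly non-Gaussian density represented with m = 2 components.
c = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.5])
sigma = np.array([0.8, 1.2])

print(gaussian_mixture_pdf(np.linspace(-6.0, 6.0, 5), c, mu, sigma))
print(np.sum(c * mu))   # the mixture mean follows from the individual first-order moments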

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{n:0})\, dx_n
            = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n
            = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv f_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n)\, p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n)\, \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\, q(x_n|y_{n:0})\, dx_n
          = \int f(x_n)\, \frac{W_n(x_n)}{p(y_{n:0})}\, q(x_n|y_{n:0})\, dx_n
          = \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{n:0})\, dx_n}{\int p(y_{n:0}|x_n)\, p(x_n)\, dx_n}
          = \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{n:0})\, dx_n}{\int W_n(x_n)\, q(x_n|y_{n:0})\, dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^(i) from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0},\, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\, \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})\, p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\, q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
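A minimal numeric sketch of one such weight-update step (the particle-wise densities below are made-up values; in a real filter they come from the likelihood, transition, and proposal models):

import numpy as np

def update_weights(w_prev, lik, trans, prop):
    """One recursive update: W_n^(i) is proportional to
    W_{n-1}^(i) * p(y_n|x_n^(i)) * p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1:0}^(i), y_{n:0}),
    followed by normalization so the weights sum to one."""
    w = w_prev * lik * trans / prop
    return w / np.sum(w)

w_prev = np.full(4, 0.25)                   # previous normalized weights
lik   = np.array([0.10, 0.90, 0.40, 0.05])  # p(y_n | x_n^(i)) evaluated at each particle
trans = np.array([0.50, 0.50, 0.20, 0.80])  # p(x_n^(i) | x_{n-1}^(i))
prop  = np.array([0.50, 0.50, 0.50, 0.50])  # q(x_n^(i) | x_{n-1:0}^(i), y_{n:0})
print(update_weights(w_prev, lik, trans, prop))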

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^(i)|x_{n-1:0}^(i), y_{n:0}) = q(x_n^(i)|x_{n-1:0}^(i), y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
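A tiny sketch of the approximation N_eff ≈ 1 / Σ_i W~_i² (the two weight vectors below are made-up examples):

import numpy as np

def effective_sample_size(normalized_weights):
    """Approximate effective sample size from normalized importance weights."""
    w = np.asarray(normalized_weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))          # uniform weights: N_eff = 100
print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # one dominant weight: N_eff close to 1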

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate the weight W_n^(i) = W_{n-1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1:0}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W~(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W~(x_{n:0}^(i)))^2
    if N_eff < P then
        Generate new x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)}, with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W~_{n:0}^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^(i) ~ p(x_n | x_{n-1}^(i))
    end for
    Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W~(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W~_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
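A compact runnable sketch of this bootstrap filter for a toy one-dimensional model of my own choosing (state x_n = 0.9 x_{n-1} + w_n with w_n ~ N(0,1), observation y_n = x_n + v_n with v_n ~ N(0, 0.5^2)); the transition prior is used as the proposal, so the unnormalized weight is just the likelihood:

import numpy as np

rng = np.random.default_rng(6)

def bootstrap_pf(ys, N=1000):
    particles = rng.normal(0.0, 1.0, size=N)          # x_0^(i) ~ p(x_0)
    estimates = []
    for y in ys:
        # Importance sampling from the transition prior p(x_n | x_{n-1}^(i)).
        particles = 0.9 * particles + rng.normal(0.0, 1.0, size=N)
        # Weight update W_n^(i) = p(y_n | x_n^(i)) (constant factors cancel after normalization).
        w = np.exp(-0.5 * ((y - particles) / 0.5) ** 2)
        w /= np.sum(w)
        estimates.append(np.sum(w * particles))        # posterior mean estimate at time n
        # Resampling with replacement according to the normalized weights.
        particles = particles[rng.choice(N, size=N, replace=True, p=w)]
    return np.array(estimates)

# Simulate observations from the same toy model and run the filter.
x, ys = 0.0, []
for _ in range(50):
    x = 0.9 * x + rng.normal()
    ys.append(x + rng.normal(0.0, 0.5))

print(bootstrap_pf(ys)[:5])   # filtered posterior-mean estimates for the first few steps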

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure: CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Video: anim.mp4

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Videos: dancemv.mp4, leafmv.mp4

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Videos: hand.mp4, iccv98.mp4

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

Videos: imp.mp4, drawing.mp4

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so it is hard to give greater resolution to the regions where the density is high.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
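A minimal sketch of one point-mass update on a regular 1-D grid, assuming a Gaussian random-walk transition and a Gaussian likelihood (both illustrative choices, not the lecture's model).

import numpy as np

grid = np.linspace(-5.0, 5.0, 201)            # discretized state space
dx = grid[1] - grid[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

posterior = gauss(grid, 0.0, 1.0)             # p(x_{n-1} | y_{n-1:0}) as point masses
posterior /= posterior.sum() * dx

# Prediction: p(x_n | y_{n-1:0}) = sum_j p(x_n | x_j) p(x_j | y_{n-1:0}) dx
transition = gauss(grid[:, None], grid[None, :], 0.5)     # p(x_n | x_{n-1}) on the grid
predicted = transition @ posterior * dx

# Update with a new observation y_n = 1.2 (observation noise 0.8)
posterior = gauss(1.2, grid, 0.8) * predicted
posterior /= posterior.sum() * dx             # renormalize the point masses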

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

    m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
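A tiny numerical illustration (the Gaussian and the sample size are arbitrary choices): sample moments approach the true moments as the number of samples grows.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # samples from N(1, 2^2)

m1 = np.mean(x)               # ~ 1.0, the true mean
m2 = np.mean(x ** 2)          # ~ 5.0, since E[x^2] = mu^2 + sigma^2
var = np.mean((x - m1) ** 2)  # ~ 4.0, the true variance
print(m1, m2, var)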

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

    p(x) = \sum_{j=1}^m c_j \, \mathcal{N}(x; \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
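A small sketch of evaluating such a Gaussian-sum approximation in 1-D; the component means, standard deviations and weights are made-up values.

import numpy as np

means  = np.array([-2.0, 0.5, 3.0])
sigmas = np.array([ 0.7, 1.0, 0.5])
c      = np.array([ 0.3, 0.5, 0.2])          # c_j > 0 and sum to 1

def gaussian_sum_pdf(x):
    x = np.atleast_1d(x)[:, None]            # (n, 1) against the (m,) components
    comps = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    return comps @ c                         # weighted sum over the m components

print(gaussian_sum_pdf([-2.0, 0.0, 3.0]))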

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

    p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function

    E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n
              = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
              = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

    W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}

We can then write

    E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
              = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
              = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
              = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
              = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
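The ratio of expectations above is exactly what self-normalized importance sampling estimates. A hedged 1-D sketch, in which the unnormalized target p(y_{n:0}|x)p(x) and the Gaussian proposal q are illustrative stand-ins:

import numpy as np

rng = np.random.default_rng(1)

def unnorm_posterior(x):          # stands in for p(y_{n:0} | x) p(x), up to a constant
    return np.exp(-0.5 * ((x - 2.0) / 0.3) ** 2)

def q_pdf(x):                     # proposal q = N(0, 2^2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (np.sqrt(2 * np.pi) * 2.0)

f = lambda x: x                   # estimate the posterior mean

x = rng.normal(0.0, 2.0, size=5000)          # draw from the proposal
W = unnorm_posterior(x) / q_pdf(x)           # unnormalized weights W_n(x^{(i)})
estimate = np.sum(W * f(x)) / np.sum(W)      # ratio of the two Monte Carlo averages
print(estimate)                              # close to the true posterior mean 2.0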


Sequential Monte Carlo estimation III

We have written

    E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

    E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})}
              = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

    \tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

    q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

    p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights

    W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
             \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} | y_{n-1:0})}
             = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
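In code, one step of this weight recursion is just an element-wise product followed by normalization; the arrays below are illustrative placeholders for the three densities evaluated at each particle.

import numpy as np

def update_weights(W_prev, p_lik, p_trans, q_prop):
    # W_n ~ W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_{n:0})
    W = W_prev * p_lik * p_trans / q_prop
    return W / W.sum()                        # normalize over the particle set

W_prev = np.full(4, 0.25)
print(update_weights(W_prev,
                     p_lik=np.array([0.9, 0.1, 0.4, 0.2]),
                     p_trans=np.array([0.5, 0.5, 0.2, 0.8]),
                     q_prop=np.array([0.4, 0.4, 0.4, 0.4])))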

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

    W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

    N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

    N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
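Computing the approximate effective sample size from the normalized weights is a one-liner; the weight vector below is a made-up skewed example.

import numpy as np

def effective_sample_size(w_normalized):
    return 1.0 / np.sum(w_normalized ** 2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # skewed normalized weights
print(effective_sample_size(w))             # ~ 1.9 "effective" particles out of N = 5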

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^N W_n^{(j)}
    end for
    Compute N_eff = 1 / \sum_{i=1}^N (\tilde{W}_n^{(i)})^2
    if N_eff < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}},
        with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
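A minimal Python sketch of the loop above, assuming a 1-D linear-Gaussian state-space model and a hand-picked Gaussian proposal that also looks at the current observation; all of these modelling choices are illustrative assumptions, not the lecture's.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model:  x_n = 0.9 x_{n-1} + N(0, 0.5^2),  y_n = x_n + N(0, 1^2)
A, SIG_V, SIG_E = 0.9, 0.5, 1.0

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def sis_filter(y, N=500, P=None):
    P = P or N / 2                                  # resampling threshold on N_eff
    x = rng.normal(0.0, 1.0, size=N)                # particles x_0^{(i)}
    W = np.full(N, 1.0 / N)
    means = []
    for y_n in y:
        # proposal q(x_n | x_{n-1}^{(i)}, y_n): a Gaussian mixing the prior mean and y_n
        q_mu, q_sig = 0.5 * (A * x) + 0.5 * y_n, 0.7
        x_new = rng.normal(q_mu, q_sig)
        # weight update: W_n = W_{n-1} p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
        W = W * normal_pdf(y_n, x_new, SIG_E) \
              * normal_pdf(x_new, A * x, SIG_V) \
              / normal_pdf(x_new, q_mu, q_sig)
        W = W / W.sum()
        means.append(np.sum(W * x_new))             # filtered mean before resampling
        if 1.0 / np.sum(W ** 2) < P:                # N_eff below threshold: resample
            idx = rng.choice(N, size=N, p=W)
            x_new, W = x_new[idx], np.full(N, 1.0 / N)
        x = x_new
    return np.array(means)

# simulate a short sequence from the same toy model and run the filter
T = 50
x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = A * x_true[n - 1] + rng.normal(0, SIG_V)
    y[n] = x_true[n] + rng.normal(0, SIG_E)
print(sis_filter(y)[:5])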

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set \hat{x}_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize weights: \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^N W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
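A hedged Python sketch of the bootstrap/SIR filter above, reusing the same illustrative 1-D model as before; the proposal is the transition prior, so the unnormalized weight is just the likelihood, and resampling happens at every step.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model:  x_n = 0.9 x_{n-1} + N(0, 0.5^2),  y_n = x_n + N(0, 1^2)
A, SIG_V, SIG_E = 0.9, 0.5, 1.0

def bootstrap_filter(y, N=500):
    x = rng.normal(0.0, 1.0, size=N)                 # x_0^{(i)} ~ p(x_0)
    means = []
    for y_n in y:
        x = A * x + rng.normal(0.0, SIG_V, size=N)   # propagate with the transition prior
        logw = -0.5 * ((y_n - x) / SIG_E) ** 2       # weight = likelihood p(y_n | x_n)
        w = np.exp(logw - logw.max())                # subtract the max for stability
        w /= w.sum()
        means.append(np.sum(w * x))                  # filtered mean estimate
        x = x[rng.choice(N, size=N, p=w)]            # resample at every step
    return np.array(means)

y = np.array([0.0, 0.4, 1.1, 0.8, 1.5, 0.9])         # toy observation sequence
print(bootstrap_filter(y))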

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

    W_n^{(i)} = (W_{n-1}^{(i)})^\alpha \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
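The listings above use plain multinomial resampling (sampling with replacement according to the normalized weights). As an aside not taken from the lecture, a common lower-variance alternative is systematic resampling:

import numpy as np

def systematic_resample(weights, rng=np.random.default_rng()):
    # one uniform offset, then N evenly spaced positions over [0, 1)
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N
    cumsum = np.cumsum(weights)
    cumsum[-1] = 1.0                             # guard against round-off
    return np.searchsorted(cumsum, positions)    # indices of the surviving particles

w = np.array([0.05, 0.05, 0.6, 0.2, 0.1])
print(systematic_resample(w))                    # mostly copies of particle 2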

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[video: dancemv.mp4]  [video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[video: hand.mp4]  [video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[video: imp.mp4]  [video: drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions


  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 43: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69


Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

∫_X f(x) dP(x) ≈ (1/N) ∑_{i=1}^{N} f(x^{(i)}) = f̂_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = ∫ f(x) dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

∫ f(x) p(x) dx = ∫ f(x) (p(x) / q(x)) q(x) dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

f̂ = (1/N) ∑_{i=1}^{N} W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
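A small sketch of importance sampling under assumed densities (target p = N(0, 1), proposal q = N(0, 2²); both choices are mine for illustration), comparing the plain and the self-normalized estimates of E_p[x²] = 1:

import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def p(x):                      # target density: standard normal
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def q(x, s=2.0):               # proposal density: wider normal (heavier tails than p)
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = 2.0 * rng.standard_normal(N)        # draw from q
W = p(x) / q(x)                         # importance weights
f = x ** 2
print("plain IS estimate  :", np.mean(W * f))             # unbiased, true value 1
print("self-normalized IS :", np.sum(W * f) / np.sum(W))  # biased but consistent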

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights sum to one, they are normalized:

f̂ = ∑_{i=1}^{N} W̃(x^{(i)}) f(x^{(i)})   with   W̃(x^{(i)}) = W(x^{(i)}) / ∑_{j=1}^{N} W(x^{(j)})

The variance of the estimate is given by

Var[f̂] = (1/N) Var[f(x) W(x)] = (1/N) Var[f(x) p(x)/q(x)] = (1/N) ∫ (f(x) p(x) / q(x))² dx − (E[f(x)])² / N

The variance can be reduced when q(x) is chosen to

match the shape of p(x), so as to approximate the true variance;

match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure is:

for n = 1 to N do
  Sample u ∼ U(0, 1)
  Sample x ∼ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;

if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood p(y|x) ≤ C, and C is known, then

p(x|y) = p(y|x) p(x) / p(y) ≤ C q(x) / p(y) ≡ C′ q(x)

and the acceptance rate for sample x is p(x|y) / (C′ q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
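A sketch of the rejection loop above for an assumed target, Beta(2, 2) with density p(x) = 6x(1 − x) on [0, 1], a uniform proposal q and C = 1.5 (so p(x) ≤ C q(x) holds); the accepted draws are exact samples from p:

import numpy as np

rng = np.random.default_rng(2)

def p(x):                     # target density: Beta(2, 2), p(x) = 6 x (1 - x) on [0, 1]
    return 6.0 * x * (1.0 - x)

C = 1.5                       # p(x) <= C * q(x) with q = Uniform(0, 1); max p = 1.5 at x = 0.5

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0.0, 1.0)        # draw from the proposal q
        u = rng.uniform(0.0, 1.0)
        if u <= p(x) / (C * 1.0):        # accept with probability p(x) / (C q(x))
            samples.append(x)
    return np.array(samples)

s = rejection_sample(10_000)
print("empirical mean:", s.mean(), "(true mean 0.5)")
print("acceptance rate ~ 1/C =", 1.0 / C)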

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1 | x_0) ··· p(x_n | x_0, ···, x_{n-1})

q(x_{n:0}) = q_0(x_0) q_1(x_1 | x_0) ··· q_n(x_n | x_0, ···, x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = p(x_{n:0}) / q(x_{n:0}) = W_{n-1}(x_{n-1:0}) · p(x_n | x_{n-1:0}) / q_n(x_n | x_{n-1:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ···, N do
  W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ···, N do
  Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / ∑_{j=1}^{N} W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling from each x^{(i)} is proportional to W̃(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
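A minimal multinomial-resampling sketch of the last step above (the particles and weights below are made up for illustration): it draws indices with probability proportional to the normalized weights and resets the weights to 1/N.

import numpy as np

def multinomial_resample(particles, weights, rng):
    # Draw N indices with replacement, with probability proportional to the
    # normalized importance weights, and reset the weights to 1/N.
    N = len(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize: W̃ = W / sum(W)
    idx = rng.choice(N, size=N, replace=True, p=w)
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(3)
particles = rng.standard_normal(5)
weights = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
new_particles, new_weights = multinomial_resample(particles, weights, rng)
print(new_particles, new_weights)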

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary.

Deterministic resampling is taken at every k-th time step (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;

when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ···, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ∼ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ∼ p(x_i | x_{1,n}, ···, x_{i-1,n}, x_{i+1,n-1}, ···, x_{N_x,n-1})
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
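A small Gibbs-sampling sketch for an assumed two-dimensional target, a zero-mean Gaussian with correlation 0.8, chosen because its full conditionals are simple univariate Gaussians (this example is mine, not from the lecture):

import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                      # target: zero-mean bivariate Gaussian with correlation rho
M = 5_000

x1, x2 = 0.0, 0.0              # initial state x_0
chain = np.empty((M, 2))
for n in range(M):
    # Alternate the full conditionals, which are univariate Gaussians here:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2),  x2 | x1 ~ N(rho * x1, 1 - rho^2)
    x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    chain[n] = (x1, x2)

print("sample correlation:", np.corrcoef(chain[1000:].T)[0, 1])   # ~ 0.8 after burn-in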

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variances) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be taken with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x) and further computing the volume:

∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | −∇∇ log f(x̂) |^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
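A one-dimensional sketch of the Laplace approximation for an assumed unimodal integrand f(x) = exp(−x²/2 − x⁴/4) with mode x̂ = 0 (my own toy function), compared against a brute-force Riemann sum:

import numpy as np

# Toy integrand (illustrative choice): f(x) = exp(-x^2/2 - x^4/4), unimodal with mode at x = 0.
f = lambda x: np.exp(-0.5 * x ** 2 - 0.25 * x ** 4)

x_hat = 0.0                                   # maximum of f (known analytically here)
hess = -1.0 - 3.0 * x_hat ** 2                # d^2/dx^2 log f at x_hat
laplace = np.sqrt(2 * np.pi) * f(x_hat) * np.abs(-hess) ** -0.5

grid = np.linspace(-10, 10, 200_001)          # brute-force reference value
numeric = f(grid).sum() * (grid[1] - grid[0])

print("Laplace approximation:", laplace)      # (2*pi)^(1/2) f(x̂) |−∇∇ log f(x̂)|^(−1/2)
print("numerical integral   :", numeric)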


Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ ∑_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
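A sketch using Gauss-Hermite quadrature, one standard instance of such a rule (the test function is my own choice), to approximate E[f(X)] for X ∼ N(0, 1); here E[cos X] = e^{−1/2}:

import numpy as np

# Gauss-Hermite quadrature: nodes/weights for the weight function exp(-x^2).
# For X ~ N(0, 1): E[f(X)] = (1/sqrt(pi)) * sum_k w_k f(sqrt(2) * x_k)
f = lambda x: np.cos(x)                       # smooth test function; E[cos X] = exp(-1/2)

for m in [3, 5, 10]:
    nodes, weights = np.polynomial.hermite.hermgauss(m)
    approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
    print(m, approx, "(exact:", np.exp(-0.5), ")")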

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give a greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) ∑_{i=1}^{N} |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = ∑_{j=1}^{m} c_j N(x; x̄_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and ∑_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
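A minimal sketch of evaluating such a Gaussian mixture density; the weights, means and variances below are illustrative values, not taken from the lecture:

import numpy as np

def gaussian_mixture_pdf(x, c, mu, sigma):
    # p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ c

# A bimodal, clearly non-Gaussian density built from two components.
c = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.5])
sigma = np.array([0.5, 1.0])
xs = np.linspace(-5, 5, 11)
print(gaussian_mixture_pdf(xs, c, mu, sigma))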

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves over time according to the state equation, with the evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{n:0}) ≈ (1/N) ∑_{i=1}^{N} δ(x_n − x_n^{(i)}) ≡ p̂(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. draws from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{n:0}) dx_n = (1/N) ∑_{i=1}^{N} ∫ f(x_n) δ(x_n − x_n^{(i)}) dx_n = (1/N) ∑_{i=1}^{N} f(x_n^{(i)}) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0} | x_n) p(x_n) / q(x_n | y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) (p(x_n | y_{n:0}) / q(x_n | y_{n:0})) q(x_n | y_{n:0}) dx_n

= ∫ f(x_n) (W_n(x_n) / p(y_{n:0})) q(x_n | y_{n:0}) dx_n

= ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ p(y_{n:0} | x_n) p(x_n) dx_n

= ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n | y_{n:0}) dx_n

= E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n | y_{n:0})}[W_n(x_n)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n | y_{n:0})}[W_n(x_n)]

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

E[f(x_n)] ≈ [ (1/N) ∑_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)}) ] / [ (1/N) ∑_{i=1}^{N} W_n(x_n^{(i)}) ] = ∑_{i=1}^{N} W̃(x_n^{(i)}) f(x_n^{(i)}) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^{(i)}) = W_n(x_n^{(i)}) / ∑_{j=1}^{N} W_n(x_n^{(j)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) · p(y_n | x_n) p(x_n | x_{n-1}) / p(y_n | y_{n-1:0})

We can then create a recursive rule to update the weights:

W_n^{(i)} = p(x_{n:0}^{(i)} | y_{n:0}) / q(x_{n:0}^{(i)} | y_{n:0})

∝ p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0}) / [ q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0}) ]

= W_{n-1}^{(i)} · p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} · p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_eff = N / E_{q(·|y_{n:0})}[(W(x_{n:0}))²] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / ∑_{i=1}^{N} (W̃(x_{n:0}^{(i)}))²

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
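The approximate effective sample size is essentially one line of code; a small sketch (the example weight vectors below are made up):

import numpy as np

def effective_sample_size(weights):
    # N_eff ≈ 1 / sum_i (W̃^(i))^2, computed from the normalized weights.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))   # 4.0: perfectly balanced weights
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))   # ~1.06: severe degeneracy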

SIS particle filter with resampling

for n = 0, ···, T do
  for i = 1, ···, N do
    Draw samples x_n^{(i)} ∼ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ···, N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ···, N do
    Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / ∑_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_eff = 1 / ∑_{i=1}^{N} (W̃(x_{n:0}^{(i)}))²
  if N_eff < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W̃_n^{(i)}
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
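A compact sketch of this algorithm for an assumed 1-D linear-Gaussian model, using the transition prior as the proposal so that the weight update reduces to W_n = W_{n-1} p(y_n | x_n); the model, noise levels and threshold are my own choices, not from the slides:

import numpy as np

rng = np.random.default_rng(5)

# Toy model: x_n = 0.9 x_{n-1} + v_n, v_n ~ N(0, 1);  y_n = x_n + e_n, e_n ~ N(0, 0.5^2)
T, N, P_thresh = 50, 500, 250

def likelihood(y, x, s=0.5):
    return np.exp(-0.5 * ((y - x) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Simulate a ground-truth trajectory and observations.
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.standard_normal()
y = x_true + 0.5 * rng.standard_normal(T)

# SIS with occasional resampling; proposal q(x_n | x_{n-1:0}, y_n) = p(x_n | x_{n-1}).
particles = rng.standard_normal(N)            # x_0^(i)
W = np.full(N, 1.0 / N)
estimates = []
for n in range(T):
    particles = 0.9 * particles + rng.standard_normal(N)   # draw x_n^(i) from the proposal
    W = W * likelihood(y[n], particles)                     # sequential weight update
    W = W / W.sum()                                         # normalize
    estimates.append(np.sum(W * particles))                 # posterior-mean estimate (before resampling)
    N_eff = 1.0 / np.sum(W ** 2)
    if N_eff < P_thresh:                                    # resample only when degeneracy is detected
        idx = rng.choice(N, size=N, replace=True, p=W)
        particles = particles[idx]
        W = np.full(N, 1.0 / N)

print("RMSE of filtered mean:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))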

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ···, N do
  Sample x_0^{(i)} ∼ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ···, T do
  for i = 1, ···, N do
    Importance sampling: x_n^{(i)} ∼ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ···, N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ···, N do
    Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / ∑_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to W̃_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
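A compact per-step sketch of this bootstrap/SIR update (propagate with the transition prior, weight by the likelihood only, resample at every step); the toy transition and likelihood below are assumptions of mine, not from the slides:

import numpy as np

def bootstrap_step(particles, y_n, transition, likelihood, rng):
    # One step of the bootstrap/SIR filter.
    particles = transition(particles, rng)                 # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = likelihood(y_n, particles)                         # W_n^(i) = p(y_n | x_n^(i))
    W = W / W.sum()                                        # normalize
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=W)
    return particles[idx]                                  # equally weighted after resampling

rng = np.random.default_rng(6)
transition = lambda x, rng: 0.9 * x + rng.standard_normal(len(x))          # toy AR(1) transition
likelihood = lambda y, x, s=0.5: np.exp(-0.5 * ((y - x) / s) ** 2)         # unnormalized Gaussian likelihood
particles = rng.standard_normal(1000)
particles = bootstrap_step(particles, y_n=0.7, transition=transition, likelihood=likelihood, rng=rng)
print("filtered mean estimate:", particles.mean())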

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69


  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 44: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69


Iterative Quadrature

A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
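For instance, when p(x) is a Gaussian weight, Gauss-Hermite quadrature supplies the points x_k and coefficients c_k directly. A short Python sketch, where the integrand f and the number of nodes are illustrative assumptions:

    import numpy as np

    # Gauss-Hermite rule: integral of g(t) exp(-t^2) dt ~= sum_k w_k g(t_k)
    t, w = np.polynomial.hermite.hermgauss(20)

    f = lambda x: np.sin(x)**2                          # assumed smooth integrand
    # E[f(X)] for X ~ N(0,1), via the change of variables x = sqrt(2) t
    approx = np.sum(w * f(np.sqrt(2) * t)) / np.sqrt(np.pi)

    # compare with a plain Monte Carlo estimate
    mc = f(np.random.default_rng(0).normal(size=200000)).mean()
    print(approx, mc)            # both close to (1 - exp(-2)) / 2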

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so it is hard to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
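A minimal sketch of one point-mass (grid-based) filter update on a regular 1-D grid; the random-walk dynamics, the Gaussian likelihood, the grid limits, and the observation sequence are all illustrative assumptions. The density is carried as point masses, pushed through the transition kernel, reweighted by the likelihood, and renormalized.

    import numpy as np
    from scipy.stats import norm

    grid = np.linspace(-10, 10, 401)           # N_z discrete cell states
    dx = grid[1] - grid[0]
    p = norm.pdf(grid, 0.0, 2.0) * dx          # point masses of the prior p(x_0)
    p /= p.sum()

    # transition kernel K[i, j] = p(x_n = grid[i] | x_{n-1} = grid[j]) (random walk, std 0.5)
    K = norm.pdf(grid[:, None], grid[None, :], 0.5) * dx

    def grid_filter_step(p, y, obs_std=1.0):
        p_pred = K @ p                         # prediction: p(x_n | y_{n-1:0})
        lik = norm.pdf(y, grid, obs_std)       # likelihood p(y_n | x_n) on the grid
        p_post = lik * p_pred                  # Bayes update
        return p_post / p_post.sum()

    for y in [0.3, 0.8, 1.1]:                  # assumed observation sequence
        p = grid_filter_step(p, y)
    print(grid[np.argmax(p)])                  # MAP cell of the filtered density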

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = \int_X x^k\, p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)}\right)^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
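A tiny sketch of the sample-moment approximation; the "true" distribution used here (a Gamma) is only an illustrative assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=3.0, scale=1.0, size=100000)    # samples from the assumed true distribution

    def sample_moment(x, k):
        return np.mean(x**k)                            # m_k ~= (1/N) sum_i (x^(i))^k

    print([sample_moment(x, k) for k in (1, 2, 3)])     # Gamma(3,1): exact moments are 3, 12, 60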

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
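As a quick illustration, a skewed non-Gaussian density can be approximated by a small Gaussian mixture fitted to samples; one convenient way to obtain the c_j, mu_j, Sigma_j is scikit-learn's GaussianMixture. The target density and the number of components below are illustrative assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=3.0, scale=1.0, size=20000)        # samples of a non-Gaussian density

    gmm = GaussianMixture(n_components=3, random_state=0).fit(x.reshape(-1, 1))
    print(gmm.weights_)                                    # the c_j, summing to 1
    print(gmm.means_.ravel(), gmm.covariances_.ravel())    # component means and variances

    # evaluate the Gaussian-sum approximation on a grid
    grid = np.linspace(0, 12, 200).reshape(-1, 1)
    p_hat = np.exp(gmm.score_samples(grid))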

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, and the particles are placed in them according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n \mid y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n \mid y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n \mid y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n \mid y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n \mid y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} \mid x_n)\,p(x_n)}{q(x_n \mid y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n)\,\frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})}\,q(x_n \mid y_{n:0})\,dx_n
          = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n \mid y_{n:0})\,dx_n
          = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}{\int p(y_{n:0} \mid x_n)\,p(x_n)\,dx_n}
          = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n \mid y_{n:0})\,dx_n}
          = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n \mid y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
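A compact sketch of this self-normalized estimator: samples come from q, and the normalized weights correct the mismatch with p. The target, the heavy-tailed proposal, and the test function f are illustrative assumptions; both densities are deliberately left unnormalized to show that the normalizing constants cancel.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 50000
    x = rng.standard_t(df=3, size=N)            # proposal q: heavy-tailed Student-t(3)

    p_unnorm = np.exp(-0.5 * (x - 1.0)**2)      # assumed target, proportional to N(1, 1)
    q_unnorm = (1 + x**2 / 3)**(-2)             # Student-t(3) density up to a constant
    W = p_unnorm / q_unnorm                     # unnormalized weights W_n(x)

    Wn = W / W.sum()                            # normalized weights
    f = lambda x: x**2
    print(np.sum(Wn * f(x)))                    # E_p[f(x)] ~= 2 (mean^2 + variance = 1 + 1)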

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{n:0} \mid y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\,\frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})} \propto \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)} \mid y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

Sampling Importance Resampling (SIR) filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n \mid y_{n:0}) than in p(x_{n:0} \mid y_{n:0}).

Let's assume that q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot \mid y_{n:0})}\left[(W(x_{n:0}))^2\right]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} \left(\tilde{W}(x_{n:0}^{(i)})\right)^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
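In code, the approximation of N_eff from the normalized weights is one line; the example weights and the threshold P = N/2 below are illustrative assumptions.

    import numpy as np

    def effective_sample_size(Wn):
        # Wn: normalized importance weights, summing to 1
        return 1.0 / np.sum(Wn**2)

    Wn = np.array([0.7, 0.1, 0.1, 0.05, 0.05])       # a skewed weight vector
    P = 0.5 * len(Wn)                                # assumed threshold: resample if Neff < N/2
    print(effective_sample_size(Wn), effective_sample_size(Wn) < P)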

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
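A minimal Python sketch of one such step with Neff-triggered multinomial resampling, written generically. The callables propagate, trans_pdf, lik_pdf, and prop_pdf are assumptions supplied by the caller; when no proposal density is given, the transition prior is assumed to have been used as the proposal, so the ratio collapses accordingly.

    import numpy as np

    rng = np.random.default_rng(0)

    def sis_step(particles, weights, y, propagate, trans_pdf, lik_pdf,
                 prop_pdf=None, resample_frac=0.5):
        # draw x_n^(i) ~ q(x_n | x_{n-1}^(i), y_n)
        new_particles = propagate(particles, y)
        # weight update: W_n = W_{n-1} * p(y|x) p(x|x_prev) / q(x|x_prev, y)
        num = lik_pdf(y, new_particles) * trans_pdf(new_particles, particles)
        den = prop_pdf(new_particles, particles, y) if prop_pdf else trans_pdf(new_particles, particles)
        w = weights * num / den
        w /= w.sum()
        # resample with replacement when Neff drops below a fraction of N, then reset weights
        neff = 1.0 / np.sum(w**2)
        if neff < resample_frac * len(w):
            idx = rng.choice(len(w), size=len(w), p=w)
            new_particles, w = new_particles[idx], np.full(len(w), 1.0 / len(w))
        return new_particles, w

With the bootstrap choice q(x_n | x_{n-1}, y_n) = p(x_n | x_{n-1}), the ratio reduces to the likelihood alone, which is exactly the SIR filter described next.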

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the current set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
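An end-to-end Python sketch of this bootstrap filter on a toy scalar model; the random-walk dynamics, the noise levels, and the simulated data are illustrative assumptions, resampling is plain multinomial, and it is performed at every step as in the algorithm above.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    T, N = 50, 1000
    # assumed model: x_n = x_{n-1} + N(0, 0.3^2),  y_n = x_n + N(0, 0.5^2)
    x_true = np.cumsum(rng.normal(0, 0.3, T))
    y = x_true + rng.normal(0, 0.5, T)

    particles = rng.normal(0, 1, N)                      # x_0^(i) ~ p(x_0)
    estimates = []
    for n in range(T):
        particles = particles + rng.normal(0, 0.3, N)    # importance sampling from the transition prior
        W = norm.pdf(y[n], particles, 0.5)               # weight update: W_n^(i) = p(y_n | x_n^(i))
        W /= W.sum()
        estimates.append(np.sum(W * particles))          # posterior mean, computed before resampling
        idx = rng.choice(N, size=N, p=W)                 # resampling according to W_n
        particles = particles[idx]

    print(np.mean((np.array(estimates) - x_true)**2))    # filter MSE on this run

Note that the state estimate is taken from the weighted particles before resampling, in line with the remark on the next slide.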

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha}\,\frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
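As a sketch, the annealed update only changes the exponent on the previous weight; the value of alpha and the density arguments are assumptions of the caller.

    import numpy as np

    def annealed_weight(w_prev, lik, trans, prop, alpha=0.5):
        # W_n = W_{n-1}^alpha * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n), with 0 < alpha < 1
        return (w_prev ** alpha) * lik * trans / prop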

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69


Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69


Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69


Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69


More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sampling Importance Resampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIS filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions



Monte Carlo sampling

Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint

X

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2

N

By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability

The convergence rate is assessed by the Central Limit Theoremradic

N(

fN minus E [f ])sim N (0 σ2)

where σ2 is the variance of f (x) The error rate is of order O(Nminus12)

An important property is that the estimation accuracy is independent of thedimensionality of the state space

The variance of estimate is inversely proportional to the number of samples

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.

Minimal variance.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

$\int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx$

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

$\hat f = \frac{1}{N}\sum_{i=1}^{N} W\big(x^{(i)}\big)\, f\big(x^{(i)}\big)$

where $W(x^{(i)}) = \frac{p(x^{(i)})}{q(x^{(i)})}$ are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure normalization, the importance weights are normalized:

$\hat f = \sum_{i=1}^{N} \tilde W\big(x^{(i)}\big) f\big(x^{(i)}\big) \quad \text{with} \quad \tilde W\big(x^{(i)}\big) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$

The variance of the estimate is given by

$\mathrm{Var}[\hat f] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\Big[f(x)\frac{p(x)}{q(x)}\Big] = \frac{1}{N}\int \frac{\big(f(x)p(x)\big)^2}{q(x)}\,dx - \frac{\big(E[f(x)]\big)^2}{N}$

The variance can be reduced when q(x) is chosen to

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
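
As an illustration of the normalized-weight estimator above, here is a minimal self-normalized importance sampling sketch; the unnormalized Gaussian-like target and the N(0, 3^2) proposal are assumptions for the example only.

import numpy as np

rng = np.random.default_rng(0)

def importance_expectation(f, log_p_unnorm, q_sampler, log_q, N):
    # Self-normalized importance sampling: E_p[f] ~= sum_i W_tilde_i * f(x_i)
    x = q_sampler(N)
    log_w = log_p_unnorm(x) - log_q(x)      # unnormalized log-weights
    w = np.exp(log_w - log_w.max())         # stabilize before exponentiating
    w_tilde = w / w.sum()                   # normalized importance weights
    return np.sum(w_tilde * f(x))

# Example: target p(x) propto exp(-(x-2)^2/2), proposal q = N(0, 3^2)
est = importance_expectation(
    f=lambda x: x,
    log_p_unnorm=lambda x: -0.5 * (x - 2.0) ** 2,
    q_sampler=lambda n: 3.0 * rng.standard_normal(n),
    log_q=lambda x: -0.5 * (x / 3.0) ** 2 - np.log(3.0),
    N=20_000,
)
print(est)  # close to the target mean, 2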

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure:

for n = 1 to N do
  Sample u ∼ U(0, 1)
  Sample x ∼ q(x)
  if u > p(x) / (C q(x)) then
    Reject x and repeat the sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate (the bound p(x) ≤ C q(x) may be violated);
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

$p(x\mid y) = \frac{p(y\mid x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C' q(x)$

and the acceptance rate for a sample x is $\frac{p(x\mid y)}{C' q(x)} = \frac{p(y\mid x)}{C}$.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
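
A small sketch of the accept/reject loop above; the standard normal target, the Laplace proposal, and the bound C = 1.4 are illustrative assumptions (for this pair, p(x)/q(x) ≤ sqrt(2e/π) ≈ 1.32).

import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(p, q_sample, q_pdf, C, n_samples):
    # Accept x ~ q with probability p(x) / (C * q(x)); requires p(x) <= C * q(x) everywhere.
    out = []
    while len(out) < n_samples:
        x = q_sample()
        u = rng.uniform()
        if u <= p(x) / (C * q_pdf(x)):
            out.append(x)
    return np.array(out)

# Example: target p = N(0, 1), proposal q = Laplace(0, 1)
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: 0.5 * np.exp(-np.abs(x))
samples = rejection_sample(p, lambda: rng.laplace(0.0, 1.0), q_pdf, C=1.4, n_samples=5000)
print(samples.mean(), samples.std())  # roughly 0 and 1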

Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$q(x_{0:n}\mid y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t \mid x_{0:t-1},\, y_{0:t})$

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability, we have

$p(x_{0:n}) = p(x_0)\,p(x_1\mid x_0)\cdots p(x_n\mid x_0,\cdots,x_{n-1})$

$q(x_{0:n}) = q_0(x_0)\,q_1(x_1\mid x_0)\cdots q_n(x_n\mid x_0,\cdots,x_{n-1})$

The weights can be recursively calculated as

$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\,\frac{p(x_n\mid x_{0:n-1})}{q_n(x_n\mid x_{0:n-1})}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights:

Sample N random samples $\{x^{(i)}\}_{i=1}^{N}$ from q(x)
for i = 1, ..., N do
  Compute $W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})$
end for
for i = 1, ..., N do
  Normalize the weights $\tilde W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^{N}$, where the probability of resampling each $x^{(i)}$ is proportional to $\tilde W(x^{(i)})$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
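
A minimal sketch of the sampling-importance-resampling step; the Gaussian target and proposal are assumptions for the example, and plain multinomial resampling is used for simplicity.

import numpy as np

rng = np.random.default_rng(0)

def sir(log_p, log_q, q_sampler, N):
    # Sampling-importance-resampling: weight samples from q, then resample
    # with replacement in proportion to the normalized importance weights.
    x = q_sampler(N)
    log_w = log_p(x) - log_q(x)
    w = np.exp(log_w - log_w.max())
    w_tilde = w / w.sum()
    idx = rng.choice(N, size=N, replace=True, p=w_tilde)   # multinomial resampling
    return x[idx]                                           # approximately distributed as p

# Example: target p = N(2, 1) (up to a constant), proposal q = N(0, 3^2)
x_p = sir(log_p=lambda x: -0.5 * (x - 2.0) ** 2,
          log_q=lambda x: -0.5 * (x / 3.0) ** 2,
          q_sampler=lambda n: 3.0 * rng.standard_normal(n),
          N=20_000)
print(x_p.mean(), x_p.std())  # roughly 2 and 1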

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample $x_{i,n} \sim p(x_i \mid x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
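
A minimal Gibbs sampler sketch for the two-dimensional case; the correlated bivariate Gaussian target (and its closed-form conditionals) is an assumption chosen only so that the conditional draws are easy to write down.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho, M):
    # Target: (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    # Conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
    x1, x2 = 0.0, 0.0
    samples = np.empty((M, 2))
    for n in range(M):
        x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
        samples[n] = (x1, x2)
    return samples

s = gibbs_bivariate_gaussian(rho=0.8, M=20_000)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])  # close to 0.8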

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from the previous state, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contours).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behaviour.

This is good since the gradient direction might indicate the way to find states with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat x$ of f(x) and computing the volume:

$\int f(x)\,dx \approx (2\pi)^{N_x/2}\, f(\hat x)\, \big|-\nabla\nabla \log f(\hat x)\big|^{-1/2}$

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at $\hat x$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
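
A small numerical sketch of this Laplace approximation in one dimension; the finite-difference second derivative and the Gaussian-shaped integrand are assumptions made only to keep the example self-contained.

import numpy as np

def laplace_approx_integral(f, x_grid):
    # Fit a Gaussian at the maximizer x_hat of f and approximate integral f(x) dx
    # by (2*pi)^(1/2) * f(x_hat) * | -(log f)''(x_hat) |^(-1/2).
    fx = f(x_grid)
    i_hat = int(np.argmax(fx))
    x_hat, h = x_grid[i_hat], x_grid[1] - x_grid[0]
    logf = np.log(f(np.array([x_hat - h, x_hat, x_hat + h])))
    curvature = -(logf[0] - 2 * logf[1] + logf[2]) / h**2   # -(log f)'' at x_hat
    return np.sqrt(2 * np.pi) * f(x_hat) / np.sqrt(curvature)

# Example: f(x) = exp(-(x-1)^2 / (2*0.5^2)); exact integral = sqrt(2*pi)*0.5 ~ 1.2533
f = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)
print(laplace_approx_integral(f, np.linspace(-4, 6, 2001)))  # ~1.2533 (exact here, since f is Gaussian)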


Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$

where p(x) is treated as a weighting function and $x_k$ is the k-th quadrature point.

The values $x_k$ are determined by the weighting function p(x) on the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n \mid y_{0:n})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so greater resolution cannot be given to regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

$m_k = E[x^k] = \int_X x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^k$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to any desired degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
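
A minimal sketch of evaluating a Gaussian-sum approximation to a one-dimensional density; the component weights, means, and variances are arbitrary illustrative values.

import numpy as np

def gaussian_sum_pdf(x, weights, means, variances):
    # p(x) = sum_j c_j * N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1.
    x = np.asarray(x)[..., None]                       # broadcast over components
    norm = 1.0 / np.sqrt(2 * np.pi * np.asarray(variances))
    comps = norm * np.exp(-0.5 * (x - np.asarray(means)) ** 2 / np.asarray(variances))
    return comps @ np.asarray(weights)

# Example: a bimodal density approximated by two Gaussian components
xs = np.linspace(-6, 6, 601)
p = gaussian_sum_pdf(xs, weights=[0.3, 0.7], means=[-2.0, 1.5], variances=[0.5, 1.0])
print(p.sum() * (xs[1] - xs[0]))  # crude Riemann check; close to 1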

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves along time according to the state equation, with its evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$p(x_n \mid y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta\big(x_n - x_n^{(i)}\big) \equiv \hat p(x_n \mid y_{0:n})$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n \mid y_{0:n})$.

By this approximation we can estimate the mean of a nonlinear function:

$E[f(x_n)] \approx \int f(x_n)\,\hat p(x_n \mid y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} \int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat f_N(x)$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution $q(x_n \mid y_{0:n})$. Let's define

$W_n(x_n) = \frac{p(y_{0:n}\mid x_n)\,p(x_n)}{q(x_n\mid y_{0:n})}$

We can then write

$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n\mid y_{0:n})}{q(x_n\mid y_{0:n})}\,q(x_n\mid y_{0:n})\,dx_n$

$\qquad = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{0:n})}\,q(x_n\mid y_{0:n})\,dx_n$

$\qquad = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n\mid y_{0:n})\,dx_n}{\int p(y_{0:n}\mid x_n)\,p(x_n)\,dx_n}$

$\qquad = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n\mid y_{0:n})\,dx_n}{\int W_n(x_n)\,q(x_n\mid y_{0:n})\,dx_n}$

$\qquad = \frac{E_{q(x_n\mid y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n\mid y_{0:n})}[W_n(x_n)]}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

$E[f(x_n)] = \frac{E_{q(x_n\mid y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n\mid y_{0:n})}[W_n(x_n)]}$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n \mid y_{0:n})$ we can approximate

$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^{N} \tilde W\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big) \equiv \hat f(x)$

where the normalized weights are defined as

$\tilde W\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^{N} W_n\big(x_n^{(j)}\big)}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

$q(x_{0:n}\mid y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t\mid x_{0:t-1},\, y_{0:t})$

As before, the posterior can be written as

$p(x_{0:n}\mid y_{0:n}) = p(x_{0:n-1}\mid y_{0:n-1})\,\frac{p(y_n\mid x_n)\,p(x_n\mid x_{n-1})}{p(y_n\mid y_{0:n-1})}$

We can then create a recursive rule to update the weights:

$W_n^{(i)} = \frac{p\big(x_{0:n}^{(i)}\mid y_{0:n}\big)}{q\big(x_{0:n}^{(i)}\mid y_{0:n}\big)} \propto \frac{p\big(y_n\mid x_n^{(i)}\big)\,p\big(x_n^{(i)}\mid x_{n-1}^{(i)}\big)\,p\big(x_{0:n-1}^{(i)}\mid y_{0:n-1}\big)}{q\big(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_{0:n}\big)\,q\big(x_{0:n-1}^{(i)}\mid y_{0:n-1}\big)} = W_{n-1}^{(i)}\,\frac{p\big(y_n\mid x_n^{(i)}\big)\,p\big(x_n^{(i)}\mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_{0:n}\big)}$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n \mid y_{0:n})$ than in the full path posterior $p(x_{0:n} \mid y_{0:n})$.

Let's assume that $q\big(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_{0:n}\big) = q\big(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_n\big)$; then we can write

$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n\mid x_n^{(i)}\big)\,p\big(x_n^{(i)}\mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_n\big)}$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$N_{\text{eff}} = \frac{N}{E_{q(\cdot\mid y_{0:n})}\big[(W(x_{0:n}))^2\big]} \le N$

In practice this cannot be computed, so we approximate it by

$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde W(x_{0:n}^{(i)})\big)^2}$

When $N_{\text{eff}}$ is below a threshold P, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
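
A one-function sketch of this effective-sample-size criterion computed from normalized weights; the resample-when-below-N/2 rule in the comment is a common heuristic, not something prescribed by the slides.

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~= 1 / sum_i w_tilde_i^2, where w_tilde are normalized importance weights.
    w_tilde = np.asarray(w_tilde)
    return 1.0 / np.sum(w_tilde ** 2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])      # skewed weights
print(effective_sample_size(w))                 # ~1.9: far below N = 5
print(effective_sample_size(np.full(5, 0.2)))   # 5.0: uniform weights, no degeneracy
# A common rule: resample when effective_sample_size(w) < N / 2.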

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw a sample $x_n^{(i)} \sim q\big(x_n\mid x_{0:n-1}^{(i)},\, y_{0:n}\big)$
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)},\, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate the weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n\mid x_n^{(i)})\,p(x_n^{(i)}\mid x_{n-1}^{(i)})}{q(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde W_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}$
  end for
  Compute $N_{\text{eff}} = 1 / \sum_{i=1}^{N} \big(\tilde W_n^{(i)}\big)^2$
  if $N_{\text{eff}} < P$ then
    Generate new particles $x_{0:n}^{(j)}$ by resampling with replacement N times from $\{x_{0:n}^{(i)}\}$, with probability $P\big(x_{0:n}^{(j)} = x_{0:n}^{(i)}\big) = \tilde W_n^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
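
A compact NumPy sketch of this SIS filter with occasional resampling; the one-dimensional linear-Gaussian model, the transition-prior proposal, and all parameter values are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)

def sis_filter(ys, N=500, q_std=1.0, r_std=0.5, a=0.9, threshold=None):
    # SIS particle filter with resampling when N_eff drops below a threshold.
    # Assumed model: x_n = a*x_{n-1} + q_std*v_n,  y_n = x_n + r_std*w_n.
    threshold = N / 2 if threshold is None else threshold
    x = rng.standard_normal(N)                 # particles at n = 0
    w = np.full(N, 1.0 / N)                    # importance weights
    means = []
    for y in ys:
        x = a * x + q_std * rng.standard_normal(N)        # proposal = transition prior
        like = np.exp(-0.5 * ((y - x) / r_std) ** 2)       # p(y_n | x_n^(i)) up to a constant
        w = w * like                                       # weight recursion for this proposal
        w = w / w.sum()
        means.append(np.sum(w * x))                        # posterior mean, computed before resampling
        n_eff = 1.0 / np.sum(w ** 2)
        if n_eff < threshold:                              # resample and reset weights to 1/N
            idx = rng.choice(N, size=N, replace=True, p=w)
            x, w = x[idx], np.full(N, 1.0 / N)
    return np.array(means)

# Simulate the assumed model and filter it
xt, ys = 0.0, []
for _ in range(50):
    xt = 0.9 * xt + rng.standard_normal()
    ys.append(xt + 0.5 * rng.standard_normal())
print(sis_filter(np.array(ys))[:5])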

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p\big(x_n\mid x_{n-1}^{(i)}\big)$
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)},\, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p\big(y_n\mid x_n^{(i)}\big)$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde W_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde W_n^{(i)}$
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
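
For comparison with the SIS sketch above, a bootstrap/SIR sketch that resamples at every step, using the same assumed one-dimensional linear-Gaussian model; again, the model and its parameters are illustrative only.

import numpy as np

rng = np.random.default_rng(1)

def bootstrap_filter(ys, N=500, q_std=1.0, r_std=0.5, a=0.9):
    # Bootstrap/SIR particle filter: transition prior as proposal, resample at every step.
    x = rng.standard_normal(N)
    means = []
    for y in ys:
        x = a * x + q_std * rng.standard_normal(N)       # propagate through the dynamics
        w = np.exp(-0.5 * ((y - x) / r_std) ** 2)         # weight = likelihood p(y_n | x_n^(i))
        w = w / w.sum()
        means.append(np.sum(w * x))
        x = x[rng.choice(N, size=N, replace=True, p=w)]   # resample; weights implicitly reset to 1/N
    return np.array(means)

# Simulate the assumed model and filter it
xt, ys = 0.0, []
for _ in range(50):
    xt = 0.9 * xt + rng.standard_normal()
    ys.append(xt + 0.5 * rng.standard_normal())
print(bootstrap_filter(np.array(ys))[:5])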

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$W_n = \big(W_{n-1}\big)^{\alpha}\,\frac{p\big(y_n\mid x_n^{(i)}\big)\,p\big(x_n^{(i)}\mid x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}\mid x_{0:n-1}^{(i)},\, y_n\big)}$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 47: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Fundamental problems of Monte Carlo estimation

Monte carlo methods approximateintX

f (x)dP(x) asymp 1

N

Nsumi=1

f(

x(i))

= fN

There are two fundamental problems

How to drawn samples from a probability distribution P(x)

How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =

intf (x)dP(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69

Important properties of an estimator

Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity

Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value

Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators

Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69

Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. draws from p(x_n | y_{0:n}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{0:n}) \, dx_n
           = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) \, dx_n
           = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
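
A minimal sketch of this estimator, assuming we already have draws from the filtered posterior (the stand-in Gaussian samples below are purely illustrative):

import numpy as np

def mc_expectation(f, posterior_samples):
    # E[f(x_n)] is approximated by (1/N) * sum_i f(x_n^(i)).
    x = np.asarray(posterior_samples, dtype=float)
    return float(np.mean(f(x)))

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=1.0, size=5000)    # stand-in for draws from p(x_n | y_{0:n})
print(mc_expectation(np.sin, samples))                 # approximates E[sin(x_n)]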

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define

W_n(x_n) = \frac{p(y_{0:n} | x_n) \, p(x_n)}{q(x_n | y_{0:n})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} q(x_n | y_{0:n}) \, dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{0:n})} q(x_n | y_{0:n}) \, dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int p(y_{0:n} | x_n) p(x_n) \, dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int W_n(x_n) q(x_n | y_{0:n}) \, dx_n}
          = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
           = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
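
A short sketch of this self-normalized importance-sampling estimate; the target and proposal below (both Gaussians) are hypothetical stand-ins chosen so the answer can be checked, not part of the lecture.

import numpy as np

def snis_expectation(f, proposal_samples, log_target, log_proposal):
    # Self-normalized IS: E_p[f(x)] ~ sum_i W~_i f(x_i), with W~_i = W_i / sum_j W_j.
    x = np.asarray(proposal_samples, dtype=float)
    log_w = log_target(x) - log_proposal(x)
    log_w -= log_w.max()                    # stabilize before exponentiating
    w = np.exp(log_w)
    w_norm = w / w.sum()                    # normalized weights W~_i
    return float(np.sum(w_norm * f(x)))

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 2.0, size=20_000)                                   # proposal q = N(0, 2^2)
log_q = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))
log_p = lambda x: -0.5 * (x - 1.0) ** 2                                  # target p = N(1, 1), known only up to a constant
print(snis_expectation(lambda x: x, xs, log_p, log_q))                   # should be close to 1.0

Note that the normalization in the last step makes the unknown constant of the target density irrelevant, which is exactly why the ratio form above is useful.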

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})

As before, the posterior can be written recursively as

p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})}
          \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) \, q(x_{0:n-1}^{(i)} | y_{0:n-1})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}
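
In code, this recursion is typically carried out in the log domain to avoid underflow of the weights. A small sketch follows; the log_lik, log_trans and log_prop callables are placeholders for the model's log densities, not functions defined in the lecture.

import numpy as np

def update_log_weights(log_w_prev, x_new, x_prev, y_new, log_lik, log_trans, log_prop):
    # log W_n = log W_{n-1} + log p(y_n|x_n) + log p(x_n|x_{n-1}) - log q(x_n|x_{0:n-1}, y_{0:n})
    log_w = (log_w_prev
             + log_lik(y_new, x_new)
             + log_trans(x_new, x_prev)
             - log_prop(x_new, x_prev, y_new))
    return log_w - np.logaddexp.reduce(log_w)   # normalize so the weights sum to one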

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are usually more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n}).

Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate it by

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
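
The approximate effective sample size is a one-liner once the weights are normalized; a quick sketch with illustrative example values:

import numpy as np

def effective_sample_size(w_normalized):
    # N_eff ~ 1 / sum_i w_i^2, where the w_i are normalized importance weights.
    w = np.asarray(w_normalized, dtype=float)
    return 1.0 / float(np.sum(w ** 2))

print(effective_sample_size(np.full(100, 0.01)))          # uniform weights -> N_eff = 100
print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # skewed weights  -> N_eff close to 1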

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw sample x_n^{(i)} ~ q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2
  if N_{eff} < P then
    Generate new particles x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for
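
Below is a compact Python sketch of this loop for a hypothetical scalar linear-Gaussian model (x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n with standard normal noises); the model, the use of the transition prior as proposal, and all parameter values are illustrative assumptions rather than part of the algorithm above.

import numpy as np

rng = np.random.default_rng(0)

def sis_filter_with_resampling(ys, n_particles=500, n_eff_threshold=250.0):
    # SIS particle filter with resampling for x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n, v, w ~ N(0, 1).
    # The proposal is the transition prior, q(x_n | x_{0:n-1}, y_n) = p(x_n | x_{n-1}).
    x = rng.normal(0.0, 1.0, n_particles)              # particles at time 0
    w = np.full(n_particles, 1.0 / n_particles)        # importance weights
    means = []
    for y in ys:
        x = 0.9 * x + rng.normal(0.0, 1.0, n_particles)    # draw x_n^(i) from the proposal
        # Weight update: the transition term cancels with the proposal, leaving the likelihood
        # (up to a constant factor that disappears in the normalization).
        w = w * np.exp(-0.5 * (y - x) ** 2)
        w = w / w.sum()                                     # normalize
        means.append(np.sum(w * x))                         # posterior mean, computed before resampling
        n_eff = 1.0 / np.sum(w ** 2)
        if n_eff < n_eff_threshold:                         # resample only when degeneracy sets in
            idx = rng.choice(n_particles, size=n_particles, replace=True, p=w)
            x = x[idx]
            w = np.full(n_particles, 1.0 / n_particles)     # reset the weights to 1/N
    return np.array(means)

# Simulate observations from the same model and run the filter.
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.9 * x_true + rng.normal()
    ys.append(x_true + rng.normal())
print(sis_filter_with_resampling(ys)[:5])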

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
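
For comparison with the SIS sketch above, a minimal bootstrap variant for the same hypothetical scalar model: the transition prior is the proposal, the weight reduces to the likelihood, and resampling is carried out at every step.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(ys, n_particles=500):
    # Bootstrap (SIR) filter for x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n, v, w ~ N(0, 1).
    x = rng.normal(0.0, 1.0, n_particles)                  # x_0^(i) ~ p(x_0)
    means = []
    for y in ys:
        x = 0.9 * x + rng.normal(0.0, 1.0, n_particles)    # importance sampling from the transition prior
        w = np.exp(-0.5 * (y - x) ** 2)                    # weight W_n^(i) proportional to p(y_n | x_n^(i))
        w = w / w.sum()                                    # normalize
        means.append(np.sum(w * x))
        x = x[rng.choice(n_particles, size=n_particles, replace=True, p=w)]   # resample every step
    return np.array(means)

Because resampling here is unconditional, the previous weights never need to be carried over: they are implicitly reset to 1/N after every step.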

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69





Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69


Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
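
As a small illustration (not taken from the slides), here is a Gibbs sampler for a bivariate Gaussian target with an assumed correlation rho = 0.8, where both full conditionals are known one-dimensional Gaussians.

    import numpy as np

    rng = np.random.default_rng(2)
    rho = 0.8                    # assumed correlation of the bivariate Gaussian target
    M = 5000                     # number of Gibbs iterations
    samples = np.zeros((M, 2))
    x1, x2 = 0.0, 0.0            # arbitrary starting state

    for n in range(M):
        # alternate the full conditionals: x1 | x2 ~ N(rho*x2, 1-rho^2), and vice versa
        x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
        samples[n] = (x1, x2)

    burned = samples[500:]                 # discard burn-in
    print(np.corrcoef(burned.T)[0, 1])     # should be close to rho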

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. the particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
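
A one-dimensional numerical sketch of the Laplace approximation; the integrand f(x) = x^4 e^{-x} is an assumed example, the mode is found numerically, and the second derivative of log f is taken by finite differences.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def f(x):                    # assumed integrand; its exact integral over (0, inf) is 4! = 24
        return x ** 4 * np.exp(-x)

    def neg_log_f(x):
        return -(4 * np.log(x) - x)

    # 1) find the maximum x_hat of f(x)
    x_hat = minimize_scalar(neg_log_f, bounds=(1e-6, 50.0), method="bounded").x

    # 2) d^2/dx^2 log f at x_hat by central finite differences (negative at a maximum)
    h = 1e-4
    d2 = (-neg_log_f(x_hat + h) + 2 * neg_log_f(x_hat) - neg_log_f(x_hat - h)) / h ** 2

    # 3) Laplace estimate of the volume: (2*pi)^(1/2) * f(x_hat) * |-d2|^(-1/2)
    laplace = np.sqrt(2 * np.pi) * f(x_hat) * abs(-d2) ** (-0.5)
    print(laplace, 24.0)         # the estimate is close to the exact value 24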


Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
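
A short sketch using Gauss-Hermite quadrature, one concrete instance of such a rule (assumed here), to approximate E[f(x)] under a Gaussian weighting function; hermgauss supplies the quadrature points x_k and coefficients c_k.

    import numpy as np
    from numpy.polynomial.hermite import hermgauss

    m = 20
    xk, ck = hermgauss(m)              # points and coefficients for the weight exp(-x^2)

    def f(x):
        return np.cos(x) ** 2          # assumed smooth nonlinear integrand

    # E[f(X)] for X ~ N(0, 1): change of variables x = sqrt(2) * t
    approx = np.sum(ck * f(np.sqrt(2.0) * xk)) / np.sqrt(np.pi)

    # Monte Carlo check of the same expectation
    rng = np.random.default_rng(3)
    mc = f(rng.standard_normal(200000)).mean()
    print(approx, mc)                  # both close to (1 + exp(-2)) / 2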

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
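
A sketch of a point-mass (grid-based) filter on a one-dimensional state; the random-walk transition, the Gaussian likelihood, the grid limits, and the observations are all assumptions made for the example.

    import numpy as np

    # grid of N_z discrete cell states covering the (assumed) region of interest
    grid = np.linspace(-10, 10, 401)

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # assumed model: x_n = x_{n-1} + w, w ~ N(0, 0.5^2);  y_n = x_n + v, v ~ N(0, 1)
    trans = gauss(grid[:, None], grid[None, :], 0.5)   # trans[i, j] = p(x_n = grid[i] | x_{n-1} = grid[j])
    trans /= trans.sum(axis=0, keepdims=True)          # normalize each column of point masses

    posterior = gauss(grid, 0.0, 5.0)                  # point masses representing p(x_0)
    posterior /= posterior.sum()

    for y in [1.2, 1.9, 2.5, 3.1]:                     # assumed measurements
        prior = trans @ posterior                      # prediction step on the grid
        posterior = prior * gauss(y, grid, 1.0)        # update with the likelihood p(y | x)
        posterior /= posterior.sum()                   # renormalize the point masses

    print("posterior mean:", np.sum(grid * posterior))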

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j \, \mathcal{N}(x; \mu_j, \Sigma_j)

where the weighting coefficients c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
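
A tiny sketch of evaluating a Gaussian sum (mixture) density in one dimension; the weights c_j, means, and standard deviations below are arbitrary assumed values.

    import numpy as np

    c = np.array([0.3, 0.5, 0.2])          # assumed weighting coefficients, c_j > 0, summing to 1
    mu = np.array([-2.0, 0.0, 3.0])        # assumed component means
    sigma = np.array([0.8, 1.0, 0.5])      # assumed component standard deviations

    def gaussian_sum_pdf(x):
        """Evaluate p(x) = sum_j c_j N(x; mu_j, sigma_j^2)."""
        x = np.atleast_1d(x)[:, None]      # shape (n, 1) against the (m,) components
        comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return comp @ c

    xs = np.linspace(-6, 6, 1201)
    dx = xs[1] - xs[0]
    p = gaussian_sum_pdf(xs)
    print("integrates to ~1:", p.sum() * dx)
    print("mixture mean:", np.sum(c * mu), "vs numerical:", np.sum(xs * p) * dx)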

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69


Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
            = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
            = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69


Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) \, p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing the i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
          = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
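
A quick numerical sketch of this self-normalized estimator for an assumed single-observation Gaussian model (prior N(0, 1), likelihood N(x, 0.5^2), proposal N(0, 2^2)); the normalizing constant p(y) is never needed since it cancels in the ratio.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 100000
    y = 1.5                         # assumed single observation

    def unnorm_post(x):             # p(y|x) p(x) up to a constant (the numerator of W_n)
        lik = np.exp(-0.5 * ((y - x) / 0.5) ** 2)
        prior = np.exp(-0.5 * x ** 2)
        return lik * prior

    x = 2.0 * rng.standard_normal(N)                                # samples from the proposal q
    q = np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))  # proposal density values
    W = unnorm_post(x) / q                                          # unnormalized weights W_n(x_n)
    W_norm = W / W.sum()                                            # normalized weights

    est = np.sum(W_norm * x)        # estimate of the posterior mean E[x_n | y]
    exact = y / (1.0 + 0.25)        # closed-form posterior mean for this conjugate model
    print(est, exact)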

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \, \frac{p(y_n|x_n) \, p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \, \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
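
A small helper for the effective sample size, shown as a hedged sketch; it assumes the weights passed in are already normalized to sum to one.

    import numpy as np

    def effective_sample_size(w_norm):
        """N_eff ~= 1 / sum_i w_i^2 for normalized importance weights."""
        w_norm = np.asarray(w_norm)
        return 1.0 / np.sum(w_norm ** 2)

    # even weights give N_eff = N; a single dominant weight gives N_eff close to 1
    print(effective_sample_size(np.full(100, 0.01)))           # -> 100.0
    print(effective_sample_size([0.97] + [0.03 / 99] * 99))    # -> about 1.06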

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weight W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}_n^{(i)})^2
    if N_{eff} < P then
        Generate new particles x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
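
A compact sketch of this SIS filter with resampling on an assumed linear-Gaussian toy model; the model, the observation-informed Gaussian proposal, and the threshold P are illustrative choices, not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(5)

    def norm_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # assumed model: x_n = 0.9 x_{n-1} + w_n, w ~ N(0, 0.5^2);  y_n = x_n + v_n, v ~ N(0, 1)
    T, N, P = 50, 500, 250
    x_true, y = np.zeros(T), np.zeros(T)
    for n in range(1, T):
        x_true[n] = 0.9 * x_true[n - 1] + 0.5 * rng.standard_normal()
        y[n] = x_true[n] + rng.standard_normal()

    particles = rng.standard_normal(N)      # x_0^(i) ~ p(x_0) = N(0, 1)
    W = np.full(N, 1.0 / N)
    est = np.zeros(T)

    for n in range(1, T):
        pred = 0.9 * particles
        # assumed proposal q(x_n | x_{n-1}, y_n): a Gaussian pulled halfway toward the observation
        q_mean = 0.5 * pred + 0.5 * y[n]
        x_new = q_mean + 0.7 * rng.standard_normal(N)
        # weight update: W_n = W_{n-1} p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
        W *= (norm_pdf(y[n], x_new, 1.0) * norm_pdf(x_new, pred, 0.5)
              / norm_pdf(x_new, q_mean, 0.7))
        W /= W.sum()
        particles = x_new
        est[n] = np.sum(W * particles)      # posterior mean estimate, computed before resampling
        if 1.0 / np.sum(W ** 2) < P:        # resample only when N_eff drops below the threshold P
            idx = rng.choice(N, size=N, replace=True, p=W)
            particles = particles[idx]
            W = np.full(N, 1.0 / N)

    print("RMSE:", np.sqrt(np.mean((est[1:] - x_true[1:]) ** 2)))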

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to the weights \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
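
A sketch of a single time step of this bootstrap/SIR filter under the same assumed toy model as in the sketch above: with the transition prior as the proposal, the weight update reduces to the likelihood, and resampling is performed at every step.

    import numpy as np

    rng = np.random.default_rng(6)
    N = 500
    particles = rng.standard_normal(N)            # x_0^(i) ~ p(x_0), with W_0^(i) = 1/N

    def bootstrap_step(particles, y_n):
        # importance sampling from the transition prior: x_n^(i) ~ p(x_n | x_{n-1}^(i))
        particles = 0.9 * particles + 0.5 * rng.standard_normal(N)
        # with q equal to the transition prior, the weight update reduces to p(y_n | x_n^(i))
        W = np.exp(-0.5 * (y_n - particles) ** 2)
        W /= W.sum()
        estimate = np.sum(W * particles)          # posterior mean, computed before resampling
        # resampling is performed at every time step in the bootstrap/SIR filter
        particles = particles[rng.choice(N, size=N, replace=True, p=W)]
        return particles, estimate

    particles, estimate = bootstrap_step(particles, y_n=1.0)   # y_n = 1.0 is an assumed observation
    print(estimate)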

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \, \frac{p(y_n|x_n^{(i)}) \, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking.   Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking.   Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking.   Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 50: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only carried out whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can also be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < α < 1 is the annealing factor that controls the impact of the previous importance weights.
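A small sketch of that annealed update (same hypothetical element-wise density arrays as in the earlier snippets; alpha = 0.5 is just an illustrative value):

    import numpy as np

    def annealed_weight_update(W_prev, lik, trans, prop, alpha=0.5):
        # raising the previous weights to the power alpha (0 < alpha < 1)
        # damps their influence and slows down weight degeneracy
        W = (W_prev ** alpha) * lik * trans / prop
        return W / W.sum()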


Popular CONDENSATION

Figure: the CONDENSATION algorithm in action (video: anim.mp4)

Figure: Head tracking. Figure: Leaf tracking. (videos: dancemv.mp4, leafmv.mp4)

Figure: Hand tracking. Figure: Hand drawing. (videos: hand.mp4, iccv98.mp4)

Figure: Hand tracking. Figure: Interactive applications. (videos: imp.mp4, drawing.mp4)

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.


Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Figure Importance sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 52: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Importance Sampling I

Sample the distribution in the region of importance in order to achievecomputational efficiency

This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small

The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint

f (x)p(x)dx =

intf (x)

p(x)

q(x)q(x)dx

Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate

f =1

N

Nsumi=1

W (x(i))f (x(i))

where W (x(i)) = p(x(i))q(x(i)) are called the importance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69

Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n \mid y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n \mid y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n \mid y_{0:n})

By this approximation we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n \mid y_{0:n})\,dx_n
          = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n
          = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
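As a quick illustration of this slide, here is a minimal NumPy sketch (not from the lecture) of the plain Monte Carlo estimate of E[f(x_n)] from posterior samples; how the samples were obtained is assumed given.

```python
import numpy as np

def mc_expectation(samples, f):
    # Plain Monte Carlo estimate (1/N) * sum_i f(x^(i)) from i.i.d. posterior samples.
    return np.mean([f(x) for x in samples], axis=0)

# Toy check with a known "posterior": for x ~ N(0, 1), E[x^2] should be close to 1.
rng = np.random.default_rng(0)
samples = rng.normal(size=10000)
print(mc_expectation(samples, lambda x: x ** 2))   # approximately 1.0
```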


Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n \mid y_{0:n}). Let's define

W_n(x_n) = \frac{p(y_{0:n} \mid x_n)\,p(x_n)}{q(x_n \mid y_{0:n})}

We can then write

E[f(x_n)] = \int f(x_n)\,\frac{p(x_n \mid y_{0:n})}{q(x_n \mid y_{0:n})}\,q(x_n \mid y_{0:n})\,dx_n

          = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{0:n})}\,q(x_n \mid y_{0:n})\,dx_n

          = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{0:n})\,dx_n}{\int p(y_{0:n} \mid x_n)\,p(x_n)\,dx_n}

          = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n \mid y_{0:n})\,dx_n}{\int W_n(x_n)\,q(x_n \mid y_{0:n})\,dx_n}

          = \frac{E_{q(x_n \mid y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{0:n})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69


Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n \mid y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n \mid y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n \mid y_{0:n}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
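To make the normalized-weight estimator above concrete, here is a small self-contained NumPy sketch (a toy example of my own, not from the lecture): the target p is N(1, 1), the proposal q is N(0, 2^2), and both densities are evaluated only up to constants, which cancel in the normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000

# Proposal q(x) = N(0, 2^2); target p(x) = N(1, 1).
x = rng.normal(0.0, 2.0, size=N)
log_p = -0.5 * (x - 1.0) ** 2          # log p(x) up to an additive constant
log_q = -0.5 * (x / 2.0) ** 2          # log q(x) up to an additive constant
w = np.exp(log_p - log_q)              # unnormalized importance weights W(x^(i))
w_tilde = w / w.sum()                  # normalized weights

f = lambda x: x                        # estimate E_p[x], which is 1 for this target
print(np.sum(w_tilde * f(x)))          # self-normalized estimate, close to 1.0
```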

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{0:n} \mid y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

p(x_{0:n} \mid y_{0:n}) = p(x_{0:n-1} \mid y_{0:n-1}) \frac{p(y_n \mid x_n)\,p(x_n \mid x_{n-1})}{p(y_n \mid y_{0:n-1})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{0:n}^{(i)} \mid y_{0:n})}{q(x_{0:n}^{(i)} \mid y_{0:n})} \propto \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)} \mid y_{0:n-1})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)} \mid y_{0:n-1})}

         = W_{n-1}^{(i)} \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n \mid y_{0:n}) than in p(x_{0:n} \mid y_{0:n}).

Let's assume that q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot \mid y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate it by

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
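A one-line NumPy version of the approximate effective sample size above (a sketch, assuming the weights are already normalized):

```python
import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~= 1 / sum_i (W~(x^(i)))^2, computed from normalized weights.
    w = np.asarray(w_tilde)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))                   # uniform weights -> 100
print(effective_sample_size(np.r_[0.97, np.full(99, 0.03 / 99)]))  # one dominant weight -> ~1.06
```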

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: W~(x^{(i)}) = W(x^{(i)}) / sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_eff = 1 / sum_{i=1}^{N} (W~(x_{0:n}^{(i)}))^2
  if N_eff < P then
    Generate new x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}} with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = W~_n^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
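Below is a minimal NumPy sketch of the loop above for a scalar state. The callables (sample_prior, sample_proposal, proposal_pdf, transition_pdf, likelihood) are hypothetical user-supplied functions, assumed vectorized over the N particles; the estimate is taken before resampling, as recommended in the remarks later on.

```python
import numpy as np

def sis_filter(ys, N, sample_prior, sample_proposal, proposal_pdf,
               transition_pdf, likelihood, ess_threshold=None, seed=0):
    # SIS particle filter with resampling (scalar state); returns the filtered means.
    rng = np.random.default_rng(seed)
    P = N / 2.0 if ess_threshold is None else ess_threshold
    x = sample_prior(rng, N)                      # x_0^(i)
    w = np.full(N, 1.0 / N)                       # W_0^(i)
    means = []
    for y in ys:
        x_prev = x
        x = sample_proposal(rng, x_prev, y)       # draw x_n^(i) ~ q(. | x_{n-1}^(i), y_n)
        # W_n = W_{n-1} * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
        w = w * likelihood(y, x) * transition_pdf(x, x_prev) / proposal_pdf(x, x_prev, y)
        w = w / w.sum()                           # normalize the weights
        means.append(np.sum(w * x))               # posterior mean, computed before resampling
        n_eff = 1.0 / np.sum(w ** 2)              # effective sample size
        if n_eff < P:                             # resample with replacement, reset weights
            idx = rng.choice(N, size=N, p=w)
            x = x[idx]
            w = np.full(N, 1.0 / N)
    return np.array(means)
```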

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: W~_n^{(i)} = W_n^{(i)} / sum_{j=1}^{N} W_n^{(j)}
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to the weights W~_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
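For concreteness, here is a self-contained NumPy sketch of the bootstrap/SIR filter on a toy 1-D linear-Gaussian model of my own choosing (the transition prior and likelihood below are assumptions, not from the lecture): particles are propagated through the transition prior, weighted by the likelihood p(y_n|x_n) only, and resampled at every step.

```python
import numpy as np

def bootstrap_filter(ys, N=1000, seed=0):
    # Toy model: x_n = 0.9 x_{n-1} + v_n, v_n ~ N(0,1); y_n = x_n + e_n, e_n ~ N(0, 0.5^2).
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=N)                  # x_0^(i) ~ p(x_0)
    means = []
    for y in ys:
        x = 0.9 * x + rng.normal(0.0, 1.0, size=N)    # propagate through the transition prior
        w = np.exp(-0.5 * ((y - x) / 0.5) ** 2)       # unnormalized weights = likelihood p(y|x)
        w = w / w.sum()                               # normalize
        means.append(np.sum(w * x))                   # filtered mean estimate
        x = x[rng.choice(N, size=N, p=w)]             # resample every step (SIR)
    return np.array(means)

# Simulate data from the same toy model and run the filter.
rng = np.random.default_rng(1)
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.9 * x_true + rng.normal(0.0, 1.0)
    ys.append(x_true + rng.normal(0.0, 0.5))
print(bootstrap_filter(ys)[:5])
```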

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can use the update

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n \mid x_n^{(i)})\,p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
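As a small illustration (a sketch, not the lecture's code), the annealed weight update from the last bullet can be written directly from already-evaluated per-particle density values:

```python
import numpy as np

def annealed_weight_update(w_prev, lik, trans, prop, alpha=0.5):
    # W_n = (W_{n-1})^alpha * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{0:n-1}, y_n),
    # with 0 < alpha < 1 damping the influence of the previous weights.
    w = (w_prev ** alpha) * lik * trans / prop
    return w / w.sum()   # return normalized weights
```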

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69


Importance Sampling II

If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant

To ensure that we importance weights are normalized

f =Nsum

i=1

W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN

i=1 W (x(i))

The variance of the estimate is given by

Var[f ] =1

NVar[f (x)W (x)] =

1

NVar[f (x)

p(x)

q(x)]

=1

N

int (f (x)p(x)

q(x)

)2

dxminus (E [f (x)])2

N

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance

The estimator is biased but consistent

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 54: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)

it can be used when encountering the difficulty to sample from the trueprobability distribution directly

The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers

If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions

In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights

Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)

if u gtp(x)

Cq(x)then

Repeat samplingend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$$

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) on the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
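As a concrete sketch (an assumption of this write-up, not stated in the slides): when the weighting function p(x) is a Gaussian N(mu, sigma^2), the quadrature points and coefficients can be taken from the Gauss-Hermite rule, available as numpy.polynomial.hermite.hermgauss.

import numpy as np

# Gauss-Hermite quadrature for E[f(x)] with x ~ N(mu, sigma^2) (toy example).
mu, sigma, m = 1.0, 0.5, 10
f = lambda x: np.sin(x) + x**2                 # assumed smooth nonlinear function

t, w = np.polynomial.hermite.hermgauss(m)      # nodes/weights for \int e^{-t^2} g(t) dt
x_k = mu + np.sqrt(2.0) * sigma * t            # change of variables x = mu + sqrt(2)*sigma*t
c_k = w / np.sqrt(np.pi)                       # normalized coefficients c_k, summing to 1

print(np.sum(c_k * f(x_k)))                    # quadrature estimate of E[f(x)]
print(np.sin(mu) * np.exp(-sigma**2 / 2) + mu**2 + sigma**2)   # closed form for this f, as a check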

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods provide a good and optimal way to update the filtered density p(x_n|y_{0:n}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to regions of high density.

The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
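A minimal sketch (my own, with an assumed scalar linear-Gaussian model) of one predict/update cycle of a point-mass filter on a regular grid; the filtered density is carried as values on grid cells.

import numpy as np

# One predict/update step of a point-mass (grid) filter for a scalar state (assumed toy model).
grid = np.linspace(-10.0, 10.0, 401)                      # N_z discrete cell states
dx = grid[1] - grid[0]
gauss = lambda z, m, s: np.exp(-0.5 * ((z - m) / s)**2) / (s * np.sqrt(2.0 * np.pi))

prior = gauss(grid, 0.0, 1.0)                             # p(x_{n-1} | y_{0:n-1}) on the grid

# predict: p(x_n | y_{0:n-1}) = sum_j p(x_n | x_{n-1} = cell_j) * prior_j * dx
trans = gauss(grid[:, None], 0.9 * grid[None, :], 1.0)    # assumed dynamics x_n = 0.9 x_{n-1} + noise
pred = trans @ prior * dx

# update with an observation y_n = x_n + noise (assumed), here y_n = 2.5
y_n = 2.5
post = gauss(y_n, grid, 0.5) * pred
post /= post.sum() * dx                                   # filtered density p(x_n | y_{0:n})
print(np.sum(grid * post) * dx)                           # posterior mean estimate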

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_X x^k\, p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)}\right)^k$$

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
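In code the sample-moment approximation is a one-liner (a sketch; the samples below are standard-normal draws, assumed only for illustration):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)                   # assumed samples from the "true" distribution
m = [np.mean(x**k) for k in range(1, 5)]       # sample estimates of the first four moments
print(m)                                       # for N(0,1): approximately [0, 1, 0, 3]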

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)$$

where the weighting coefficients satisfy c_j > 0 and sum to one, Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
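A minimal sketch (component weights, means, and variances are assumed for illustration) of evaluating a two-component Gaussian sum used as a stand-in for a bimodal posterior:

import numpy as np

# Gaussian sum (mixture) approximation of a bimodal 1-D density (assumed parameters).
c   = np.array([0.4, 0.6])                     # c_j > 0, summing to 1
mu  = np.array([-2.0, 1.5])
sig = np.array([0.7, 1.0])

def p(x):
    x = np.atleast_1d(x).astype(float)[:, None]
    comps = np.exp(-0.5 * ((x - mu) / sig)**2) / (sig * np.sqrt(2.0 * np.pi))
    return comps @ c                           # weighted sum of Gaussian densities

print(p(np.linspace(-6.0, 6.0, 5)))            # mixture density on a few grid points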

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta\!\left(x_n - x_n^{(i)}\right) \equiv \hat{p}(x_n|y_{0:n})$$

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{0:n}).

By this approximation, we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\!\left(x_n - x_n^{(i)}\right)dx_n = \frac{1}{N}\sum_{i=1}^{N} f\!\left(x_n^{(i)}\right) \equiv \hat{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{0:n}). Let's define

$$W_n(x_n) = \frac{p(y_{0:n}|x_n)\,p(x_n)}{q(x_n|y_{0:n})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\,q(x_n|y_{0:n})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{0:n})}\,q(x_n|y_{0:n})\,dx_n$$

$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}{\int p(y_{0:n}|x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{0:n})\,dx_n} = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
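A minimal sketch of this self-normalized estimate for a static one-dimensional problem (the prior, likelihood, proposal, and f below are all assumptions of the example; as in the slide, the posterior is only known up to the product likelihood times prior):

import numpy as np

# Self-normalized importance sampling estimate of E_p[f(x)] (toy 1-D example).
rng = np.random.default_rng(3)
N = 20_000

prior      = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)                   # p(x) = N(0, 1)
likelihood = lambda x: np.exp(-0.5 * (2.0 - x)**2)                                  # p(y|x), y = 2 assumed
q_pdf      = lambda x: np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2.0 * np.pi))   # proposal q = N(0, 4)
f          = lambda x: x                                                            # estimate the posterior mean

x = rng.normal(0.0, 2.0, N)                      # draw from the proposal q
W = likelihood(x) * prior(x) / q_pdf(x)          # unnormalized weights W_n(x)
W_tilde = W / W.sum()                            # normalized weights
print(np.sum(W_tilde * f(x)))                    # ~1.0 for this conjugate Gaussian example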

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{0:n}), we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$

As before, the posterior can be written as

$$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).

Let's assume that q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n); then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}\!\left[(W(x_{0:n}))^2\right]} \le N$$

In practice this cannot be computed, so we approximate it as

$$N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N}\left(\tilde{W}(x_{0:n}^{(i)})\right)^2}$$

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
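In code the approximation is just one line (a sketch; w_tilde is assumed to hold the normalized weights):

import numpy as np

def effective_sample_size(w_tilde):
    # w_tilde: normalized importance weights (non-negative, summing to 1)
    return 1.0 / np.sum(w_tilde**2)

# resample when effective_sample_size(w_tilde) < P; P = N/2 is a common (assumed) choice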

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw a sample x_n^{(i)} ~ q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate the weight W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
  end for
  Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{0:n}^{(i)}))^2
  if N_eff < P then
    Generate new particles x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = W̃_n^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
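A generic sketch of this algorithm in Python/NumPy (not from the slides). The model-dependent pieces — the proposal sampler and density, the likelihood, the transition density, and the initial sampler — are passed in as functions and are assumptions of the sketch; a scalar state and the common threshold P = N/2 are assumed for brevity.

import numpy as np

def sis_particle_filter(y, N, sample_x0, sample_q, q_pdf, lik, trans, P=None, seed=0):
    """SIS particle filter with resampling when N_eff < P (sketch, scalar state assumed).

    y         : sequence of observations y_0, ..., y_T
    sample_x0 : (N, rng) -> initial particles x_0^(i)
    sample_q  : (x_prev, y_n, rng) -> particles drawn from q(x_n | x_{n-1}, y_n)
    q_pdf     : (x_new, x_prev, y_n) -> proposal density values
    lik       : (y_n, x_new) -> p(y_n | x_n)
    trans     : (x_new, x_prev) -> p(x_n | x_{n-1})
    """
    rng = np.random.default_rng(seed)
    P = N / 2 if P is None else P
    x = sample_x0(N, rng)
    W = np.full(N, 1.0 / N)
    means = []
    for y_n in y:
        x_new = sample_q(x, y_n, rng)                                  # draw x_n^(i) ~ q
        W = W * lik(y_n, x_new) * trans(x_new, x) / q_pdf(x_new, x, y_n)
        W = W / W.sum()                                                # normalize the weights
        means.append(np.sum(W * x_new))                                # posterior mean before resampling
        x = x_new
        if 1.0 / np.sum(W**2) < P:                                     # N_eff below the threshold
            idx = rng.choice(N, size=N, p=W)                           # resample with replacement
            x, W = x[idx], np.full(N, 1.0 / N)                         # reset weights to 1/N
    return np.array(means)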

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: draw x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the current set according to W̃_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
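A concrete sketch of the bootstrap/SIR filter above (not from the slides): the transition prior is used as the proposal, so the weight update reduces to the likelihood. The scalar model x_n = 0.9 x_{n-1} + process noise, y_n = x_n + measurement noise is an assumption chosen so the result could be sanity-checked against a Kalman filter.

import numpy as np

# Bootstrap (SIR) particle filter on an assumed scalar linear-Gaussian model.
rng = np.random.default_rng(4)
T, N = 50, 1000

# simulate data from the assumed model
x_true = np.zeros(T); y = np.zeros(T)
for n in range(T):
    x_true[n] = 0.9 * (x_true[n - 1] if n else 0.0) + rng.normal(0.0, 1.0)
    y[n] = x_true[n] + rng.normal(0.0, 0.5)

x = rng.normal(0.0, 1.0, N)                       # x_0^(i) ~ p(x_0), weights 1/N
est = []
for n in range(T):
    x = 0.9 * x + rng.normal(0.0, 1.0, N)         # importance sampling from the transition prior
    W = np.exp(-0.5 * ((y[n] - x) / 0.5)**2)      # weight update W_n^(i) = p(y_n | x_n^(i))
    W /= W.sum()                                  # normalize
    est.append(np.sum(W * x))                     # filtered posterior-mean estimate
    x = x[rng.choice(N, size=N, p=W)]             # resample N new particles according to W
print(np.mean((np.array(est) - x_true)**2))       # mean squared tracking error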

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n^{(i)} = \left(W_{n-1}^{(i)}\right)^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Page 56: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

Figure Importance (left) and Rejection (right) sampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density

Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling

The acceptance probability for a random variable is inversely proportional tothe constant C

The choice of C is critical

if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low

If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then

p(x|y) =p(y|x)p(x)

p(y)le Cq(x)

p(y)equiv C primeq(x)

and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)

C

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an Nx-dimensional state vector x = [x_1, x_2, ..., x_{Nx}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to Nx do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i−1,n}, x_{i+1,n−1}, ..., x_{Nx,n−1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
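To make the alternating conditional sampling concrete, here is a small sketch (my own illustrative example, not from the lecture) of a Gibbs sampler for a bivariate Gaussian with correlation rho, where both conditionals are Gaussian and known in closed form:

import numpy as np

def gibbs_bivariate_gaussian(M=5000, rho=0.8, rng=None):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    Each conditional is Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2),
    and symmetrically for x2 | x1, so we alternate exact conditional draws."""
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = 0.0, 0.0                     # arbitrary starting state
    samples = np.empty((M, 2))
    sigma = np.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    for n in range(M):
        x1 = rng.normal(rho * x2, sigma)  # draw x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, sigma)  # draw x2 from p(x2 | x1)
        samples[n] = (x1, x2)
    return samples

# The empirical correlation of the chain should approach rho.
print(np.corrcoef(gibbs_bivariate_gaussian().T)[0, 1])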

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x1 is sampled from the conditional pdf p(x1 | x_{2,n−1}). Middle: a sample is drawn from the conditional pdf p(x2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x)dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume

∫ f(x)dx ≈ (2π)^{Nx/2} f(x̂) | −∇∇ log f(x̂) |^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
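As a sketch of the idea (my own example, using Gaussian quadrature with a constant weighting function on [−1, 1]), NumPy's Gauss-Legendre rule returns the quadrature points x_k and coefficients c_k directly:

import numpy as np

def gauss_legendre_integral(f, deg=10):
    """Approximate the integral of f over [-1, 1] by a weighted sum
    of c_k f(x_k) at the Gauss-Legendre quadrature points."""
    x, c = np.polynomial.legendre.leggauss(deg)  # nodes x_k, weights c_k
    return np.sum(c * f(x))

# Example: the integral of exp(x) over [-1, 1] is e - 1/e ~ 2.3504.
print(gauss_legendre_integral(np.exp))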

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n}).

If the state space is continuous, we can always discretize the state space into Nz discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to regions of high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = ∫_X x^k p(x) dx = (1/N) Σ_{i=1}^{N} |x^{(i)}|^k

where m_k denotes the k-th order moment and x^{(i)} are the samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
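A minimal sketch of this sample-moment estimate (my own example; the distribution and the moment orders are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)  # samples from the "true" distribution

# Empirical k-th order moments m_k = (1/N) sum_i |x^(i)|^k
for k in (1, 2, 4):
    print(k, np.mean(np.abs(x) ** k))
# For N(0, sigma^2): E[x^2] = sigma^2 = 4 and E[x^4] = 3 sigma^4 = 48.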

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = Σ_{j=1}^{m} c_j N(x̂_j, Σ_j)

where the weighting coefficients c_j > 0 and Σ_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
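A small sketch (my own illustrative example) of evaluating such a weighted sum of Gaussian densities in one dimension, with component means, variances and weights chosen arbitrarily:

import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gaussian_sum_pdf(x, weights, means, variances):
    """p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1."""
    return sum(c * gaussian_pdf(x, m, v)
               for c, m, v in zip(weights, means, variances))

# A bimodal density approximated by two Gaussian components.
x = np.linspace(-6, 6, 5)
print(gaussian_sum_pdf(x, weights=[0.3, 0.7], means=[-2.0, 1.5], variances=[0.5, 1.0]))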

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{0:n}) ≈ (1/N) Σ_{i=1}^{N} δ(x_n − x_n^{(i)}) ≡ p̂(x_n | y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{0:n}).

By this approximation we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{0:n}) dx_n

          = (1/N) Σ_{i=1}^{N} ∫ f(x_n) δ(x_n − x_n^{(i)}) dx_n

          = (1/N) Σ_{i=1}^{N} f(x_n^{(i)}) ≡ f̄_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
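A minimal sketch of this Monte Carlo estimate (my own example, assuming we can draw directly from the posterior, here stood in for by a known Gaussian):

import numpy as np

rng = np.random.default_rng(1)
# Pretend these are i.i.d. samples x_n^(i) from p(x_n | y_0:n).
samples = rng.normal(loc=2.0, scale=1.0, size=50_000)

def f(x):
    """Any nonlinear function of the state."""
    return np.sin(x) + x ** 2

f_hat = np.mean(f(samples))   # (1/N) sum_i f(x^(i)) approximates E[f(x_n)]
print(f_hat)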

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define

W_n(x_n) = p(y_{0:n} | x_n) p(x_n) / q(x_n | y_{0:n})

We can then write

E[f(x_n)] = ∫ f(x_n) [ p(x_n | y_{0:n}) / q(x_n | y_{0:n}) ] q(x_n | y_{0:n}) dx_n

          = ∫ f(x_n) [ W_n(x_n) / p(y_{0:n}) ] q(x_n | y_{0:n}) dx_n

          = ∫ f(x_n) W_n(x_n) q(x_n | y_{0:n}) dx_n / ∫ p(y_{0:n} | x_n) p(x_n) dx_n

          = ∫ f(x_n) W_n(x_n) q(x_n | y_{0:n}) dx_n / ∫ W_n(x_n) q(x_n | y_{0:n}) dx_n

          = E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n | y_{0:n})}[W_n(x_n)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n | y_{0:n})}[W_n(x_n)]

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}), we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)}) ] / [ (1/N) Σ_{i=1}^{N} W_n(x_n^{(i)}) ]

          = Σ_{i=1}^{N} W̃(x_n^{(i)}) f(x_n^{(i)}) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^{(i)}) = W_n(x_n^{(i)}) / Σ_{j=1}^{N} W_n(x_n^{(j)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
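A short sketch of this self-normalized importance sampling estimate (my own example: the target p is a standard Gaussian known only up to its unnormalized density, the proposal q is a wider Gaussian):

import numpy as np

rng = np.random.default_rng(2)
N = 100_000

# Proposal q(x) = N(0, 3^2); draw i.i.d. samples x^(i) ~ q.
x = rng.normal(0.0, 3.0, size=N)
q = np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

# Unnormalized target: p(x) proportional to exp(-x^2 / 2), normalizer unknown to the estimator.
p_unnorm = np.exp(-0.5 * x ** 2)

W = p_unnorm / q          # unnormalized importance weights W_n(x^(i))
W_tilde = W / W.sum()     # normalized weights

def f(x):
    return x ** 2         # E[f(x)] under the target N(0, 1) is 1

print(np.sum(W_tilde * f(x)))   # self-normalized estimate, close to 1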

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{0:n} | y_{0:n}) = q(x_0) Π_{t=1}^{n} q(x_t | x_{0:t−1}, y_{0:t})

As before the posterior can be written as

p(x_{0:n} | y_{0:n}) = p(x_{0:n−1} | y_{0:n−1}) p(y_n | x_n) p(x_n | x_{n−1}) / p(y_n | y_{0:n−1})

We can then create a recursive rule to update the weights

W_n^{(i)} = p(x_{0:n}^{(i)} | y_{0:n}) / q(x_{0:n}^{(i)} | y_{0:n})

          ∝ p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n−1}^{(i)}) p(x_{0:n−1}^{(i)} | y_{0:n−1}) / [ q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_{0:n}) q(x_{0:n−1}^{(i)} | y_{0:n−1}) ]

          = W_{n−1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n−1}^{(i)}) / q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_{0:n})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n}).

Let's assume that q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n−1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n−1}^{(i)}) / q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_eff = N / E_{q(·|y_{0:n})}[(W(x_{0:n}))^2] ≤ N

In practice this cannot be computed, so we approximate it by

N_eff ≈ 1 / Σ_{i=1}^{N} (W̃(x_{0:n}^{(i)}))^2

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
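A one-line sketch of this approximation (my own example), given an array of normalized weights:

import numpy as np

def effective_sample_size(w_tilde):
    """N_eff ~ 1 / sum_i w_i^2 for normalized importance weights w_tilde."""
    return 1.0 / np.sum(np.asarray(w_tilde) ** 2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))  # 4.0: perfectly even weights
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))  # ~1.06: severe degeneracy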

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} ~ q(x_n | x_{0:n−1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n−1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n−1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n−1}^{(i)}) / q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights W̃(x^{(i)}) = W(x^{(i)}) / Σ_{j=1}^{N} W(x^{(j)})
    end for
    Compute N_eff = 1 / Σ_{i=1}^{N} (W̃(x_{0:n}^{(i)}))^2
    if N_eff < P then
        Generate new x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = W̃_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^{(i)} ~ p(x_n | x_{n−1}^{(i)})
    end for
    Set x̂_{0:n}^{(i)} = {x_{0:n−1}^{(i)}, x̂_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x̂_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights W̃(x^{(i)}) = W(x^{(i)}) / Σ_{j=1}^{N} W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x̂_n^{(i)}} according to W̃_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
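A compact sketch of this bootstrap filter (my own toy example: a 1D random-walk state with Gaussian process and observation noise, using the transition prior as proposal and multinomial resampling at every step):

import numpy as np

def bootstrap_filter(y, N=1000, q_std=1.0, r_std=0.5, rng=None):
    """Bootstrap/SIR particle filter for x_n = x_{n-1} + q, y_n = x_n + r,
    with q ~ N(0, q_std^2) and r ~ N(0, r_std^2). Returns the filtered means."""
    rng = np.random.default_rng() if rng is None else rng
    particles = rng.normal(0.0, 1.0, size=N)        # samples from p(x_0)
    means = []
    for y_n in y:
        # Importance sampling from the transition prior p(x_n | x_{n-1}).
        particles = particles + rng.normal(0.0, q_std, size=N)
        # Weight update W_n^(i) = p(y_n | x_n^(i)), then normalization.
        w = np.exp(-0.5 * ((y_n - particles) / r_std) ** 2)
        w /= w.sum()
        # Posterior mean estimate, computed before resampling.
        means.append(np.sum(w * particles))
        # Resampling: draw N new particles according to the normalized weights.
        particles = particles[rng.choice(N, size=N, p=w)]
    return np.array(means)

# Simulate a short trajectory and filter it.
rng = np.random.default_rng(3)
x = np.cumsum(rng.normal(0.0, 1.0, size=50))
y = x + rng.normal(0.0, 0.5, size=50)
print(np.abs(bootstrap_filter(y, rng=rng) - x).mean())   # small average error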

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n−1}^{α} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n−1}^{(i)}) / q(x_n^{(i)} | x_{0:n−1}^{(i)}, y_n)

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]
[Video: leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]
[Video: iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]
[Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • Bootstrap/SIR filter
            • Condensation
              • Conclusions

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 58: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on rejection sampling

The draws obtained from rejection sampling are exact

The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable

It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
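A minimal Gibbs sampler sketch for a concrete target chosen here as an assumption (a zero-mean bivariate Gaussian with correlation rho, whose full conditionals are themselves Gaussian); the slides leave the target unspecified.

import numpy as np

rng = np.random.default_rng(2)

rho, M = 0.8, 10_000
x1, x2 = 0.0, 0.0                          # starting state x_0
samples = np.empty((M, 2))
for n in range(M):
    # alternately sample each component from its conditional given the other one
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))   # x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))   # x2 | x1
    samples[n] = (x1, x2)
print(np.corrcoef(samples[M // 10:].T)[0, 1])            # ≈ rho after discarding burn-in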

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: Starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: A sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: Four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | −∇∇ log f(x̂) |^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
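A one-dimensional sketch of the Laplace formula above, with an assumed toy integrand f(x) = exp(-x²/2 - 0.1 x⁴) whose maximum is at x̂ = 0; the Hessian of log f is obtained by finite differences.

import numpy as np

f = lambda x: np.exp(-0.5 * x ** 2 - 0.1 * x ** 4)   # assumed toy integrand, maximum at x_hat = 0

x_hat, h = 0.0, 1e-4
# second derivative (1-D Hessian) of log f at the maximum, by central differences
d2 = (np.log(f(x_hat + h)) - 2.0 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2
laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * np.abs(-d2) ** -0.5   # (2π)^{1/2} f(x̂) |-∇∇ log f(x̂)|^{-1/2}

xs = np.linspace(-10.0, 10.0, 200_001)               # brute-force reference value of the integral
riemann = np.sum(f(xs)) * (xs[1] - xs[0])
print(laplace, riemann)                              # the Laplace value slightly over-estimates here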

Iterative Quadrature

A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
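A quadrature sketch with an assumed concrete choice: p(x) is the standard normal density and the nodes/weights come from Gauss-Hermite quadrature (NumPy's probabilists' variant); the slides do not fix a particular rule.

import numpy as np

f = lambda x: np.sin(x) ** 2                      # assumed smooth integrand

# Gauss-Hermite (probabilists') rule: sum_k w_k f(x_k) ≈ ∫ f(x) exp(-x²/2) dx
nodes, weights = np.polynomial.hermite_e.hermegauss(20)
approx = np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi)   # divide so that p(x) is a proper density

print(approx)          # E[sin(X)^2] for X ~ N(0, 1) is (1 - e^{-2})/2 ≈ 0.4323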

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give a greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^{N} |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
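As a sketch, the sample-moment formula above amounts to a couple of lines; for illustration the samples are assumed to come from a standard normal.

import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)                        # x^(i) assumed drawn from the true density
m = [np.mean(np.abs(x) ** k) for k in (1, 2, 3, 4)]     # sample (absolute) moments m_k
print(m)   # for N(0,1): E|X| ≈ 0.798, E[X^2] = 1, E|X|^3 ≈ 1.596, E[X^4] = 3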

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = Σ_{j=1}^{m} c_j N(x̂_j, Σ_j)

where the weighting coefficients c_j > 0 and Σ_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
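A minimal Gaussian-sum (mixture) density sketch with assumed one-dimensional components; a Gaussian sum filter would attach one EKF to each such component.

import numpy as np

# assumed mixture parameters: c_j > 0 summing to 1, component means and standard deviations
c      = np.array([0.3, 0.5, 0.2])
means  = np.array([-2.0, 0.5, 3.0])
sigmas = np.array([0.7, 1.0, 0.5])

def p(x):
    """Evaluate the Gaussian-sum density p(x) = sum_j c_j N(x; mu_j, sigma_j^2)."""
    x = np.atleast_1d(x)
    comps = np.exp(-0.5 * ((x[:, None] - means) / sigmas) ** 2) / (sigmas * np.sqrt(2.0 * np.pi))
    return comps @ c

print(p(np.linspace(-4.0, 5.0, 10)))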

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) ≈ (1/N) Σ_{i=1}^{N} δ(x_n − x_n^{(i)}) ≡ p̂(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n|y_{n:0}) dx_n
          = (1/N) Σ_{i=1}^{N} ∫ f(x_n) δ(x_n − x_n^{(i)}) dx_n
          = (1/N) Σ_{i=1}^{N} f(x_n^{(i)}) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0}|x_n) p(x_n) / q(x_n|y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) [p(x_n|y_{n:0}) / q(x_n|y_{n:0})] q(x_n|y_{n:0}) dx_n
          = ∫ f(x_n) [W_n(x_n) / p(y_{n:0})] q(x_n|y_{n:0}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n / ∫ p(y_{n:0}|x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n|y_{n:0}) dx_n
          = E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{n:0})}[W_n(x_n)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{n:0})}[W_n(x_n)]

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)}) ] / [ (1/N) Σ_{i=1}^{N} W_n(x_n^{(i)}) ]
          = Σ_{i=1}^{N} W(x_n^{(i)}) f(x_n^{(i)}) ≡ f̂(x)

where the normalized weights are defined as

W(x_n^{(i)}) = W_n(x_n^{(i)}) / Σ_{i=1}^{N} W_n(x_n^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
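A self-normalized importance-sampling sketch of this estimator, with assumed densities (target ∝ N(1, 0.5²), proposal q = N(0, 2²)) and f(x) = x; the normalizing constants cancel, which is exactly why only unnormalized weights are needed.

import numpy as np

rng = np.random.default_rng(4)

N = 10_000
x = rng.normal(0.0, 2.0, size=N)                 # x^(i) ~ q = N(0, 2^2)
# unnormalized weights W_n(x^(i)): target over proposal, constant factors dropped
w = np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2) / np.exp(-0.5 * (x / 2.0) ** 2)
W = w / w.sum()                                  # normalized weights
print(np.sum(W * x))                             # ≈ 1.0, the mean of the assumed target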

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0}|y_{n:0}) = q(x_0) ∏_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) · p(y_n|x_n) p(x_n|x_{n-1}) / p(y_n|y_{n-1:0})

We can then create a recursive rule to update the weights

W_n^{(i)} = p(x_{n:0}^{(i)}|y_{n:0}) / q(x_{n:0}^{(i)}|y_{n:0})

          ∝ p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0}) / [ q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0}) ]

          = W_{n-1}^{(i)} · p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} · p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_eff = N / E_{q(·|y_{n:0})}[(W(x_{n:0}))^2] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / Σ_{i=1}^{N} (W(x_{n:0}^{(i)}))^2

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
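A short sketch of the N_eff approximation above, computed from normalized importance weights.

import numpy as np

def effective_sample_size(weights):
    """N_eff ≈ 1 / sum_i W_i^2, with the weights (re)normalized to sum to one."""
    W = np.asarray(weights, dtype=float)
    W = W / W.sum()
    return 1.0 / np.sum(W ** 2)

print(effective_sample_size(np.ones(100)))          # 100.0: perfectly even weights
print(effective_sample_size([1.0] + [1e-6] * 99))   # ≈ 1.0: severe weight degeneracy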

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights W(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^{N} W(x^{(i)})
    end for
    Compute N_eff = 1 / Σ_{i=1}^{N} (W(x_{n:0}^{(i)}))^2
    if N_eff < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W_{n:0}^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights W(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^{N} W(x^{(i)})
    end for
    Resampling: Generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to W_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
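A compact bootstrap/SIR filter sketch for an assumed one-dimensional linear-Gaussian state-space model (x_n = 0.9 x_{n-1} + process noise, y_n = x_n + measurement noise); the model is only for illustration, not from the slides. It follows the algorithm above: propagate with the transition prior, weight by the likelihood, resample.

import numpy as np

rng = np.random.default_rng(5)

# assumed toy model: x_n = 0.9 x_{n-1} + v_n,  y_n = x_n + w_n,  v ~ N(0, 0.5^2), w ~ N(0, 1)
T, N = 50, 1000
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + 0.5 * rng.standard_normal()
    y[n] = x_true[n] + rng.standard_normal()

particles = rng.normal(0.0, 1.0, size=N)          # x_0^(i) ~ p(x_0), assumed N(0, 1)
estimates = np.zeros(T)
for n in range(1, T):
    # importance sampling from the transition prior p(x_n | x_{n-1}^(i))
    particles = 0.9 * particles + 0.5 * rng.standard_normal(N)
    # weight update with the likelihood p(y_n | x_n^(i)), then normalization
    W = np.exp(-0.5 * (y[n] - particles) ** 2)
    W /= W.sum()
    estimates[n] = np.sum(W * particles)          # posterior-mean estimate, computed before resampling
    # resample N new particles with replacement according to W
    particles = particles[rng.choice(N, size=N, replace=True, p=W)]

print("RMSE of the filtered mean:", np.sqrt(np.mean((estimates - x_true) ** 2)))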

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n^{(i)} = (W_{n-1}^{(i)})^α · p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Embedded video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Embedded videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Embedded videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Embedded videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 59: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importancesampling

but it is usually difficult to find a good proposal distribution especially ina high-dimensional space

A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling

if the proposal distribution is chosen in a factorized form

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

then the importance sampling can be performed recursively

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 60: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Importance Sampling II

According to the telescope law of probability we have

p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)

q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)

The weights can be recursively calculated as

Wn(xn0) =p(xn0)

q(xn0)= Wnminus1(xn0)

p(xn|xnminus10)

qn(xn|xnminus10)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample from x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
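
A minimal sketch of the alternating conditional sampling, assuming a two-dimensional Gaussian target with correlation rho. This target is chosen only because its conditionals are available in closed form; it is not the lecture's example.

import numpy as np

rng = np.random.default_rng(1)
rho = 0.8                      # assumed correlation of the 2-D Gaussian target
M = 5000                       # number of Gibbs iterations

x = np.zeros((M + 1, 2))       # x[n] = (x1_n, x2_n); start from x_0 = (0, 0)
for n in range(1, M + 1):
    # draw x1_n ~ p(x1 | x2_{n-1}) = N(rho * x2_{n-1}, 1 - rho^2)
    x[n, 0] = rng.normal(rho * x[n - 1, 1], np.sqrt(1 - rho**2))
    # draw x2_n ~ p(x2 | x1_n)     = N(rho * x1_n,     1 - rho^2)
    x[n, 1] = rng.normal(rho * x[n, 0], np.sqrt(1 - rho**2))

samples = x[1000:]             # discard a burn-in period
print("sample correlation:", np.corrcoef(samples.T)[0, 1])   # close to rho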

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering problem can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
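
A minimal 1-D sketch of this Laplace approximation, assuming an unnormalized unimodal f (here exp(-(x-1)^2), chosen so the exact integral sqrt(pi) is known); the mode is found by a simple grid search and the second derivative of log f by finite differences.

import numpy as np

def f(x):                                  # assumed unimodal example function
    return np.exp(-(x - 1.0)**2)

# 1) locate the maximum x_hat of f (grid search keeps the sketch dependency-free)
grid = np.linspace(-10, 10, 20001)
x_hat = grid[np.argmax(f(grid))]

# 2) second derivative of log f at x_hat by central finite differences
h = 1e-4
log_f = lambda x: np.log(f(x))
d2 = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# 3) Laplace approximation: \int f(x) dx ≈ (2π)^{1/2} f(x_hat) |-d2|^{-1/2}
approx = np.sqrt(2 * np.pi) * f(x_hat) * np.abs(-d2)**-0.5
print("Laplace approximation:", approx, " exact:", np.sqrt(np.pi))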

Iterative Quadrature

A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
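
A small sketch of such a quadrature rule using numpy's Gauss-Legendre nodes and weights; the integrand f, the weighting function p, and the interval are assumptions chosen so the exact value is easy to check.

import numpy as np

a, b = 0.0, 1.0
f = lambda x: np.sin(x)
p = lambda x: np.ones_like(x)          # uniform weighting function on [a, b]

m = 10                                  # number of quadrature points
nodes, weights = np.polynomial.legendre.leggauss(m)   # rule on [-1, 1]
x_k = 0.5 * (b - a) * nodes + 0.5 * (a + b)           # map nodes to [a, b]
c_k = 0.5 * (b - a) * weights                          # rescale the weights

approx = np.sum(c_k * f(x_k) * p(x_k))                 # sum_k c_k f(x_k) p(x_k)
print("quadrature:", approx, " exact:", 1.0 - np.cos(1.0))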

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^m c_j \mathcal{N}(\hat{x}_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
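
A minimal sketch of a one-dimensional Gaussian sum with assumed (illustrative) weights, means, and variances; the mixture mean and variance follow from the component moments, which is what makes the representation tractable.

import numpy as np

c     = np.array([0.3, 0.5, 0.2])          # c_j > 0, summing to 1 (assumed)
mu    = np.array([-2.0, 0.0, 3.0])         # component means (assumed)
sigma = np.array([0.5, 1.0, 0.8])          # component std. deviations (assumed)

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))

def p_mix(x):
    x = np.atleast_1d(x)[:, None]          # broadcast points against components
    return np.sum(c * gaussian(x, mu, sigma), axis=1)

# first and second moments of the mixture from the component moments
mean = np.sum(c * mu)
var  = np.sum(c * (sigma**2 + mu**2)) - mean**2
print("p(0) =", p_mix(0.0), " mean =", mean, " var =", var)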

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen?

We have seen up to now:

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
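
A minimal sketch of this sample-mean estimator. The Gaussian used as a stand-in for p(x_n|y_{n:0}) and the nonlinear function f are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(2)

N = 10000
samples = rng.normal(1.0, 0.5, size=N)      # x_n^(i), i = 1..N, drawn i.i.d. from the stand-in posterior

f = lambda x: np.cos(x)                     # some nonlinear function of the state
f_hat = np.mean(f(samples))                 # (1/N) sum_i f(x_n^(i))
print("Monte Carlo estimate of E[f(x_n)]:", f_hat)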

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N W(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

W(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
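
A minimal sketch of this self-normalized importance sampling estimator. The unnormalized target (proportional to a Gaussian centered at 2) and the Gaussian proposal are assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(3)

p_tilde = lambda x: np.exp(-0.5 * (x - 2.0)**2)        # unnormalized target, ∝ N(2, 1)
q_pdf   = lambda x: np.exp(-0.5 * (x / 3.0)**2) / (3.0 * np.sqrt(2 * np.pi))  # proposal N(0, 3^2)

N = 5000
x = rng.normal(0.0, 3.0, size=N)       # x^(i) ~ q
W = p_tilde(x) / q_pdf(x)              # unnormalized weights W_n(x^(i))
W_norm = W / W.sum()                   # normalized weights

f = lambda x: x                        # estimate the posterior mean here
f_hat = np.sum(W_norm * f(x))          # sum_i W(x^(i)) f(x^(i))
print("self-normalized IS estimate:", f_hat)   # close to 2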

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
         \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
         = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it by

N_{eff} \approx \frac{1}{\sum_{i=1}^N (W(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
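
A small sketch of the N_eff approximation and the resampling decision. The weight vector (drawn here from an exponential distribution to make it skewed) and the threshold P = N/2 are assumptions for illustration.

import numpy as np

def effective_sample_size(w_norm):
    """N_eff ≈ 1 / sum_i w_i^2 for normalized weights w_i."""
    return 1.0 / np.sum(w_norm**2)

rng = np.random.default_rng(4)
N = 1000
w = rng.exponential(size=N)            # some skewed unnormalized weights (assumed)
w_norm = w / w.sum()

P = N / 2                              # assumed resampling threshold
n_eff = effective_sample_size(w_norm)
print("N_eff =", n_eff, "-> resample" if n_eff < P else "-> keep particles")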

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^N (W(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, where the probability of selecting each x_{n:0}^{(i)} is its normalized weight
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
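
A compact sketch of one step of this algorithm. The proposal, transition, and likelihood densities are passed in as user-supplied functions, so everything model-specific here is an assumption rather than the lecture's concrete model; the bootstrap filter of the next slides is the special case where the proposal is the transition prior and the weight update reduces to the likelihood.

import numpy as np

rng = np.random.default_rng(5)

def sis_step(particles, weights, y_n, propose, q_pdf, trans_pdf, lik_pdf, P):
    """One SIS update with resampling when N_eff drops below the threshold P.

    propose(x_prev, y_n)     -> new particles x_n^(i)
    q_pdf(x_n, x_prev, y_n)  -> proposal density values
    trans_pdf(x_n, x_prev)   -> p(x_n | x_{n-1})
    lik_pdf(y_n, x_n)        -> p(y_n | x_n)
    (all of these are user-supplied model assumptions)
    """
    N = len(particles)
    x_new = propose(particles, y_n)                                   # draw x_n^(i)
    weights = weights * lik_pdf(y_n, x_new) * trans_pdf(x_new, particles) \
                      / q_pdf(x_new, particles, y_n)                  # weight recursion
    weights = weights / weights.sum()                                 # normalize

    n_eff = 1.0 / np.sum(weights**2)                                  # effective sample size
    if n_eff < P:                                                     # resample if degenerate
        idx = rng.choice(N, size=N, replace=True, p=weights)
        x_new, weights = x_new[idx], np.full(N, 1.0 / N)
    return x_new, weights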

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: draw x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)}) and set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} by resampling with replacement from the current set according to the normalized weights W_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
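
A minimal end-to-end sketch of this bootstrap filter on an assumed one-dimensional nonlinear model (a common benchmark in the particle filtering literature, not the lecture's model): the proposal is the transition prior, the weights are the likelihood, and resampling is performed at every step. Following the remark on the next slide, the posterior mean is computed from the weighted particles before resampling.

import numpy as np

rng = np.random.default_rng(6)

# Assumed model:
#   x_n = 0.5 x_{n-1} + 25 x_{n-1} / (1 + x_{n-1}^2) + 8 cos(1.2 n) + v_n,  v_n ~ N(0, 10)
#   y_n = x_n^2 / 20 + w_n,                                                 w_n ~ N(0, 1)
T, N = 50, 500

def f_state(x, n):
    return 0.5 * x + 25 * x / (1 + x**2) + 8 * np.cos(1.2 * n)

# simulate a ground-truth trajectory and observations
x_true = np.zeros(T); y = np.zeros(T)
for n in range(1, T):
    x_true[n] = f_state(x_true[n - 1], n) + rng.normal(0, np.sqrt(10))
    y[n] = x_true[n]**2 / 20 + rng.normal(0, 1)

# bootstrap filter
particles = rng.normal(0, 1, size=N)              # x_0^(i) ~ p(x_0), assumed N(0, 1)
estimates = np.zeros(T)
for n in range(1, T):
    particles = f_state(particles, n) + rng.normal(0, np.sqrt(10), size=N)   # sample transition prior
    w = np.exp(-0.5 * (y[n] - particles**2 / 20)**2) + 1e-300                # p(y_n | x_n), guarded
    w = w / w.sum()
    estimates[n] = np.sum(w * particles)           # weighted posterior mean before resampling
    idx = rng.choice(N, size=N, replace=True, p=w)
    particles = particles[idx]                     # resample -> equal weights

print("RMSE:", np.sqrt(np.mean((estimates[1:] - x_true[1:])**2)))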

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure: CONDENSATION [video: anim.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking. [videos: dancemv.mp4, leafmv.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing. [videos: hand.mp4, iccv98.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications. [videos: imp.mp4, drawing.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 61: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesnt rely on the underlying Markov chain

Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency

The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate

The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero

We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 62: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf

The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights

Sample N random samples x(i)Ni=1 from q(x)

for i = 1 middot middot middot N do

W (i) prop p(x(i))

q(x(i))

end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResample with replacement N times from the discrete set x(i)N

i=1 where the probability of

resampling from each x(i) is proportional to W (x(i))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

It is a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
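
To make the alternating conditional sampling concrete, here is a small illustrative sketch (my own example, not from the lecture) of a Gibbs sampler for a two-dimensional Gaussian with correlation rho, where both full conditionals are available in closed form:

    import numpy as np

    def gibbs_bivariate_gaussian(rho, n_sweeps=5000, rng=None):
        """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])."""
        rng = np.random.default_rng() if rng is None else rng
        x1, x2 = 0.0, 0.0                      # arbitrary starting state
        cond_std = np.sqrt(1.0 - rho ** 2)     # std of each full conditional
        samples = np.empty((n_sweeps, 2))
        for n in range(n_sweeps):
            x1 = rng.normal(rho * x2, cond_std)    # x1 | x2 ~ N(rho*x2, 1 - rho^2)
            x2 = rng.normal(rho * x1, cond_std)    # x2 | x1 ~ N(rho*x1, 1 - rho^2)
            samples[n] = (x1, x2)
        return samples

    print(np.corrcoef(gibbs_bivariate_gaussian(0.8).T)[0, 1])   # close to 0.8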

Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x)dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
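
A rough one-dimensional numerical sketch of this idea (my own illustration; the mode and the second derivative of log f are found numerically):

    import numpy as np

    def laplace_integral_1d(f, grid, h=1e-4):
        """Approximate the integral of f over the real line by a Laplace (Gaussian) fit at the mode."""
        x_hat = grid[np.argmax(f(grid))]       # mode of f, located on a grid
        # second derivative of log f at the mode (negative at a maximum)
        d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2
        return np.sqrt(2 * np.pi) * f(x_hat) / np.sqrt(-d2)

    # Example: unnormalized Gaussian bump; the exact integral is sqrt(2*pi)*0.3 ~ 0.752
    f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2 / 0.3 ** 2)
    print(laplace_integral_1d(f, np.linspace(-2.0, 4.0, 20001)))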

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
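
For instance, with a Gaussian weighting function one can use Gauss-Hermite quadrature; a sketch using NumPy's hermgauss nodes, with the usual change of variables for a standard normal weight:

    import numpy as np
    from numpy.polynomial.hermite import hermgauss

    def gauss_hermite_expectation(f, m=20):
        """Approximate E[f(x)] for x ~ N(0, 1) with an m-point quadrature rule."""
        xk, ck = hermgauss(m)                       # nodes/weights for the weight exp(-x^2)
        return np.sum(ck * f(np.sqrt(2.0) * xk)) / np.sqrt(np.pi)

    print(gauss_hermite_expectation(np.cos))        # exact value is exp(-1/2) ~ 0.6065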

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
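
A tiny sketch of the empirical moment estimate above:

    import numpy as np

    def sample_moment(samples, k):
        """k-th order raw sample moment, (1/N) * sum_i (x^(i))^k."""
        x = np.asarray(samples, dtype=float)
        return np.mean(x ** k)

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)          # samples from the "true" distribution N(0, 1)
    print(sample_moment(x, 2))            # close to 1, the second moment of N(0, 1)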

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(\hat{x}_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
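
A minimal sketch of evaluating such a Gaussian-sum density, restricted to scalar components for brevity (the weights, means, and standard deviations below are just illustrative values):

    import numpy as np

    def gaussian_sum_pdf(x, c, mu, sigma):
        """p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1."""
        x = np.atleast_1d(x)[:, None]                      # shape (n, 1)
        comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return comp @ c                                    # weighted sum over components

    # a bimodal density built from two Gaussians
    c, mu, sigma = np.array([0.3, 0.7]), np.array([-2.0, 1.0]), np.array([0.5, 1.0])
    print(gaussian_sum_pdf([-2.0, 0.0, 1.0], c, mu, sigma))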

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
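
A quick sketch of this plain Monte Carlo estimate of E[f(x_n)], pretending for illustration that we can draw directly from the posterior (here taken to be a Gaussian):

    import numpy as np

    rng = np.random.default_rng(1)
    f = np.tanh                                  # some nonlinear function of the state

    # stand-in for N i.i.d. samples from p(x_n | y_{n:0})
    x = rng.normal(loc=0.5, scale=1.0, size=10_000)

    f_hat = np.mean(f(x))                        # (1/N) * sum_i f(x^(i))
    print(f_hat)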

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
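
A small sketch of this self-normalized importance-sampling estimator; purely for illustration, the target is a unit Gaussian, the proposal a wider Gaussian, and we estimate the second moment:

    import numpy as np

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(2)
    f = np.square                                  # estimate E[x^2] under the target p = N(0, 1)

    x = rng.normal(0.0, 2.0, size=50_000)          # draws from the proposal q = N(0, 4)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # unnormalized weights p/q
    w_tilde = w / w.sum()                          # normalized importance weights
    print(np.sum(w_tilde * f(x)))                  # close to 1, the true E[x^2]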

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
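
In code, one sweep of this weight recursion is just a pointwise update of the previous weights; a sketch, assuming the likelihood, transition, and proposal densities evaluated at the new particles are given as arrays:

    import numpy as np

    def update_weights(w_prev, lik, trans, prop):
        """W_n ~ W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_{n:0}), per particle."""
        w = np.asarray(w_prev) * np.asarray(lik) * np.asarray(trans) / np.asarray(prop)
        return w / w.sum()                     # return the normalized weights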

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it by

\hat{N}_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
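
A minimal helper for this degeneracy check, taking the normalized weights as input:

    import numpy as np

    def effective_sample_size(w_tilde):
        """N_eff ~ 1 / sum_i w_i^2 for normalized weights (which sum to 1)."""
        w = np.asarray(w_tilde, dtype=float)
        return 1.0 / np.sum(w ** 2)

    print(effective_sample_size(np.full(100, 0.01)))          # uniform weights -> 100
    print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # skewed weights -> about 1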

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Compute \hat{N}_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
    if \hat{N}_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
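
Putting the pieces together, here is an illustrative bootstrap (SIR) particle filter for a toy one-dimensional model of my own choosing (not from the lecture): x_n = 0.9 x_{n-1} + process noise, y_n = x_n + measurement noise.

    import numpy as np

    def bootstrap_filter(y, N=1000, a=0.9, q_std=1.0, r_std=0.5, rng=None):
        """Bootstrap/SIR particle filter for x_n = a*x_{n-1} + v_n, y_n = x_n + e_n."""
        rng = np.random.default_rng() if rng is None else rng
        x = rng.normal(0.0, 1.0, size=N)                         # samples from p(x_0)
        estimates = []
        for yn in y:
            x = a * x + rng.normal(0.0, q_std, size=N)           # propagate through the transition prior
            w = np.exp(-0.5 * ((yn - x) / r_std) ** 2)           # weight by the likelihood p(y_n|x_n)
            w /= w.sum()                                         # normalize
            estimates.append(np.sum(w * x))                      # posterior mean, computed before resampling
            x = x[rng.choice(N, size=N, p=w)]                    # resample with replacement
        return np.array(estimates)

    # quick demo on simulated data
    rng = np.random.default_rng(0)
    x_true = np.zeros(50)
    for n in range(1, 50):
        x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, 1.0)
    y = x_true + rng.normal(0.0, 0.5, size=50)
    print(bootstrap_filter(y, rng=np.random.default_rng(1))[:5])

Note that the state estimate is computed before the resampling step, in line with the remark on the next slide that the posterior estimate should normally be calculated before resampling.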

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 63: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 64: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
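For instance, with a Gaussian weighting function the quadrature points and coefficients are those of Gauss-Hermite quadrature. A short sketch (ours) using numpy:

import numpy as np

# Approximate E[f(x)] for x ~ N(mu, sigma^2) with Gauss-Hermite quadrature:
# the weighting function exp(-t^2) fixes the points t_k and coefficients c_k.
f = lambda x: np.sin(x) ** 2
mu, sigma, m = 0.5, 1.2, 20

t, c = np.polynomial.hermite.hermgauss(m)     # quadrature points and weights
x = mu + np.sqrt(2.0) * sigma * t             # change of variables to N(mu, sigma^2)
approx = (c * f(x)).sum() / np.sqrt(np.pi)

mc = f(np.random.default_rng(2).normal(mu, sigma, 200000)).mean()   # Monte Carlo check
print(approx, mc)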

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give a greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
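A minimal sketch (our own, with an invented one-dimensional random-walk model) of a grid-based update of the filtered density on a fixed rectangular grid:

import numpy as np

# 1-D grid-based Bayes filter: x_n = x_{n-1} + process noise, y_n = x_n + obs noise.
grid = np.linspace(-10, 10, 401)                  # fixed cell centers
post = np.exp(-0.5 * grid**2)                     # prior point masses
post /= post.sum()

def gauss(z, s):
    return np.exp(-0.5 * (z / s) ** 2)

def grid_filter_step(post, y, q=0.5, r=1.0):
    # predict: push the point masses through the transition density (discrete Chapman-Kolmogorov)
    trans = gauss(grid[:, None] - grid[None, :], q)     # trans[i, j] ~ p(x_i | x_j)
    pred = trans @ post
    # update: multiply by the likelihood of the new observation and renormalize
    new_post = pred * gauss(y - grid, r)
    return new_post / new_post.sum()

for y in [0.3, 0.8, 1.4]:                         # a made-up observation sequence
    post = grid_filter_step(post, y)
print((grid * post).sum())                        # posterior mean estimate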

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_{\mathcal{X}} x^k\, p(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} |x^{(i)}|^k$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are the samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
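In code this is just a sample average of powers of the draws (a trivial sketch, ours):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, 100000)            # samples standing in for the true distribution
for k in (1, 2, 3):
    print(k, np.mean(np.abs(x) ** k))       # empirical k-th order moment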

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

$$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}\big(\hat{x}_f^{(j)}, \Sigma_f^{(j)}\big)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.
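As a sketch (ours), evaluating a one-dimensional Gaussian mixture of this form and checking that it integrates to one:

import numpy as np

# A 3-component Gaussian mixture p(x) = sum_j c_j N(x; mu_j, sigma_j^2).
c = np.array([0.5, 0.3, 0.2])                  # weights: positive and summing to one
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.8, 1.5, 0.5])

def mixture_pdf(x):
    comp = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ c

xs = np.linspace(-10.0, 10.0, 2001)
print(mixture_pdf(xs).sum() * (xs[1] - xs[0]))  # ~1.0: a valid density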

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution

$$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. draws from $p(x_n|y_{n:0})$.

By this approximation we can estimate the mean of a nonlinear function

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
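A short sketch (ours) of this sample approximation for the mean of a nonlinear function, using draws from a known density in place of the posterior:

import numpy as np

# Approximate E[f(x)] by an average over N i.i.d. samples x^(i).
f = lambda x: np.cos(x) ** 2
rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 10000)         # stand-in for samples from p(x_n | y_{n:0})
f_hat = f(x).mean()                     # (1/N) * sum_i f(x^(i))
print(f_hat)                            # exact value is (1 + exp(-2)) / 2 ~= 0.5677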

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$$

$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} W(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$W(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}$$
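A compact sketch (ours) of this self-normalized importance sampling estimate, with a Gaussian proposal and an invented target density known only up to a constant:

import numpy as np

# Self-normalized importance sampling: estimate E_p[f(x)] with samples from q.
rng = np.random.default_rng(5)
unnorm_p = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)       # target, unnormalized (N(1,1) up to a constant)
q_pdf = lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))   # proposal N(0, 2^2)
f = lambda x: x

x = rng.normal(0.0, 2.0, 100000)        # draws from the proposal
W = unnorm_p(x) / q_pdf(x)              # unnormalized weights W_n(x^(i))
W_norm = W / W.sum()                    # normalized weights
print((W_norm * f(x)).sum())            # ~1.0, the mean of the target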

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate

$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(W(x_{n:0}^{(i)})\big)^2}$$

When $N_{\text{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
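A tiny sketch (ours) of the effective-sample-size diagnostic computed from normalized weights:

import numpy as np

def effective_sample_size(W_norm):
    # W_norm: normalized importance weights, summing to 1
    return 1.0 / np.sum(W_norm ** 2)

W = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
print(effective_sample_size(W))   # ~1.9 out of 5 particles: heavily degenerate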

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $W(x^{(i)}) = W(x^{(i)}) \big/ \sum_{i=1}^{N} W(x^{(i)})$
  end for
  Compute $N_{\text{eff}} = 1 \big/ \sum_{i=1}^{N} (W(x_{n:0}^{(i)}))^2$
  if $N_{\text{eff}} < P$ then
    Generate new $x_{n:0}^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$ with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W_{n:0}^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize weights: $W(x^{(i)}) = W(x^{(i)}) \big/ \sum_{i=1}^{N} W(x^{(i)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $W_n^{(i)}$
end for
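The algorithm above maps almost line for line onto code. Below is a minimal sketch of a bootstrap/SIR filter in Python (our own illustration on an invented scalar model, not one from the lecture): the transition prior is used as the proposal, so the weight is just the likelihood, and multinomial resampling is applied at every step.

import numpy as np

rng = np.random.default_rng(6)

# Invented scalar model: x_n = 0.9*x_{n-1} + 2*cos(x_{n-1}) + process noise (std 1),
#                        y_n = x_n + observation noise (std 1).
def transition(x):
    return 0.9 * x + 2.0 * np.cos(x) + rng.normal(0.0, 1.0, x.shape)

def likelihood(y, x, r=1.0):
    return np.exp(-0.5 * ((y - x) / r) ** 2)

def bootstrap_filter(ys, N=1000):
    particles = rng.normal(0.0, 2.0, N)         # x_0^(i) ~ p(x_0)
    means = []
    for y in ys:
        particles = transition(particles)       # importance sampling from the transition prior
        W = likelihood(y, particles)            # weight update: W_n^(i) = p(y_n | x_n^(i))
        W /= W.sum()                            # normalize the weights
        means.append(np.sum(W * particles))     # posterior mean estimate (before resampling)
        idx = rng.choice(N, size=N, p=W)        # multinomial resampling according to W_n^(i)
        particles = particles[idx]
    return np.array(means)

# Simulate a short trajectory from the same model and filter it.
T, x = 30, 0.0
xs, ys = [], []
for _ in range(T):
    x = float(transition(np.array([x]))[0])
    xs.append(x)
    ys.append(x + rng.normal(0.0, 1.0))
print(np.mean(np.abs(bootstrap_filter(np.array(ys)) - np.array(xs))))  # mean absolute tracking error

Computing the posterior mean before resampling follows the remark later in the lecture that estimates should be calculated before the resampling step, since resampling only adds Monte Carlo variation.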

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n^{(i)} = \big(W_{n-1}^{(i)}\big)^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure: CONDENSATION.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Embedded video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Embedded videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Embedded videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Embedded videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
  • Other numerical approximations
    • Laplace approximation
    • Iterative Quadrature
    • Multi-grid approximation
    • Moment approximation
    • Gaussian sum approximation
  • Particle filters
    • Introduction
    • SIR filter
    • Bootstrap/SIR filter
    • Condensation
  • Conclusions
Page 65: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 66: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only taken if regarded necessary

Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold

The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)

Resampling is important because

if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation

There are many types of resampling methods

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69

Gibbs sampling

Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling

Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model

Draw a sample from x0 sim p(x0)for n = 1 to M do

for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)

end forend for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
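A minimal sketch of this self-normalized importance-sampling estimator (the Gaussian target and proposal below are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N = 50000
x = rng.normal(1.0, 2.0, N)                            # samples from the proposal q = N(1, 2^2)
W = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 1.0, 2.0)    # unnormalized weights for the target p = N(0, 1)
W_tilde = W / W.sum()                                  # normalized weights

f = lambda z: z ** 2
print(np.sum(W_tilde * f(x)))                          # approximates E_p[x^2] = 1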

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})} \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0})} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}
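In code, this recursion is a single multiplicative update per particle. The sketch below is a generic helper (the function name and toy numbers are mine, not from the slides) that the filters later in the lecture effectively apply at every time step.

import numpy as np

def weight_update(w_prev, likelihood, transition, proposal):
    # W_n = W_{n-1} * p(y_n | x_n) * p(x_n | x_{n-1}) / q(x_n | x_{n-1:0}, y_{n:0}), per particle;
    # normalization (dividing by the sum over particles) is done as a separate step
    return w_prev * likelihood * transition / proposal

# toy usage with 3 particles and made-up density values evaluated at the sampled states
w = weight_update(np.array([0.3, 0.3, 0.4]),
                  likelihood=np.array([0.8, 0.1, 0.5]),
                  transition=np.array([0.2, 0.6, 0.3]),
                  proposal=np.array([0.2, 0.6, 0.3]))
print(w / w.sum())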

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed exactly, so we approximate it by

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
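A quick sketch of this diagnostic on normalized weights (the example weight vectors are made up for illustration):

import numpy as np

def effective_sample_size(w_tilde):
    # w_tilde: normalized importance weights (non-negative, summing to 1)
    return 1.0 / np.sum(w_tilde ** 2)

print(effective_sample_size(np.full(100, 0.01)))                    # uniform weights: N_eff = 100
print(effective_sample_size(np.array([0.97] + [0.03 / 99] * 99)))   # degenerate weights: N_eff close to 1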

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
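Below is a minimal runnable sketch of this algorithm for an assumed toy model (1-D Gaussian random-walk state, Gaussian observation noise, transition prior used as the proposal so the incremental weight reduces to the likelihood); the model, the threshold P = N/2, and all parameter values are illustrative choices, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)

# assumed toy model: x_n = x_{n-1} + N(0, q_std^2), y_n = x_n + N(0, r_std^2)
T, N = 50, 500
q_std, r_std = 1.0, 0.5

x_true = np.cumsum(rng.normal(0.0, q_std, T))        # simulated state trajectory
y = x_true + rng.normal(0.0, r_std, T)               # simulated observations

def gauss_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

particles = rng.normal(0.0, 1.0, N)                  # x_0^{(i)} ~ p(x_0), assumed N(0, 1)
weights = np.full(N, 1.0 / N)
estimates = []

for n in range(T):
    particles = particles + rng.normal(0.0, q_std, N)       # propagate through the transition prior
    weights = weights * gauss_pdf(y[n], particles, r_std)   # incremental weight = likelihood
    weights = weights / weights.sum()                       # normalize
    estimates.append(np.sum(weights * particles))           # filtered posterior-mean estimate

    if 1.0 / np.sum(weights ** 2) < N / 2:                  # resample when N_eff drops below P = N/2
        idx = rng.choice(N, size=N, p=weights)              # multinomial resampling with replacement
        particles = particles[idx]
        weights = np.full(N, 1.0 / N)

print("RMSE of the filtered estimates:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))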

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing particles that have high importance weights with many replicates, thereby introducing high correlation between the particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize weights: \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
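The resampling step itself can be implemented in several ways; multinomial resampling was sketched earlier, and the snippet below shows systematic resampling, a common lower-variance alternative (an implementation choice of mine, not prescribed by the slides):

import numpy as np

def systematic_resample(w_tilde, rng):
    # w_tilde: normalized weights; returns indices of the particles that survive resampling
    N = len(w_tilde)
    positions = (rng.random() + np.arange(N)) / N          # N evenly spaced points with one random offset
    return np.searchsorted(np.cumsum(w_tilde), positions)  # inverse-CDF lookup into the weight cumsum

rng = np.random.default_rng(0)
idx = systematic_resample(np.array([0.1, 0.2, 0.3, 0.4]), rng)
print(idx)    # afterwards: particles = particles[idx]; weights are reset to 1/N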

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
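As a sketch, this annealed update only changes one factor relative to the plain weight recursion (the α value and numbers below are arbitrary illustrations):

import numpy as np

def annealed_weight_update(w_prev, likelihood, transition, proposal, alpha=0.5):
    # raising the previous weight to 0 < alpha < 1 damps its influence on the new weight
    return (w_prev ** alpha) * likelihood * transition / proposal

print(annealed_weight_update(np.array([1e-6, 0.9]), 0.5, 0.2, 0.2))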

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]
[Video: leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]
[Video: iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]
[Video: drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 68: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Illustration of Gibbs sampling

Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 69: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Other sampling strategies

Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space

Stratified sampling works very well and is efficient in a not-too-highdimension space

Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior

This is good since the gradient direction might indicate the way to findthe state with a higher probability

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
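A minimal sketch of such a weighted-sum rule, using Gauss-Legendre quadrature (uniform weighting function) from NumPy; the integrand and the interval [a, b] are assumptions made for illustration.

```python
import numpy as np

# Gauss-Legendre quadrature points x_k and weights c_k on [-1, 1] (m = 10).
m = 10
x_k, c_k = np.polynomial.legendre.leggauss(m)

# Map points and weights from [-1, 1] to a generic interval [a, b].
a, b = 0.0, 2.0
nodes = 0.5 * (b - a) * x_k + 0.5 * (b + a)
weights = 0.5 * (b - a) * c_k

# Approximate the finite integral of a smooth nonlinear function
# (f(x) = exp(-x) * cos(x), chosen only for illustration).
f = lambda x: np.exp(-x) * np.cos(x)
approx = np.sum(weights * f(nodes))
print(f"quadrature estimate: {approx:.8f}")
```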

Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
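A small point-mass (grid) filtering sketch for a 1-D state. The linear-Gaussian toy model, the grid range and the number of cells are all assumptions chosen just to show the predict/update bookkeeping on a grid.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: x_n = 0.9 x_{n-1} + N(0, 1),  y_n = x_n + N(0, 0.5^2).
grid = np.linspace(-10.0, 10.0, 401)          # N_z discrete cell states
p = norm.pdf(grid, 0.0, 1.0)                  # point masses for p(x_0)
p /= p.sum()

def grid_filter_step(p, y):
    # Prediction: p(x_n | y_{0:n-1})_i = sum_j p(grid_i | grid_j) * p_j
    trans = norm.pdf(grid[:, None], loc=0.9 * grid[None, :], scale=1.0)
    p_pred = trans @ p
    p_pred /= p_pred.sum()
    # Update: multiply by the likelihood p(y_n | x_n) and renormalize.
    p_post = p_pred * norm.pdf(y, loc=grid, scale=0.5)
    return p_post / p_post.sum()

p = grid_filter_step(p, y=1.3)                # one filtering step, observation y_1 = 1.3
print("posterior mean ~", np.sum(grid * p))
```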

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_X x^k\, p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} \big(x^{(i)}\big)^k$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
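A quick sketch of sample-moment estimation; the Gamma distribution below is an illustrative stand-in for the "true" distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples x^(i) from the "true" distribution (a Gamma, as an illustrative stand-in).
x = rng.gamma(shape=2.0, scale=1.5, size=100_000)

def sample_moment(x, k):
    # Empirical k-th order moment: (1/N) * sum_i (x^(i))^k.
    return np.mean(x ** k)

for k in (1, 2, 3):
    print(f"m_{k} ~ {sample_moment(x, k):.3f}")
```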

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
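A minimal sketch of a 1-D Gaussian sum: the weights, means and variances below are arbitrary illustrative values, and the overall mean/variance are computed from the component moments.

```python
import numpy as np
from scipy.stats import norm

# Gaussian sum: p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with illustrative parameters.
c = np.array([0.3, 0.5, 0.2])        # c_j > 0, sum_j c_j = 1
mu = np.array([-2.0, 0.5, 3.0])      # component means
sigma = np.array([0.6, 1.0, 0.8])    # component standard deviations

def gaussian_sum_pdf(x):
    # Evaluate the mixture density at the points in x.
    x = np.atleast_1d(x)
    return np.sum(c[:, None] * norm.pdf(x[None, :], mu[:, None], sigma[:, None]), axis=0)

# First and second order moments follow from the individual components.
mean = np.sum(c * mu)
var = np.sum(c * (sigma**2 + mu**2)) - mean**2
print(gaussian_sum_pdf(0.0), mean, var)
```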

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$$p(x_n|y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{0:n})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{0:n})$.

By this approximation, we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{0:n})\, dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\, \delta\big(x_n - x_n^{(i)}\big)\, dx_n = \frac{1}{N} \sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
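A minimal sketch of this sample-mean estimator; the Gaussian "posterior" and the nonlinear f are stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are N i.i.d. samples x_n^(i) from the posterior p(x_n | y_{0:n}).
N = 10_000
samples = rng.normal(loc=1.0, scale=0.5, size=N)

# Monte Carlo estimate of E[f(x_n)] for a nonlinear f: (1/N) * sum_i f(x^(i)).
f = lambda x: np.sin(x) ** 2
f_hat = np.mean(f(samples))
print(f"E[f(x)] ~ {f_hat:.4f}")
```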

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define

$$W_n(x_n) = \frac{p(y_{0:n}|x_n)\, p(x_n)}{q(x_n|y_{0:n})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\, \frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\, q(x_n|y_{0:n})\, dx_n$$

$$= \int f(x_n)\, \frac{W_n(x_n)}{p(y_{0:n})}\, q(x_n|y_{0:n})\, dx_n$$

$$= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}{\int p(y_{0:n}|x_n)\, p(x_n)\, dx_n}$$

$$= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}{\int W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}$$

$$= \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big)}{\frac{1}{N} \sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^{N} \tilde{W}\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^{N} W_n\big(x_n^{(j)}\big)}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
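A self-contained sketch of this self-normalized importance-sampling estimate; the prior, likelihood, proposal and observed value are illustrative assumptions, not quantities from the lecture.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Assumed toy target: unnormalized posterior p(y|x) p(x), with prior N(0, 1),
# likelihood N(y; x, 0.5^2) and observed y = 2.0; the proposal q is a broad Gaussian.
target_unnorm = lambda x: norm.pdf(x, 0.0, 1.0) * norm.pdf(2.0, x, 0.5)
proposal = norm(loc=0.0, scale=3.0)

N = 50_000
x = proposal.rvs(size=N, random_state=rng)

# Unnormalized weights W(x^(i)) and their normalized version W~(x^(i)).
W = target_unnorm(x) / proposal.pdf(x)
W_tilde = W / np.sum(W)

# Self-normalized estimate of E[f(x)] under the target.
f = lambda x: x**2
print(f"E[f(x)] ~ {np.sum(W_tilde * f(x)):.4f}")
```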

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

$$q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$

As before, the posterior can be written as

$$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\, \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p\big(x_{0:n}^{(i)}|y_{0:n}\big)}{q\big(x_{0:n}^{(i)}|y_{0:n}\big)} \propto \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)\, p\big(x_{0:n-1}^{(i)}|y_{0:n-1}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}\big)\, q\big(x_{0:n-1}^{(i)}|y_{0:n-1}\big)} = W_{n-1}^{(i)}\, \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}\big)}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
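A tiny sketch of this recursive weight update as a function of per-particle density values; the argument names and calling pattern are assumptions for illustration only.

```python
import numpy as np

def update_weights(W_prev, lik, trans, prop):
    """Recursive importance-weight update:
        W_n^(i) proportional to W_{n-1}^(i) * p(y_n|x_n^(i)) * p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|...).
    lik, trans and prop are arrays of the corresponding densities evaluated at the
    proposed particles; the result is returned already normalized."""
    W = W_prev * lik * trans / prop
    return W / np.sum(W)
```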

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$.

Let's assume that $q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}\big) = q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n\big)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n\big)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}\big[(W(x_{0:n}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it by

$$N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{0:n}^{(i)})\big)^2}$$

When $N_{\mathrm{eff}}$ is below a threshold $P$, then resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
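A short sketch of the approximate effective-sample-size computation; the example weight vector is made up to show how skewed weights drive N_eff down.

```python
import numpy as np

def effective_sample_size(weights):
    # N_eff ~ 1 / sum_i (W_tilde_i)^2, computed from (possibly unnormalized) weights.
    w = np.asarray(weights, dtype=float)
    w_tilde = w / np.sum(w)
    return 1.0 / np.sum(w_tilde ** 2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # heavily skewed weights
print(effective_sample_size(w))             # ~1.94, far below N = 5
```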

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^(i) ~ q(x_n | x_{0:n-1}^(i), y_{0:n})
        Set x_{0:n}^(i) = {x_{0:n-1}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^(i) = W_{n-1}^(i) * p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{0:n-1}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W~(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
    end for
    Compute N_eff = 1 / sum_{i=1}^N (W~(x_{0:n}^(i)))^2
    if N_eff < P then
        Generate new x_{0:n}^(j) by resampling with replacement N times from {x_{0:n}^(i)}, with probability P(x_{0:n}^(j) = x_{0:n}^(i)) = W~(x_{0:n}^(i))
        Reset the weights: W_n^(i) = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
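A sketch of the resampling branch of the algorithm above in NumPy, using multinomial resampling; the function signature and the use of a NumPy Generator are implementation choices, not prescribed by the lecture.

```python
import numpy as np

def resample_if_needed(particles, weights, threshold, rng):
    """Resampling step of the SIS filter: multinomial resampling with replacement
    when the effective sample size drops below the threshold P."""
    n_eff = 1.0 / np.sum(weights ** 2)          # weights assumed normalized
    if n_eff < threshold:
        idx = rng.choice(len(weights), size=len(weights), replace=True, p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))   # reset to 1/N
    return particles, weights

# Usage sketch:
# particles, weights = resample_if_needed(particles, weights, threshold=N / 2,
#                                         rng=np.random.default_rng())
```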

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^(i) ~ p(x_n | x_{n-1}^(i))
        Set x_{0:n}^(i) = {x_{0:n-1}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W~(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W_n^(i)
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
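Putting the algorithm together, here is a compact bootstrap/SIR filter on an assumed linear-Gaussian toy model (x_n = 0.9 x_{n-1} + w_n, y_n = x_n + v_n); the model, noise levels and particle count are all illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Assumed toy state-space model (for illustration only):
#   x_n = 0.9 x_{n-1} + w_n,  w_n ~ N(0, 1)        (transition prior)
#   y_n = x_n + v_n,          v_n ~ N(0, 0.5^2)    (likelihood)
T, N = 50, 1000
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal()
    y[n] = x_true[n] + rng.normal(scale=0.5)

# Bootstrap/SIR filter: propose from the transition prior, weight by the
# likelihood, and resample at every step.
particles = rng.normal(size=N)                                # x_0^(i) ~ p(x_0)
estimates = np.zeros(T)
for n in range(1, T):
    particles = 0.9 * particles + rng.normal(size=N)          # importance sampling
    w = norm.pdf(y[n], loc=particles, scale=0.5)              # W_n^(i) = p(y_n | x_n^(i))
    w /= np.sum(w)                                            # normalize
    estimates[n] = np.sum(w * particles)                      # posterior-mean estimate
    particles = particles[rng.choice(N, size=N, p=w)]         # resampling

print(f"filter RMSE: {np.sqrt(np.mean((estimates - x_true) ** 2)):.3f}")
```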

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed; the SIS filter is less computationally expensive.

The choice of proposal distributions in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n = W_{n-1}^{\alpha}\, \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n\big)}$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure: CONDENSATION.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 70: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Numerical approximations

Monte-carlo sampling approximation (ie particle filter)

GaussianLaplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 71: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can also be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

$$W_n^{(i)} = \big(W_{n-1}^{(i)}\big)^{\alpha} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
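A small sketch of that annealed weight update, done in log space for numerical stability (an implementation choice assumed here, not prescribed by the lecture; all names are illustrative):

    import numpy as np

    def annealed_weight_update(w_prev, log_lik, log_trans, log_prop, alpha=0.5):
        # W_n proportional to (W_{n-1})^alpha * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{0:n-1}, y_n)
        log_w = alpha * np.log(w_prev + 1e-300) + log_lik + log_trans - log_prop
        w = np.exp(log_w - log_w.max())      # subtract the max before exponentiating
        return w / w.sum()                   # return normalized weights

Setting alpha closer to 0 forgets the previous weights faster, which damps the accumulation of skew in the weight distribution.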

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Video: anim.mp4

Popular CONDENSATION

Figure Head tracking. Figure Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Videos: dancemv.mp4, leafmv.mp4

Popular CONDENSATION

Figure Hand tracking. Figure Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Videos: hand.mp4, iccv98.mp4

Popular CONDENSATION

Figure Hand tracking. Figure Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

Videos: imp.mp4, drawing.mp4

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 72: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 73: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

GaussLaplace approximation

Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability

By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method

Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further

compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12

The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x

It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate

Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 74: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Iterative Quadrature

Numerical approximation method which was widely used in computergraphics and physics

A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b

a

f (x)p(x)dx asympmsum

k=1

ck f (xk )

where p(x) is treated as a weighting function and xk is the quadraturepoint

The values xk are determined by the weighting function p(x) in the interval[a b]

This method can produce a good approximation if the nonlinear function issmooth

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69

Muti-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)

If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density

The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density

In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the weight update to

W_n^(i) = (W_{n-1}^(i))^α · p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)

where 0 < α < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
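
A tiny sketch of that annealed update, assuming the unnormalized incremental term (likelihood times transition over proposal) has already been evaluated per particle; incr and alpha are placeholder names of this sketch:

import numpy as np

def annealed_weight_update(w_prev, incr, alpha=0.5):
    # W_n^(i) = (W_{n-1}^(i))^alpha * incr^(i), with 0 < alpha < 1 damping old weights
    w = (np.asarray(w_prev) ** alpha) * np.asarray(incr)
    return w / np.sum(w)    # return normalized weights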

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4] [Video: leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4] [Video: iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4] [Video: drawing.mp4]

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69


Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the grid cannot be partitioned unevenly to give greater resolution to the regions of the state space with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
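
A rough point-mass sketch of one grid-based update on a fixed 1-D grid, assuming a Gaussian random-walk transition and a Gaussian likelihood (both invented here for illustration):

import numpy as np
from scipy.stats import norm

grid = np.linspace(-5.0, 5.0, 201)                 # N_z discrete cell states
mass = np.full(grid.size, 1.0 / grid.size)         # point masses representing p(x_0)

def grid_filter_step(mass, y, sigma_trans=0.5, sigma_obs=0.3):
    # prediction: p(x_n | y_{n-1:0}) = sum_j p(x_n | cell_j) mass_j on the grid
    trans = norm.pdf(grid[:, None], loc=grid[None, :], scale=sigma_trans)
    trans = trans / trans.sum(axis=0, keepdims=True)   # each column is a pmf over cells
    pred = trans @ mass
    # update: multiply by the likelihood p(y_n | x_n) and renormalize
    post = pred * norm.pdf(y, loc=grid, scale=sigma_obs)
    return post / post.sum()

mass = grid_filter_step(mass, y=0.7)
print(grid[np.argmax(mass)])                       # MAP cell after one measurement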

Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N |x^(i)|^k

where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
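
A short numerical illustration of the sample-moment estimate, using samples from an arbitrary zero-mean Gaussian as a stand-in for the true distribution:

import numpy as np

x = np.random.default_rng(1).normal(loc=0.0, scale=2.0, size=100_000)   # x^(i) ~ p(x)
m = {k: np.mean(np.abs(x) ** k) for k in (1, 2, 3)}   # m_k ~ (1/N) sum_i |x^(i)|^k
print(m[2])                                           # ~4.0, the second moment here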

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = Σ_{j=1}^m c_j N(x; μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution obtained by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
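
A small sketch of evaluating such a Gaussian-sum density in 1-D; the particular weights and components are arbitrary choices for illustration:

import numpy as np
from scipy.stats import norm

c   = np.array([0.3, 0.5, 0.2])          # c_j > 0, sum_j c_j = 1
mu  = np.array([-2.0, 0.0, 3.0])
sig = np.array([0.5, 1.0, 0.8])

def gaussian_sum_pdf(x):
    # p(x) = sum_j c_j N(x; mu_j, sig_j^2)
    return np.sum(c * norm.pdf(np.asarray(x)[..., None], loc=mu, scale=sig), axis=-1)

xs = np.linspace(-6.0, 7.0, 2001)
print(gaussian_sum_pdf(xs).sum() * (xs[1] - xs[0]))   # ~1.0: still a valid density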

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{n:0}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^(i)) ≡ p̂(x_n | y_{n:0})

where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation we can estimate the mean of a nonlinear function

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{n:0}) dx_n
          = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^(i)) dx_n
          = (1/N) Σ_{i=1}^N f(x_n^(i)) ≡ f̂_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
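
A quick numerical check of this plain Monte Carlo estimate, taking f(x) = x^2 and i.i.d. samples from a standard normal as a stand-in for the posterior:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # x^(i) assumed i.i.d. from the target density
f_hat = np.mean(x ** 2)          # (1/N) sum_i f(x^(i))
print(f_hat)                     # ~1.0 = E[x^2] under N(0, 1)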

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = p(y_{n:0} | x_n) p(x_n) / q(x_n | y_{n:0})

We can then write

E[f(x_n)] = ∫ f(x_n) [ p(x_n | y_{n:0}) / q(x_n | y_{n:0}) ] q(x_n | y_{n:0}) dx_n
          = ∫ f(x_n) [ W_n(x_n) / p(y_{n:0}) ] q(x_n | y_{n:0}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ p(y_{n:0} | x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n / ∫ W_n(x_n) q(x_n | y_{n:0}) dx_n
          = E_{q(x_n|y_{n:0})}[ W_n(x_n) f(x_n) ] / E_{q(x_n|y_{n:0})}[ W_n(x_n) ]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
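
A minimal self-normalized importance-sampling sketch mirroring this ratio: the target is a standard normal known only up to a constant, the proposal is N(0, 2^2), and we estimate E[f(x)] for f(x) = x^2 (all choices are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=200_000)          # samples from the proposal q
p_unnorm = np.exp(-0.5 * x ** 2)                # target p, known up to a constant
q_unnorm = np.exp(-0.5 * (x / 2.0) ** 2) / 2.0  # proposal q, up to the same constant
W = p_unnorm / q_unnorm                         # W_n(x) = p(x) / q(x) (unnormalized)
f = x ** 2
print(np.sum(W * f) / np.sum(W))                # E_q[W f] / E_q[W] ~ 1.0 = E[x^2]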

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = E_{q(x_n|y_{n:0})}[ W_n(x_n) f(x_n) ] / E_{q(x_n|y_{n:0})}[ W_n(x_n) ]

By drawing i.i.d. samples x_n^(i) from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) f(x_n^(i)) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) ]
          = Σ_{i=1}^N W(x_n^(i)) f(x_n^(i)) ≡ f̂(x)

where the normalized weights are defined as

W(x_n^(i)) = W_n(x_n^(i)) / Σ_{j=1}^N W_n(x_n^(j))

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0} | y_{n:0}) = q(x_0) ∏_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) · p(y_n | x_n) p(x_n | x_{n-1}) / p(y_n | y_{n-1:0})

We can then create a recursive rule to update the weights

W_n^(i) = p(x_{n:0}^(i) | y_{n:0}) / q(x_{n:0}^(i) | y_{n:0})
        ∝ p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) p(x_{n-1:0}^(i) | y_{n-1:0}) / [ q(x_n^(i) | x_{n-1:0}^(i), y_{n:0}) q(x_{n-1:0}^(i) | y_{n-1:0}) ]
        = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
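
The recursion can be coded directly as a per-particle weight update; a sketch with placeholder callables lik_pdf, trans_pdf and prop_pdf for p(y_n | x_n), p(x_n | x_{n-1}) and q(x_n | x_{n-1:0}, y_{n:0}) (the names are assumptions of this sketch):

import numpy as np

def update_weights(w_prev, x_new, x_prev, y_new, lik_pdf, trans_pdf, prop_pdf):
    # W_n^(i) proportional to W_{n-1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1:0}^(i), y_{n:0})
    w = w_prev * lik_pdf(y_new, x_new) * trans_pdf(x_new, x_prev) / prop_pdf(x_new, x_prev, y_new)
    return w / np.sum(w)    # normalized weights for the next step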

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 76: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Moment Approximation

Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments

We can empirically use the sample moment to approximate the truemoment namely

mk = E [xk ] =

intX

xk p(x)dx =1

N

Nsumi=1

|x(i)|k

where mk denotes the k-th order moment and x(i) are the samples from truedistribution

The computation cost of these approaches are rather prohibitive especiallyin highdimensional space

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 77: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)

p(x) =msum

j=1

cjN (xf Σf )

where the weighting coefficients cj gt 0 andsumm

j=1 cj = 1

Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments

Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 78: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Illustration of numerical approximations

Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69

What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to
sample from the so-called proposal distribution q(x_n \mid y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} \mid x_n) \, p(x_n)}{q(x_n \mid y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})} \, q(x_n \mid y_{n:0}) \, dx_n

          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} \, q(x_n \mid y_{n:0}) \, dx_n

          = \frac{\int f(x_n) W_n(x_n) \, q(x_n \mid y_{n:0}) \, dx_n}{\int p(y_{n:0} \mid x_n) \, p(x_n) \, dx_n}

          = \frac{\int f(x_n) W_n(x_n) \, q(x_n \mid y_{n:0}) \, dx_n}{\int W_n(x_n) \, q(x_n \mid y_{n:0}) \, dx_n}

          = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}

By drawing the i.i.d. samples x_n^{(i)} from q(x_n \mid y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
           = \sum_{i=1}^{N} W(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

W(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
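
To see the self-normalized estimator in action, here is a small Python sketch; the unnormalized target, the proposal, and f below are arbitrary assumptions chosen so the exact answer is known.

import numpy as np

rng = np.random.default_rng(1)

# Target p(x) ~ N(1, 0.5^2), known only up to a normalizing constant;
# proposal q(x) = N(0, 2^2), which is easy to sample from.
def target_unnorm(x):
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)

def proposal_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

N = 50_000
x = rng.normal(0.0, 2.0, size=N)            # x^(i) ~ q

W = target_unnorm(x) / proposal_pdf(x)      # unnormalized weights W_n(x^(i))
W_norm = W / W.sum()                        # normalized weights W(x^(i))

f_hat = np.sum(W_norm * x ** 2)             # self-normalized estimate of E_p[x^2]
print(f_hat)                                # close to 1^2 + 0.5^2 = 1.25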

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes as

q(x_{n:0} \mid y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})

As before, the posterior can be written recursively as

p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0}) \, \frac{p(y_n \mid x_n) \, p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})}
          \propto \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} \mid y_{n-1:0})}
          = W_{n-1}^{(i)} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
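
Here is a sketch of one step of this weight recursion for an assumed linear-Gaussian toy model; the model, the proposal, and all names below are illustrative assumptions, not part of the lecture.

import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
N = 1_000

# Assumed toy model: x_n = 0.9 x_{n-1} + N(0, 0.5^2),  y_n = x_n + N(0, 0.2^2).
x_prev = rng.normal(0.0, 1.0, size=N)    # particles x_{n-1}^(i)
W_prev = np.full(N, 1.0 / N)             # previous weights W_{n-1}^(i)
y_n = 0.3                                # current observation

# Proposal q(x_n | x_{n-1:0}, y_n): here simply the transition prior.
x_new = 0.9 * x_prev + rng.normal(0.0, 0.5, size=N)

lik = gauss_pdf(y_n, x_new, 0.2)             # p(y_n | x_n^(i))
trans = gauss_pdf(x_new, 0.9 * x_prev, 0.5)  # p(x_n^(i) | x_{n-1}^(i))
prop = trans                                 # q(...) equals the transition here

# W_n^(i) = W_{n-1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1:0}^(i), y_{n:0})
W_new = W_prev * lik * trans / prop
W_new /= W_new.sum()

Note that when the transition prior is used as proposal, the transition and proposal terms cancel and the update reduces to multiplying by the likelihood, which is exactly the simplification exploited by the bootstrap/SIR filter later in this lecture.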

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n \mid y_{n:0}) than in p(x_{n:0} \mid y_{n:0}).

Let's assume that q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot \mid y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (W(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
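
The approximate effective sample size is a one-liner; a small sketch (the function name is mine):

import numpy as np

def effective_sample_size(W_norm):
    # N_eff ~ 1 / sum_i W_i^2 for normalized weights W_i
    return 1.0 / np.sum(W_norm ** 2)

print(effective_sample_size(np.full(100, 0.01)))                   # uniform weights: 100
print(effective_sample_size(np.array([0.97] + [0.03 / 99] * 99)))  # one dominant weight: ~1

A common (assumed here, not stated on the slides) choice for the threshold is P = N/2.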

SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^{(i)} \sim q(x_n \mid x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weight W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}
    end for
    for i = 1, ..., N do
        Normalize the weights: W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (W(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new particles x_{n:0}^{(j)} by resampling with replacement N times from
        {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W_{n:0}^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
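
The resampling branch of the algorithm above can be implemented with a single multinomial draw; a minimal Python sketch (the function name is mine):

import numpy as np

def resample(particles, W_norm, rng):
    # Resample with replacement N times according to the normalized weights,
    # then reset all weights to 1/N (the if-branch of the algorithm above).
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=W_norm)
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(3)
particles = rng.normal(size=5)
weights = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
particles, weights = resample(particles, weights, rng)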

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} \sim p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} \sim p(x_n \mid x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n \mid x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize weights: W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to W_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
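
Below is a runnable sketch of this bootstrap filter on an assumed 1-D linear-Gaussian model (chosen only so the result could be checked against a Kalman filter); the parameters and names are illustrative, not from the lecture.

import numpy as np

rng = np.random.default_rng(4)

# Assumed toy state-space model:
#   x_n = 0.9 x_{n-1} + v_n,  v_n ~ N(0, 0.5^2)   (transition prior)
#   y_n = x_n + w_n,          w_n ~ N(0, 0.2^2)   (observation model)
A, q_std, r_std, T, N = 0.9, 0.5, 0.2, 50, 500

def likelihood(y, x):
    return np.exp(-0.5 * ((y - x) / r_std) ** 2)   # p(y_n | x_n), up to a constant

# Simulate a trajectory and observations from the model.
x_true = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x_true[n] = A * x_true[n - 1] + rng.normal(0.0, q_std)
    y[n] = x_true[n] + rng.normal(0.0, r_std)

# Bootstrap/SIR filter: transition prior as proposal, resampling at every step.
particles = rng.normal(0.0, 1.0, size=N)   # x_0^(i) ~ p(x_0)
x_est = np.zeros(T)
for n in range(T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}^(i)).
    particles = A * particles + rng.normal(0.0, q_std, size=N)
    # Weight update W_n^(i) = p(y_n | x_n^(i)), then normalization.
    W = likelihood(y[n], particles)
    W /= W.sum()
    # Posterior mean estimate, computed before resampling.
    x_est[n] = np.sum(W * particles)
    # Resampling with replacement according to W_n^(i).
    particles = particles[rng.choice(N, size=N, replace=True, p=W)]

print(np.mean((x_est - x_true) ** 2))   # small mean-squared tracking error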

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is the annealing factor that controls the impact of previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
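
The annealed update only changes the exponent on the previous weight; a short sketch with made-up toy numbers (my own illustration):

import numpy as np

alpha = 0.5                               # annealing factor, 0 < alpha < 1
W_prev = np.array([0.6, 0.3, 0.1])        # previous importance weights (toy values)
incr = np.array([0.2, 1.0, 3.0])          # p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(...)

W_new = W_prev ** alpha * incr            # damped influence of the old weights
W_new /= W_new.sum()
print(W_new)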

Popular CONDENSATION

Figure: CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking / Figure: Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Video: dancemv.mp4]
[Video: leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking / Figure: Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Video: hand.mp4]
[Video: iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking / Figure: Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Video: imp.mp4]
[Video: drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69


What have we seen

We have seen up to now

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

Whatrsquos next

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 80: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 81: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 82: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Particle filter Sequential Monte Carlo estimation

Now we now how to do numerical approximations Letrsquos use it

Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter

The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated

The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation

Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf

However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 / 69
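
A minimal self-normalized importance-sampling sketch of this estimator (target, proposal, and test function are toy choices of mine, not from the lecture):

import numpy as np

def gauss_pdf(x, mu, sigma):
    # density of N(mu, sigma^2), used here for both the target and the proposal
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(0.0, 2.0, size=N)                      # samples from the proposal q = N(0, 2^2)
W = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # unnormalized weights W(x) = p(x)/q(x)
W_tilde = W / W.sum()                                  # normalized weights
print(np.sum(W_tilde * x**2))                          # estimate of E_p[x^2] for p = N(0,1), ~1.0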

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
\propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})}
= W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$$

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 / 69
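
As a sketch, the recursion can be wrapped in a small helper; the density callables (p_lik, p_trans, q_prop) are placeholder names I introduce for illustration, not notation from the lecture:

def update_weight(w_prev, x_curr, x_prev, y_curr, p_lik, p_trans, q_prop):
    # W_n = W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_{n:0})
    return w_prev * p_lik(y_curr, x_curr) * p_trans(x_curr, x_prev) / q_prop(x_curr, x_prev, y_curr)

# if the proposal equals the transition prior, q_prop = p_trans and the update
# reduces to W_n = W_{n-1} * p(y_n|x_n)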

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 / 69

Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 / 69

Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.

A suggested measure of degeneracy is the so-called effective sample size

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it as

$$\hat{N}_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When $\hat{N}_{\mathrm{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 / 69
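
A one-line sketch of the effective-sample-size approximation, assuming the input weights are already normalized:

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~ 1 / sum_i (w_tilde_i)^2, computed from normalized importance weights
    w_tilde = np.asarray(w_tilde, dtype=float)
    return 1.0 / np.sum(w_tilde ** 2)

print(effective_sample_size(np.full(100, 0.01)))        # uniform weights: N_eff = 100
print(effective_sample_size([0.99] + [0.0001] * 100))   # one dominant weight: N_eff close to 1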

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\dfrac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = \dfrac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
  end for
  Compute $\hat{N}_{\mathrm{eff}} = 1\big/\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2$
  if $\hat{N}_{\mathrm{eff}} < P$ then
    Generate new $x_{n:0}^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$ with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 / 69
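
Below is a runnable sketch of this loop for a scalar toy model of my own choosing (random-walk transition, Gaussian likelihood, and the transition prior as proposal, so the weight update reduces to the likelihood); none of the model parameters come from the lecture.

import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def sis_filter(y, N=500, sigma_x=1.0, sigma_y=0.5, ess_threshold=None):
    # SIS particle filter that resamples only when the effective sample size drops below P
    if ess_threshold is None:
        ess_threshold = N / 2.0          # threshold P (a common heuristic, not from the lecture)
    T = len(y)
    x = rng.normal(0.0, 1.0, size=N)     # particles at n = 0
    w = np.full(N, 1.0 / N)              # normalized weights
    means = np.empty(T)
    for n in range(T):
        x = x + rng.normal(0.0, sigma_x, size=N)   # draw from proposal q = transition prior
        w = w * gauss_pdf(y[n], x, sigma_y)        # weight update (likelihood only, for this q)
        w = w / w.sum()                            # normalize the weights
        means[n] = np.sum(w * x)                   # posterior-mean estimate, taken before resampling
        if 1.0 / np.sum(w ** 2) < ess_threshold:   # N_eff below the threshold: resample
            idx = rng.choice(N, size=N, p=w)       # resample with replacement according to the weights
            x = x[idx]
            w = np.full(N, 1.0 / N)                # reset weights to 1/N
    return means

# synthetic data: a random-walk state observed in Gaussian noise
x_true = np.cumsum(rng.normal(0.0, 1.0, size=50))
y_obs = x_true + rng.normal(0.0, 0.5, size=50)
print(sis_filter(y_obs)[:5])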

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the particles carrying high importance weights with many replicates, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 / 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = \dfrac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 / 69
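
As a sketch (with placeholder callables of my own, not from the lecture), a single bootstrap/SIR step differs from the SIS loop above only in that the weights never accumulate and resampling happens at every iteration:

import numpy as np

rng = np.random.default_rng(1)

def bootstrap_step(particles, y_n, transition_sample, likelihood):
    # one SIR step: propagate with the transition prior, weight by the likelihood, resample
    proposed = transition_sample(particles)          # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = likelihood(y_n, proposed)                    # W_n^(i) = p(y_n | x_n^(i))
    W_tilde = W / W.sum()                            # normalized weights
    idx = rng.choice(len(proposed), size=len(proposed), p=W_tilde)
    return proposed[idx]                             # resampled, equally weighted particle set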

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only carried out whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 / 69
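
A sketch of the annealed update as a drop-in replacement for the plain recursion (alpha and the precomputed density values are placeholders I introduce for illustration):

def annealed_weight(w_prev, lik, trans, prop, alpha=0.5):
    # W_n = W_{n-1}^alpha * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n), with 0 < alpha < 1
    return (w_prev ** alpha) * lik * trans / prop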

Popular CONDENSATION

Figure: CONDENSATION [video: anim.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 / 69

Popular CONDENSATION

Figure: Head tracking    Figure: Leaf tracking
[videos: dancemv.mp4, leafmv.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 / 69

Popular CONDENSATION

Figure: Hand tracking    Figure: Hand drawing
[videos: hand.mp4, iccv98.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 / 69

Popular CONDENSATION

Figure: Hand tracking    Figure: Interactive applications
[videos: imp.mp4, drawing.mp4]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 / 69

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now!

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 / 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 83: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 84: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution

p(xn|yn0) asymp 1

N

Nsumi=1

δ(xn minus x(i)n ) equiv p(xn|yn0)

where x(i)n are assumed to be iid drawn from p(xn|yn0)

By this approximation we can estimate the mean of a nonlinear function

E [f (xn)] asympint

f (xn)p(xn|yn0)dxn

=1

N

Nsumi=1

intf (xn)δ(xn minus x(i)

n )dxn

=1

N

Nsumi=1

f (x(i)n ) equiv fN (x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 85: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 86: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 87: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0}|y_{n:0}) = q(x_0) ∏_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) p(y_n|x_n) p(x_n|x_{n-1}) / p(y_n|y_{n-1:0})

We can then create a recursive rule to update the weights:

W_n^{(i)} = p(x_{n:0}^{(i)}|y_{n:0}) / q(x_{n:0}^{(i)}|y_{n:0})

          ∝ p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0}) / [ q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0}) ]

          = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
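A hedged sketch of this recursive weight update, assuming user-supplied vectorized densities `lik_pdf(y, x_new)` for p(y_n|x_n), `trans_pdf(x_new, x_prev)` for p(x_n|x_{n-1}), and `prop_pdf(x_new, x_prev, y)` for the proposal (the names are illustrative, not from the slides):

import numpy as np

def update_weights(W_prev, x_prev, x_new, y, lik_pdf, trans_pdf, prop_pdf):
    """One step of the recursive update
       W_n ∝ W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_{n:0}),
    followed by normalization over the N particles."""
    W = W_prev * lik_pdf(y, x_new) * trans_pdf(x_new, x_prev) / prop_pdf(x_new, x_prev, y)
    return W / W.sum()

# If the proposal equals the transition prior, q = p(x_n|x_{n-1}), the trans_pdf and
# prop_pdf factors cancel and the update reduces to W_n ∝ W_{n-1} * p(y_n|x_n)
# (exactly the bootstrap/SIR weight update of a later slide).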

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_eff = N / E_{q(·|y_{n:0})}[(W(x_{n:0}))^2] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{n:0}^{(i)}))^2

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
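A minimal sketch of the effective-sample-size check, assuming `w` holds the normalized weights W̃ (the threshold N/2 is a common but arbitrary choice, not prescribed by the slides):

import numpy as np

def effective_sample_size(w):
    """w: normalized importance weights (they sum to 1)."""
    return 1.0 / np.sum(w ** 2)

w = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # a skewed weight vector
N = len(w)
if effective_sample_size(w) < N / 2:           # threshold P, here N/2
    print("resample")                          # N_eff ≈ 1.94 << N = 5, so resampling is triggered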

SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights W̃(x_n^{(i)}) = W_n(x_n^{(i)}) / Σ_{j=1}^N W_n(x_n^{(j)})
  end for
  Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}^{(i)}))^2
  if N_eff < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W̃_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
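Below is a compact Python sketch of this SIS-with-resampling loop. It is a minimal illustration rather than reference code from the lecture: `x0_sampler`, `propose`, `prop_pdf`, `trans_pdf`, and `lik_pdf` are assumed user-supplied functions for p(x_0), q, p(x_n|x_{n-1}), and p(y_n|x_n), the initialization is deliberately simplified, and multinomial resampling with threshold P = N/2 is used.

import numpy as np

def sis_filter(ys, x0_sampler, propose, prop_pdf, trans_pdf, lik_pdf, N=500, seed=0):
    """Sequential importance sampling with resampling when N_eff drops below N/2.
    ys: observations y_0..y_T; returns per-step (particles, normalized weights)."""
    rng = np.random.default_rng(seed)
    x = x0_sampler(rng, N)                  # initial particles x_0^{(i)}
    w = np.full(N, 1.0 / N)                 # initial weights
    history = []
    for y in ys:
        x_new = propose(rng, x, y)          # x_n^{(i)} ~ q(x_n | x_{n-1}^{(i)}, y_n)
        w = w * lik_pdf(y, x_new) * trans_pdf(x_new, x) / prop_pdf(x_new, x, y)
        w = w / w.sum()                     # normalize the weights
        neff = 1.0 / np.sum(w ** 2)         # effective sample size
        if neff < N / 2:                    # threshold P, a common choice
            idx = rng.choice(N, size=N, replace=True, p=w)
            x_new = x_new[idx]              # resample with replacement
            w = np.full(N, 1.0 / N)         # reset weights after resampling
        x = x_new
        history.append((x.copy(), w.copy()))
    return history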

BootstrapSIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize weights W̃(x_n^{(i)}) = W_n(x_n^{(i)}) / Σ_{j=1}^N W_n(x_n^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to the weights W̃_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
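For concreteness, here is a hedged sketch of this bootstrap filter on a toy 1-D random-walk model (x_n = x_{n-1} + process noise, y_n = x_n + measurement noise); the model, noise scales, and particle count are invented for illustration and are not part of the lecture.

import numpy as np

rng = np.random.default_rng(1)
T, N = 50, 1000
sigma_x, sigma_y = 0.3, 0.5

# Simulate a toy trajectory and noisy observations.
x_true = np.cumsum(rng.normal(0, sigma_x, size=T))
ys = x_true + rng.normal(0, sigma_y, size=T)

# Bootstrap filter: proposal = transition prior, weight = likelihood, resample every step.
particles = rng.normal(0.0, 1.0, size=N)                        # x_0^{(i)} ~ p(x_0)
estimates = []
for y in ys:
    particles = particles + rng.normal(0, sigma_x, size=N)      # x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    w = np.exp(-0.5 * ((y - particles) / sigma_y) ** 2)         # ∝ p(y_n | x_n^{(i)})
    w /= w.sum()                                                # normalized weights
    estimates.append(np.sum(w * particles))                     # posterior-mean estimate before resampling
    idx = rng.choice(N, size=N, replace=True, p=w)
    particles = particles[idx]

print("RMSE:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))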

Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^α p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
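A short sketch of the annealed update, assuming the same illustrative densities `lik_pdf`, `trans_pdf`, and `prop_pdf` as in the earlier weight-update sketch (all hypothetical names, not from the slides):

import numpy as np

def annealed_weight_update(w_prev, x_prev, x_new, y, lik_pdf, trans_pdf, prop_pdf, alpha=0.5):
    """Damped SIS weight update
       W_n ∝ W_{n-1}^alpha * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1:0}, y_n),
    where 0 < alpha < 1 flattens the influence of the accumulated weight history."""
    w = (w_prev ** alpha) * lik_pdf(y, x_new) * trans_pdf(x_new, x_prev) / prop_pdf(x_new, x_prev, y)
    return w / w.sum()

Setting alpha = 1 recovers the ordinary SIS update, while alpha close to 0 makes the new weights depend almost entirely on the current likelihood, transition, and proposal terms.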

Popular CONDENSATION

Figure: CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

[Video: anim.mp4]

Popular CONDENSATION

Figure: Head tracking. Figure: Leaf tracking.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

[Videos: dancemv.mp4, leafmv.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Hand drawing.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

[Videos: hand.mp4, iccv98.mp4]

Popular CONDENSATION

Figure: Hand tracking. Figure: Interactive applications.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

[Videos: imp.mp4, drawing.mp4]

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 88: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 89: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 90: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define

Wn(xn) =p(yn0|xn)p(xn)

q(xn|yn0)

We can then write

E [f (xn)] =

intf (xn)

p(xn|yn0)

q(xn|yn0)q(xn|yn0)dxn

=

intf (xn)

Wn(xn)

p(yn0)q(xn|yn0)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

p(yn0|xn)p(xn)dxn

=

intf (xn)Wn(xn)q(xn|yn0)dxnint

Wn(xn)q(xn|yn0)dxn

=Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 91: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation III

We have written

E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]

Eq(xn|yn0)[Wn(xn)]

By drawing the iid samples x(i)n from q(xn|yn0) we can approximate

E [f (xn)] asymp1N

sumNi=1 Wn(x

(i)n )f (x

(i)n )

1N

sumNi=1 Wn(x

(i)n )

=Nsum

i=1

W (x(i)n )f (x(i)

n ) equiv f (x)

where the normalized weights are defined as

W (x(i)n ) =

Wn(x(i)n )sumN

i=1 Wn(x(i)n )

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 92: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(xn0|yn0) = q(x0)nprod

t=1

q(xt |xtminus10 yt0)

As before the posterior can be written as

p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)

p(yn|ynminus10)

We can then create a recursive rule to update the weights

W (i)n =

p(x(i)n0|yn0)

q(x(i)n0|yn0)

propp(yn|x(i)

n )p(x(i)n |x(i)

nminus1)p(x(i)nminus10|ynminus10)

q(x(i)n |x(i)

nminus10 yn0)q(x(i)nminus10|ynminus10)

= W(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn0)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 93: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Types of filters

Depending on the type of sampling use we have different types of filters

Sequential Importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive

The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance

Normally the posterior estimate (and its relevant statistics) should becalculated before resampling

In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies

To alleviate the sample degeneracy in SIS filter we can change

Wn = W αnminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

animmp4
Media File (videomp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

dancemvmp4
Media File (videomp4)
leafmvmp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

handmp4
Media File (videomp4)
iccv98mp4
Media File (videomp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

impmp4
Media File (videomp4)
drawingmp4
Media File (videomp4)

More

If you want to learn more look at the additional material

Otherwise do the research project on this topic

Next week we will do human pose estimation

Letrsquos do some exercises now

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69

  • Introduction
  • Monte Carlo sampling
    • Introduction
    • Types
    • Importance Sampling
    • Rejection sampling
    • Sequential Importance Sampling
    • Sequential Importance Sampling
    • Gibbs sampling
    • Other sampling
      • Other numerical approximations
        • Laplace approximation
        • Iterative Quadrature
        • Multi-grid approximation
        • Moment approximation
        • Gaussian sum approximation
          • Particle filters
            • Introduction
            • SIR filter
            • BootstrapSIR filter
            • Condensation
              • Conclusions
Page 94: Human Motion Analysis Lecture 6: Bayesian Filteringurtasun/courses/ETH10/lecture6.pdfStochastic Filtering Theory: Kalman ltering (1940’s by Wiener and Kolmogorov). Bayesian Theory

Sequential Importance sampling (SIS) filter I

We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)

Letrsquos assume that q(x(i)n |x(i)

nminus10 yn0) = q(x(i)n |x(i)

nminus10 yn) then we can write

W (i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x(i)

nminus1)

q(x(i)n |x(i)

nminus10 yn)

The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69

Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step

A suggested measure for degeneracy is the so-called effective sample size

Neff =N

Eq(middot|yn0)[(W (xn0))2]le N

In practice this cannot be computed so we approximate

Neff asymp1sumN

i=1(W (xn0))2

When Neff is below a threshold P then resampling is performed

Neff can be also used to combine rejection and importance sampling

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69

SIS particle filter with resampling

for n = 0 middot middot middot T dofor i = 1 middot middot middot N do

Draw samples x(i)n sim q(xn|x(i)

nminus10 yn0)

Set x(i)n0 = x(i)

nminus10 x(i)n

end forfor i = 1 middot middot middot N do

Calculate weights W(i)n = W

(i)nminus1

p(yn|x(i)n )p(x

(i)n |x

(i)nminus1)

q(x(i)n |x

(i)nminus10yn)

end forfor i = 1 middot middot middot N do

Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forCompute Neff = 1PN

i=1(W (xn0))2

if Neff lt P then

Generate new x(j)n by resampling with replacement N times from x(i)

n0 with

probability P(x(j)n0 = x

(i)n0) = W

(i)n0

Reset the weights W(i)n = 1

Nend if

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69

BootstrapSIR filter

The key idea of SIR filter is to introduce the resampling step as in theSIR sampling

Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights

It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69

SIR filter using transition prior as proposal distribution

for i = 1 middot middot middot N do

Sample x(i)0 sim p(x0)

Compute W(i)0 = 1

Nend forfor n = 0 middot middot middot T do

for i = 1 middot middot middot N do

Importance sampling x(i)n sim p(xn|x(i)

nminus1)end forSet x

(i)n0 = x(i)

nminus10 x(i)n

for i = 1 middot middot middot N do

Weight update W(i)n = p(yn|x(i)

n )end forfor i = 1 middot middot middot N do

Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))

end forResampling Generate N new particles x

(i)n from the set x(i)

n according to W(i)n

end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69

Illustration of a generic particle filter

Figure Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69

Remarks on SIS and SIR filters

In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate sample degeneracy in the SIS filter, we can change the weight update to

$$
W_n = W_{n-1}^{\alpha}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}
$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
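
In code, this annealing trick is a one-line change to the weight update of the earlier SIS sketch; the default value alpha = 0.8 and the argument names are assumptions for illustration:

import numpy as np

def annealed_weight_update(w_prev, likelihood, transition, proposal, alpha=0.8):
    # Annealed SIS weight update: raising the previous weights to alpha (0 < alpha < 1)
    # damps their influence and slows down weight degeneracy.
    w = (w_prev ** alpha) * likelihood * transition / proposal
    return w / w.sum()   # return normalized weights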

Popular CONDENSATION

Figure CONDENSATION

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69

Media file: anim.mp4 (video/mp4)

Popular CONDENSATION

Figure Head tracking Figure Leaf tracking

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 66 69

Media files: dancemv.mp4, leafmv.mp4 (video/mp4)

Popular CONDENSATION

Figure Hand tracking Figure Hand drawing

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 67 69

Media files: hand.mp4, iccv98.mp4 (video/mp4)

Popular CONDENSATION

Figure Hand tracking Figure Interactive applications

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 68 69

Media files: imp.mp4, drawing.mp4 (video/mp4)

More

If you want to learn more, look at the additional material.

Otherwise, do the research project on this topic.

Next week we will do human pose estimation.

Let's do some exercises now.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 69 69
