

Neurocomputing 80 (2012) 47–53

Contents lists available at SciVerse ScienceDirect

Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2011.07.029
Corresponding author. Tel.: +44 1223 748511; fax: +44 1223 332662.
E-mail addresses: [email protected] (R. Turner), [email protected] (C.E. Rasmussen).

Model based learning of sigma points in unscented Kalman filtering

Ryan Turner, Carl Edward Rasmussen

University of Cambridge, Department of Engineering, Trumpington Street, Cambridge CB2 1PZ, UK

Article info

Available online 6 November 2011

Keywords:

Unscented Kalman filtering

Sigma points

State-space

Machine learning

Global optimization

Gaussian process

12/$ - see front matter & 2011 Elsevier B.V. A

016/j.neucom.2011.07.029

esponding author. Tel.: þ44 1223 748511; fa

ail addresses: [email protected] (R. Turner),

am.ac.uk (C.E. Rasmussen).

Abstract

The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Filtering in linear dynamical systems (LDS) and nonlinear dynamical systems (NLDS) is frequently used in many areas, such as signal processing, state estimation, control, and finance/econometric models. Filtering (inference) aims to estimate the state of a system from a stream of noisy measurements. Imagine tracking the location of a car based on odometer and global positioning system (GPS) sensors, both of which are noisy. Sequential measurements from both sensors are combined to overcome the noise in the system and to obtain an accurate estimate of the system state. Even when the full state is only partially measured, it can still be inferred; in the car example the engine temperature is unobserved, but can be inferred via the nonlinear relationship from acceleration. To exploit this relationship appropriately, inference techniques in nonlinear models are required; they play an important role in many practical applications.

LDS and NLDS belong to a class of models known as state-space models. A state-space model assumes that there exists a sequence of latent states x_t that evolve over time according to a Markovian process specified by a transition function f. The latent states are observed indirectly in y_t through a measurement function g. We consider state-space models given by

x_t = f(x_{t−1}) + ε,  x_t ∈ R^M,
y_t = g(x_t) + ν,  y_t ∈ R^D.   (1)

Here, the system noise ε ~ N(0, Σ_ε) and the measurement noise ν ~ N(0, Σ_ν) are both Gaussian. In the LDS case, f and g are linear functions, whereas the NLDS covers the general nonlinear case.


Kalman filtering [1] corresponds to exact (and fast) inference in the LDS, which can only model a limited set of phenomena. For the last few decades, there has been interest in NLDS for more general applicability. In the state-space formulation, nonlinear systems do not generally yield analytically tractable algorithms.

The most widely used approximations for filtering in NLDS are the extended Kalman filter (EKF) [2], the unscented Kalman filter (UKF) [3], and the cubature Kalman filter (CKF) [4]. The EKF linearizes f and g at the current estimate of x_t and treats the system as a nonstationary linear system even though it is not. The UKF and CKF propagate several estimates of x_t through f and g and reconstruct a Gaussian distribution assuming the propagated values came from a linear system. The locations of the estimates of x_t are known as the sigma points. Many heuristics have been developed to help set the sigma point locations [5]. Unlike the EKF, the UKF has free parameters that determine where to put the sigma points.

Our contribution is a strategy for improving the UKF through a novel learning algorithm for appropriate sigma point placement: we call this method UKF-L. The key idea in UKF-L is that the UKF and EKF are doing exact inference in a model that is somewhat "perverted" from the original model described in the state-space formulation. The interpretation of the EKF and UKF as models, not merely approximate methods, allows us to better identify their underlying assumptions. This interpretation also enables us to learn the free parameters in the UKF in a model based manner from training data. If the sigma point settings are a poor fit to the underlying dynamical system, the UKF can make horrendously poor predictions.

2. Unscented Kalman filtering

We first review how filtering and the UKF work and then explain the UKF's generative assumptions. Filtering methods consist

Fig. 1. An illustration of a good and bad assignment of sigma points. The lower panel shows the true input distribution. The center panel shows the sinusoidal system function f (blue) and the sigma points for α = 1 (red crosses) and α = 0.68 (green rings). The left panel shows the true predictive distribution (shaded), the predictive distribution under α = 1 (red spike) and α = 0.68 (green). Using a different set of sigma points we get either a completely degenerate solution (a delta spike) or a near optimal approximation within the class of Gaussian approximations. The parameters β and κ are fixed at their defaults in this example.

of three steps: time update, prediction step, and measurement update. They iterate in a predictor-corrector setup. In the time update we find p(x_t | y_{1:t−1}),

p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1},   (2)

using p(x_{t−1} | y_{1:t−1}). In the prediction step we predict the observed space, p(y_t | y_{1:t−1}), using p(x_t | y_{1:t−1}),

p(y_t | y_{1:t−1}) = ∫ p(y_t | x_t) p(x_t | y_{1:t−1}) dx_t.   (3)

Finally, in the measurement update we find p(x_t | y_t) using information from how good (or bad) the prediction in the prediction step is,

p(x_t | y_{1:t}) ∝ p(y_t | x_t) p(x_t | y_{1:t−1}).   (4)

In the linear case all of these equations can be done analytically using matrix multiplications. The EKF explicitly linearizes f and g at the point E[x_t] at each step. The UKF uses the whole distribution on x_t, not just the mean, to place sigma points and implicitly linearize the dynamics, which we call the unscented transform (UT). In one dimension the sigma points roughly correspond to the mean and the α-standard-deviation points; the UKF generalizes this idea to higher dimensions. The exact placement of the sigma points depends on the unitless parameters {α, β, κ} ∈ R⁺ through

X_0 := μ,  X_i := μ ± (√((D + λ)Σ))_i,   (5)

λ := α²(D + κ) − D,   (6)

where (√·)_i refers to the ith row of the Cholesky factorization.¹ The sigma points have weights assigned by

w_0^m := λ/(D + λ),  w_0^c := λ/(D + λ) + (1 − α² + β),   (7)

w_i^m := w_i^c := 1/(2(D + λ)),   (8)

where w^m is used to reconstruct the predicted mean and w^c is used for the predicted covariance. We interpret the unscented transform as approximating the input distribution by 2D + 1 point masses at X with weights w. Once the sigma points X have been calculated, the filter accesses f and g as black boxes to find Y_t, either f(X_t) or g(X_t) depending on the step. The UKF reconstructs a Gaussian predictive distribution with the mean and covariance determined from Y_t, pretending the dynamics had been linear. In other words, the equation used to find the approximating Gaussian is such that the UKF is exact for a linear dynamics model f and observation model g. It does not guarantee that the moments will match the moments of the true non-Gaussian distribution. The UKF is a black box filter, as opposed to the EKF, which requires derivatives of f and g.
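As a concrete sketch of eqs. (5)-(8) and the moment reconstruction, the unscented transform fits in a few lines of numpy. This is an illustrative implementation under our own naming, with the row-Cholesky convention of the footnote:

```python
import numpy as np

def unscented_transform(mu, Sigma, f, alpha=1.0, beta=0.0, kappa=None):
    """Propagate N(mu, Sigma) through f with 2D+1 sigma points, eqs. (5)-(8)."""
    D = len(mu)
    if kappa is None:
        kappa = 3.0 - D                          # common default, kappa = 3 - D
    lam = alpha**2 * (D + kappa) - D             # eq. (6)
    # Rows of A with (D + lam) * Sigma = A^T A, as in eq. (5) and footnote 1
    A = np.linalg.cholesky((D + lam) * Sigma).T
    X = np.vstack([mu, mu + A, mu - A])          # 2D+1 sigma points
    wm = np.full(2 * D + 1, 1.0 / (2.0 * (D + lam)))  # eq. (8)
    wc = wm.copy()
    wm[0] = lam / (D + lam)                      # eq. (7)
    wc[0] = lam / (D + lam) + (1.0 - alpha**2 + beta)
    Y = np.array([f(x) for x in X])              # black-box access to f or g
    mean = wm @ Y                                # reconstructed predictive mean
    diff = Y - mean
    cov = (wc[:, None] * diff).T @ diff          # reconstructed predictive covariance
    return mean, cov
```

For a linear f the reconstructed moments are exact, which is the sense in which the UKF "pretends" the dynamics are linear.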

Both the EKF and the UKF approximate the nonlinear state-space as a nonstationary linear system. The UKF defines its own generative process, which linearizes the nonlinear functions f and g wherever in x_t a UKF filtering the time series would expect x_t to be. Therefore, it is possible to sample synthetic data from the UKF by sampling from its one-step-ahead predictions, as seen in Algorithm 1. The sampling procedure augments the filter: predict-sample-correct. If we use the UKF with the same {α, β, κ} used to generate synthetic data, then the one-step-ahead predictive distribution will be the exact same distribution the data point was sampled from. Note that in an LDS, if we sample from the one-step-ahead predictions of a Kalman filter we are sampling from the same process as if we sampled from the latent states x and then sampled from the observation model to get y.

¹ If √P = A with P = AᵀA, then we use the rows in (5). If P = AAᵀ, then we use the columns.

Algorithm 1. Sampling data from the UKF's implicit model

1: p(x_1 | ∅) ← N(μ_0, Σ_0)
2: for t = 1 to T do
3:   Prediction step: p(y_t | y_{1:t−1}) using p(x_t | y_{1:t−1})
4:   Sample y_t from the prediction step distribution
5:   Measurement update: p(x_t | y_{1:t}) using y_t
6:   Time update: find p(x_{t+1} | y_{1:t}) using p(x_t | y_{1:t})
7: end for

2.1. Setting the parameters

We summarize all the parameters as θ := {α, β, κ}. For any setting of θ the UKF will give identical predictions to the Kalman filter if f and g are both linear. Many of the heuristics for setting θ assume f and g are linear (or close to it), which is not the problem the UKF solves. For example, one of the heuristics for setting θ is that β = 2 is optimal if the state distribution p(x_t | y_{1:t}) is exactly Gaussian [6]. However, the state distribution will seldom be Gaussian unless the system is linear, in which case any setting of θ gives exact inference! It is often recommended to set the parameters to α = 1, β = 0, and κ = 3 − D [7,8]. The CKF can be constructed as a special case of a UKF with α = 1, β = 0, and κ = 0.
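A quick numerical check of the last claim, sketched with our own helper name: with α = 1, β = 0, κ = 0 the scaling λ in eq. (6) vanishes, so the center point gets zero weight and the remaining 2D points share equal weight 1/(2D), which is the CKF's cubature rule.

```python
def ut_weights(D, alpha, beta, kappa):
    """Sigma point weights from eqs. (6)-(8)."""
    lam = alpha**2 * (D + kappa) - D
    w0_mean = lam / (D + lam)                      # eq. (7), mean weight
    w0_cov = w0_mean + (1.0 - alpha**2 + beta)     # eq. (7), covariance weight
    wi = 1.0 / (2.0 * (D + lam))                   # eq. (8), shared off-center weight
    return w0_mean, w0_cov, wi
```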

3. The Achilles’ heel of the UKF

The UKF can have embarrassingly poor performance because its predictive variances can be far too small if the sigma points are placed in unlucky locations. Insufficient predictive variance will cause observations to have too much weight in the measurement update, which causes the UKF to fit to noise. This means the UKF will perform poorly even when evaluated on root-mean-square error (RMSE), which only uses the predictive mean. On the NLL the situation is even worse, since too-small predictive variances are heavily penalized.

In the most extreme case, the UKF can give a delta spike predictive distribution. We call this sigma point collapse. As seen in Fig. 1, when the sigma points are arranged together horizontally the UKF has no way to know the function varies anywhere. We aim to learn the parameters θ in such a way that collapse


Fig. 2. Illustration of the UKF when applied to a pendulum system. Cross-sections of the marginal likelihood (blue line) varying the parameters one at a time from the defaults (red vertical line). We shift the marginal likelihood, in nats per observation, to make the lowest value zero. The dashed green line is the total variance diagnostic d := E[log(|Σ|/|Σ_0|)], where Σ is the predictive variance in one-step-ahead prediction. We divide out the variance Σ_0 of the time series when treating it as iid to make d unitless. Values of θ with small predictive variances closely track the θ with low marginal likelihood. Note that the "noise" is a result of the sensitivity of the predictions to the parameters θ due to sigma point collapse, not randomness in the algorithm as is the case with a particle filter.

becomes unlikely. Anytime collapse happens in training, the marginal likelihood will be substantially lower. Hence, the learned parameters will avoid anywhere this delta spike occurred in training. Maximizing the marginal likelihood is tricky since it is not well-behaved for settings of θ that cause sigma point collapse.

4. Model based learning

A common approach to estimating model parameters θ in general is to maximize the log marginal likelihood

ℓ(θ) := log p(y_{1:T} | θ) = Σ_{t=1}^{T} log p(y_t | y_{1:t−1}, θ).   (9)

Hence we can equivalently maximize the sum of the one-step-ahead predictive log likelihoods. One might be tempted to apply a gradient based optimizer on (9), but as seen in Fig. 2 the marginal likelihood can be very unstable. The instability in the likelihood is likely the result of the phenomenon explained in Section 3, where a slight change in parameterization can avoid problematic sigma point placement. This makes the application of a gradient-based optimizer hopeless.
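Eq. (9) is cheap to evaluate once a filter supplies Gaussian one-step-ahead predictions. A minimal sketch (our names, assuming scalar Gaussian predictive distributions) also shows why collapsed, overconfident variances are so heavily penalized:

```python
import numpy as np

def log_marginal_likelihood(y, pred_means, pred_vars):
    """Eq. (9): sum of one-step-ahead Gaussian log predictive densities.

    pred_means[t], pred_vars[t] parameterize p(y_t | y_{1:t-1}, theta),
    e.g. as produced by a filter's prediction step.
    """
    y, m, v = map(np.asarray, (y, pred_means, pred_vars))
    # Gaussian log density, summed over time steps
    return np.sum(-0.5 * np.log(2.0 * np.pi * v) - 0.5 * (y - m) ** 2 / v)
```

A near-zero predictive variance with even a modest prediction error makes the quadratic term enormous, so sigma point collapse during training drags the whole objective down.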

We could apply Markov chain Monte Carlo (MCMC) and integrate out the parameters. However, this is usually overkill, as the posterior on θ is usually highly peaked unless T is very small. Tempering must be used, as mixing will be difficult if the chain is not initialized inside the posterior peak. Even in the case when T is small enough to spread the posterior out, we would still like a single point estimate for computational speed on the test set.²

We focus on learning using a Gaussian process (GP) based optimizer [10,11]. Since the marginal likelihood surface has an underlying smooth function but contains what amounts to additive noise, a probabilistic regression method seems a natural fit for finding the maximum.

5. Gaussian process optimizers

Gaussian processes form a prior over functions. Estimating the parameters amounts to finding the maximum of a structured function: the log marginal likelihood. Therefore, it seems natural to use a prior over functions to guide our search. Given that Gaussian processes form a prior over functions we can use them

² If we want to integrate the parameters out we must run the UKF with each sample of θ | y_{1:T} during test and average. To get the optimal point estimate of the posterior we would like to compute the Bayes' point [9].

for global optimization. The same principle has been applied to integration in Rasmussen and Ghahramani [12].

GP optimization (GPO) allows for effective derivative free optimization. We consider the maximization of a likelihood function ℓ(θ). GPs allow for derivative information ∂_θ ℓ to be included as well, but in our case that will not be very useful due to the function's instability.

GPO treats optimization as a sequential decision problem in a probabilistic setting, receiving reward r when using the right input θ to get a large function value output ℓ(θ). This setup is also known as a Markov decision process (MDP) [13, Chapter 1]. At each step GPO uses its posterior over the objective function p(ℓ(θ)) to look for θ it believes has a large function value ℓ(θ). A greedy maximization strategy will always evaluate the function where the mean function E[ℓ(θ)] is the largest. A strategy that trades off exploration with exploitation will take into account the posterior variance Var[ℓ(θ)]. Areas of θ with high variance carry a possibility of having a large function value or high reward r. The optimizer is programmed to evaluate at the maxima of

J(θ) := E[ℓ(θ)] + K √(Var[ℓ(θ)]),   (10)

where K is a constant to control the exploration-exploitation trade-off. The objective J is known as the upper confidence bound (UCB), since it optimizes the maximum of a confidence interval on the function value. The UCB has recently been analyzed theoretically by Srinivas et al. [11]. The optimizer must also find the maximum of J, but since it is a combination of the GP mean and variance functions it is easy to optimize with gradient methods.

We summarize the method in Algorithm 2; the subroutine to compute J is shown in Algorithm 3.³ GPO assumes that we provide a feasible set of θ to search within. We can do the optimization using ℓ as the objective h in Algorithm 2. Algorithm 3 optionally adds a barrier to make sure that new candidate points for evaluation stay within the feasible set. For instance, if the feasible set is the unit cube, the 20th norm ||θ||_20 forms an effective barrier. Alternatively, a constrained optimization routine could be used. Every iteration we consider another set of C candidate points at which to evaluate ℓ(θ).

We illustrate the iterations of GPO in Fig. 3. The function in the figure is highly non-convex and the global approach of the GP helps greatly. We consider a feasible region of θ ∈ [0, 10] and initialize the search by evaluating the function at the edges and

³ We use the notation A\y equivalently to A⁻¹y, except that we expect the computation to be done using matrix back substitution rather than explicit calculation of the inverse followed by multiplication.

Fig. 3. Illustration of GPO in one dimension. The red line represents the (highly non-convex) function we are attempting to optimize with respect to its input θ. Its true maximum is shown by the green circle. The black line and the shaded region represent the GP predictive mean and two standard deviation error bars. The black + marks show the points that have already been evaluated. The red × is the maximum of J(θ), shown by the red horizontal line. Here we use K = 2, so the UCB criterion J corresponds to the top of the two standard deviation error bars shown in the plot. Every iteration GPO merely evaluates the function where the top of the error bars is highest.

the midpoint: {0, 5, 10}. The figure illustrates how the UCB criterion J trades off exploitation with exploration. For instance, a purely explorative strategy would evaluate the function at θ = 9 in Fig. 3(f), since the error bars are the largest there even though the optimum does not occur there. Likewise, a purely (greedy) exploitative strategy would get stuck evaluating the function at θ = 5 in Fig. 3(a) and never discover the maximum at θ = 1.

Algorithm 2. Gaussian process optimization

1: function GPO(Range, h, C)
2:   Set E to be the dimension of inputs x
3:   ▷ Use end points and midpoints of the feasible set
4:   x ← multiGrid(Range) ∈ R^{E×3^E}
5:   ▷ Evaluate h(x) at all 3^E grid points
6:   y ← h(x) ∈ R^{3^E}
7:   for i = 1 to maximum function evaluations do
8:     Pre-compute L, the Cholesky factor of the GP covariance matrix
9:     for j = 1 to C do
10:      ▷ Sample a random initial point
11:      x_0 ~ N(E[x], Cov x) ∈ R^E
12:      ▷ Maximize the UCB criterion J w.r.t. x*
13:      ▷ x* initialized at x_0
14:      (S_j, F_j) ← max_{x*} UCB(x*, x, y, K, L)
15:    end for
16:    ▷ Find best init. z ∈ {1, ..., C} from the list of candidate solutions S and values F
17:    append S(argmax_z F) to x   ▷ x now R^{E×(3^E + i)}
18:    append h(S(argmax_z F)) to y   ▷ y now R^{3^E + i}
19:  end for
20:  Return x(argmax y)
21: end function
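The outer loop of Algorithm 2 can be sketched in numpy. This is a simplified 1-D illustration rather than the authors' implementation: the C random restarts of the inner maximization are replaced by a deterministic argmax of the UCB over a grid, and the GP hyperparameters (lengthscale, signal variance) are fixed rather than learned.

```python
import numpy as np

def gpo_1d(h, lo, hi, iters=15, K=2.0, ell=1.0, sf=5.0, jitter=1e-6):
    """Sketch of Algorithm 2 in 1-D: GP posterior + UCB maximized on a grid."""
    X = [lo, 0.5 * (lo + hi), hi]               # end points and midpoint
    y = [h(x) for x in X]
    grid = np.linspace(lo, hi, 201)
    # Squared-exponential covariance between two 1-D point sets
    k = lambda a, b: sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    for _ in range(iters):
        Xa, ya = np.array(X), np.array(y)
        L = np.linalg.cholesky(k(Xa, Xa) + jitter * np.eye(len(X)))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, ya))  # K^{-1} y by back substitution
        Ks = k(grid, Xa)                        # cross-covariances to the grid
        mean = Ks @ alpha                       # posterior mean E[l(theta)]
        V = np.linalg.solve(L, Ks.T)
        var = np.clip(sf**2 - np.sum(V**2, axis=0), 0.0, None)
        J = mean + K * np.sqrt(var)             # UCB criterion, eq. (10)
        x_new = float(grid[np.argmax(J)])       # evaluate where the error-bar top is highest
        X.append(x_new)
        y.append(h(x_new))
    best = int(np.argmax(y))
    return X[best], y[best]
```

On a toy objective such as −(θ − 3)², the loop first splits the large gaps between {0, 5, 10}, where the posterior variance is high, and then homes in on the maximum.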

Algorithm 3. Upper confidence bound

1: function UCB(x*, x, y, K, L)
2:   ▷ compute E[x* | x, y] and Var[x* | x, y] using a GP in a numerically stable way
3:   find K** ∈ R⁺ and K_{x,*} ∈ R^{|x|×1}
4:   a ← Lᵀ\(L\y)
5:   h ← K_{x,*}ᵀ a   ▷ E[x* | x, y]
6:   b ← Lᵀ\(L\K_{x,*})
7:   h_sd ← √(K** − K_{x,*}ᵀ b)   ▷ √(Var[x* | x, y])
8:   J ← h + K h_sd   ▷ find the UCB
9:   Add a barrier to J, e.g. ||x*||_20   ▷ Optional
10:  Compute dJ w.r.t. x*
11:  Return J and its derivatives dJ
12: end function
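Lines 4-7 of Algorithm 3 are the standard Cholesky-based GP posterior, using back substitution (footnote 3) instead of an explicit inverse. A sketch with our own names, which can be checked against the naive K⁻¹ computation:

```python
import numpy as np

def gp_ucb(Kxx, Kxs, Kss, y, K=2.0):
    """UCB of Algorithm 3: GP posterior mean/sd via Cholesky back substitution.

    Kxx: training covariance matrix; Kxs: train-test cross covariances (vector);
    Kss: scalar prior variance at the test point; y: training targets.
    """
    L = np.linalg.cholesky(Kxx)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))     # a = L^T \ (L \ y), line 4
    h = Kxs @ a                                         # E[x* | x, y], line 5
    b = np.linalg.solve(L.T, np.linalg.solve(L, Kxs))   # b = L^T \ (L \ K_{x,*}), line 6
    h_sd = np.sqrt(Kss - Kxs @ b)                       # sqrt Var[x* | x, y], line 7
    return float(h + K * h_sd)                          # the UCB, line 8
```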

6. Experiments and results

We test our method on filtering in three dynamical systems: the sinusoidal dynamics used in Turner et al. [14], the Kitagawa dynamics used in Deisenroth et al. [15] and Kitagawa [16], and the pendulum dynamics used in Deisenroth et al. [15]. The sinusoidal dynamics are described by

x_{t+1} = 3 sin(x_t) + w,  w ~ N(0, 0.1²),   (11)

y_t = σ(x_t/3) + v,  v ~ N(0, 0.1²),   (12)

where σ(·) represents a logistic sigmoid. The Kitagawa model is described by

x_{t+1} = 0.5 x_t + 25 x_t / (1 + x_t²) + w,  w ~ N(0, 0.2²),   (13)

y_t = 5 sin(2 x_t) + v,  v ~ N(0, 0.01²).   (14)
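Training and test sets for these benchmarks are just ancestral samples from the state-space model. A hypothetical generator for the Kitagawa model, eqs. (13)-(14), with a seeded rng and our own function name:

```python
import numpy as np

def simulate_kitagawa(T, x1=0.0, seed=0):
    """Sample a latent trajectory and observations from eqs. (13)-(14)."""
    rng = np.random.default_rng(seed)
    xs, ys, x = [], [], x1
    for _ in range(T):
        # Transition, eq. (13), with system noise std 0.2
        x = 0.5 * x + 25.0 * x / (1.0 + x**2) + rng.normal(0.0, 0.2)
        # Measurement, eq. (14), with observation noise std 0.01
        y = 5.0 * np.sin(2.0 * x) + rng.normal(0.0, 0.01)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

The sampled latent trajectory quickly settles near one of the stable fixed points x = ±7 of the transition function.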

The Kitagawa model was presented as a filtering problem in Kitagawa [16]. The pendulum dynamics are described by a discretized ordinary differential equation (ODE) at Δt = 400 ms. The pendulum possesses a mass m = 1 kg and a length l = 1 m. The pendulum angle φ is measured anti-clockwise from the upward position. The state x = [φ, φ̇]ᵀ of the pendulum is given by the angle φ and the angular velocity φ̇. The ODE is

d/dt [φ̇, φ]ᵀ = [−(m l g sin φ)/(m l²), φ̇]ᵀ,   (15)
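After canceling m·l, eq. (15) is simply φ̈ = −(g/l) sin φ with the sign as written in the equation. A hypothetical Euler discretization at the stated Δt = 400 ms (the paper does not specify the integrator; a finer or higher-order scheme may well have been used):

```python
import numpy as np

def pendulum_step(state, dt=0.4, g=9.81, l=1.0):
    """One Euler step of eq. (15); state = [phi_dot, phi], matching the ODE's ordering."""
    phi_dot, phi = state
    phi_ddot = -g * np.sin(phi) / l   # m*l cancels in eq. (15)
    return np.array([phi_dot + dt * phi_ddot, phi + dt * phi_dot])
```

At φ = π (the downward position, the mean of the initial state in Section 6.3) the angular acceleration vanishes, so a resting pendulum stays put.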

where g is the acceleration of gravity. This model is commonly used in stochastic control for the inverted pendulum problem [17]. The measurement function is

y_t = [arctan((p_1 − l sin φ_t)/(p_1 − l cos φ_t)), arctan((p_2 − l sin φ_t)/(p_2 − l cos φ_t))]ᵀ,  [p_1, p_2]ᵀ = [1, −2]ᵀ,   (16)

which corresponds to a bearings only measurement, since we do not directly observe the velocity. We use system noise Σ_w = diag([0.1², 0.3²]) and observation noise Σ_v = diag([0.2², 0.2²]).

For all the problems we compare the UKF-L with the UKF-D, CKF, EKF, GP-UKF, GP assumed density filter (GP-ADF), and time independent model (TIM); we use UKF-D to denote a UKF with default parameter settings, and UKF-L for learned

parameters. The TIM treats the data as iid normal and is inserted as a reference point. The GP-UKF and GP-ADF use GPs to approximate f and g and exploit the properties of GPs to make tractable predictions. The Kitagawa and pendulum dynamics were used by Deisenroth et al. [15] to illustrate the performance of the GP-ADF and the very poor performance of the UKF. Deisenroth et al. [15] used the default settings of α = 1, β = 0, κ = 2 for all of the experiments; we use κ = 3 − D. We used exploration trade-off K = 2 for the GPO in all the experiments. Additionally, GPO used the squared-exponential covariance function with automatic relevance determination (SE-ARD),

k_x(θ_i, θ_j) = σ_0² exp(−½ (θ_i − θ_j)ᵀ M (θ_i − θ_j)),   (17)

M := diag(ℓ)⁻¹,   (18)

x := {σ_0², ℓ} ∈ (R⁺)^{|θ|+1} = (R⁺)⁴,   (19)

plus a noise term with standard deviation 0.01 nats per observation. We set the GPO to have a maximum of 100 function evaluations; even better results can be obtained by letting the optimizer run longer to hone the parameter estimate. We show that by learning appropriate values for θ we can match, if not exceed, the performance of the GP-ADF and other methods.

The models were evaluated on their one-step-ahead predictions. The evaluation metrics were the negative log-predictive likelihood (NLL), the mean squared error (MSE), and the mean absolute error (MAE) between the mean of the prediction and the true value. Note that unlike the NLL, the MSE and MAE do not account for uncertainty. The MAE will be more difficult for approximate methods than the MSE: for MSE, the optimal action is to predict the mean of the predictive distribution, while for MAE it is the median [18, Chapter 1]. Most approximate methods attempt to moment match to a Gaussian and preserve the mean; the median of the true predictive distribution is implicitly assumed to be the same as the mean. Quantitative results are shown in Table 1.

Table 1
Comparison of the methods on the sinusoidal, Kitagawa, and pendulum dynamics. The measures are supplied with 95% confidence intervals and a p-value from a one-sided t-test under the null hypothesis that UKF-L is the same as or worse than the other methods. NLL is reported in nats per observation, while MSE and MAE are in the units of y² and y, respectively. Since the observations in the pendulum data are angles, we projected the means and the data to the complex plane before computing MSE and MAE.

Sinusoid (T = 500 and R = 10)

Method    NLL (×10⁻¹)      p-Value    MSE (×10⁻²)      p-Value    MAE (×10⁻¹)      p-Value
UKF-D     −4.58 ± 0.168    <0.0001    2.32 ± 0.0901    <0.0001    1.22 ± 0.0253    <0.0001
UKF-L     −5.53 ± 0.243    N/A        1.92 ± 0.0799    N/A        1.09 ± 0.0236    N/A
EKF       −1.94 ± 0.355    <0.0001    3.03 ± 0.127     <0.0001    1.37 ± 0.0299    <0.0001
CKF       −2.07 ± 0.390    <0.0001    2.26 ± 0.100     <0.0001    1.17 ± 0.0260    <0.0001
GP-ADF    −4.13 ± 0.154    <0.0001    2.57 ± 0.0940    <0.0001    1.30 ± 0.0261    <0.0001
GP-UKF    −3.84 ± 0.175    <0.0001    2.65 ± 0.0985    <0.0001    1.32 ± 0.0266    <0.0001
TIM       −0.779 ± 0.238   <0.0001    4.52 ± 0.141     <0.0001    1.78 ± 0.0323    <0.0001

Kitagawa (T = 10 and R = 200)

Method    NLL (×10⁰)       p-Value    MSE (×10⁰)       p-Value    MAE (×10⁰)       p-Value
UKF-D     3.78 ± 0.662     <0.0001    5.42 ± 0.607     <0.0001    1.32 ± 0.0841    <0.0001
UKF-L     2.24 ± 0.369     N/A        3.60 ± 0.477     N/A        1.05 ± 0.0692    N/A
EKF       617 ± 554        0.0149     9.69 ± 0.977     <0.0001    1.75 ± 0.113     <0.0001
CKF       >1000            0.1083     5.21 ± 0.600     <0.0001    1.30 ± 0.0830    <0.0001
GP-ADF    2.93 ± 0.0143    0.0001     18.2 ± 0.332     <0.0001    4.10 ± 0.0522    <0.0001
GP-UKF    2.93 ± 0.0142    0.0001     18.1 ± 0.330     <0.0001    4.09 ± 0.0521    <0.0001
TIM       48.8 ± 2.25      <0.0001    37.2 ± 1.73      <0.0001    4.54 ± 0.179     <0.0001

Pendulum (T = 200 = 80 s and R = 100)

Method    NLL (×10⁰)       p-Value    MSE (×10⁻¹)      p-Value    MAE (×10⁻¹)      p-Value
UKF-D     2.611 ± 0.0750   <0.0001    5.03 ± 0.0760    <0.0001    10.6 ± 0.0940    <0.0001
UKF-L     0.392 ± 0.0277   N/A        1.93 ± 0.0378    N/A        6.14 ± 0.0577    N/A
EKF       0.660 ± 0.0429   <0.0001    1.98 ± 0.0429    0.0401     6.11 ± 0.0611    0.779
CKF       1.12 ± 0.0500    <0.0001    3.01 ± 0.0550    <0.0001    7.79 ± 0.0760    <0.0001
GP-ADF    1.18 ± 0.00681   <0.0001    4.34 ± 0.0449    <0.0001    10.3 ± 0.0589    <0.0001
GP-UKF    1.77 ± 0.0313    <0.0001    5.67 ± 0.0714    <0.0001    11.6 ± 0.0857    <0.0001
TIM       0.896 ± 0.0115   <0.0001    4.13 ± 0.0426    <0.0001    10.2 ± 0.0589    <0.0001

6.1. Sinusoidal dynamics

The models were trained on T = 1000 observations from the sinusoidal dynamics, and tested on R = 10 restarts with T = 500 points each. The initial state was sampled from a standard normal, x_1 ~ N(0, 1). The UKF optimizer found the optimal values α = 2.0216, β = 0.2434, and κ = 0.4871.

6.2. Kitagawa

The Kitagawa model has a tendency to stabilize around x = ±7, where it is linear. The challenging portion for filtering is away from the stable portions, where the dynamics are highly nonlinear. Deisenroth et al. [15] evaluated the model using R = 200 independent starts of the time series allowed to run only T = 1 time steps, which we find somewhat unrealistic. Therefore, we allow for T = 10 time steps with R = 200 independent starts. In this example, x1 ~ N(0, 0.5²). The learned values of the parameters were α = 0.3846, β = 1.2766, κ = 2.5830.
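The stabilization at x = ±7 can be checked directly from the transition function. The snippet below implements the standard Kitagawa transition f(x) = 0.5x + 25x/(1+x²) (process noise optional and off by default; the observation model is omitted since it is not restated here) and exposes the fixed points, since 0.5·7 + 25·7/50 = 3.5 + 3.5 = 7.

```python
import numpy as np

def kitagawa_step(x, noise_std=0.0, rng=None):
    """One step of the Kitagawa transition f(x) = 0.5 x + 25 x / (1 + x^2).

    With noise_std = 0 the map is deterministic; x = +/-7 are stable
    fixed points (|f'(+/-7)| = 0.02 < 1), so trajectories settle there
    and the filtering problem is only hard away from them.
    """
    w = noise_std * (rng or np.random.default_rng()).standard_normal() if noise_std else 0.0
    return 0.5 * x + 25.0 * x / (1.0 + x ** 2) + w
```

Iterating the noise-free map from almost any nonzero start converges to ±7 within a few dozen steps, which is why short runs started near the origin are the informative regime for comparing filters.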

6.3. Pendulum

The models were tested on R = 100 runs of length T = 200 each, with x1 ~ N([−π, 0]ᵀ, diag(0.1², 0.2²)). The initial state mean of [−π, 0]ᵀ corresponds to the pendulum being in the downward position. The models were trained on R = 5 runs of length T = 200. We found that multiple runs of the time series were needed during training in order to perform well on NLL, though not on MSE and MAE; otherwise, TIM had the best NLL. This is because if the time series is initialized in only one state, the model will not have a chance to learn the parameter settings needed to avoid rare, but still present, sigma point collapse in other parts of the state-space. A brief sigma point collapse in a long time series can give the models a worse NLL than even TIM, due to incredibly small likelihoods. The MSE and MAE losses are more bounded, so a short period of poor performance will be hidden by periods of good performance. Even when R = 1 during training, sigma point collapse is much rarer in UKF-L than in UKF-D. The UKF found optimal values of the parameters to be α = 0.5933, β = 0.1630, κ = 0.6391. This is further evidence that the correct θ is hard to prescribe a priori and must be learned empirically. We compare the predictions of the default and learned settings in Fig. 4.

[Fig. 4. Comparison of default and learned settings for one-step-ahead prediction of the first element of yt in the pendulum model. The red line is the truth, while the black line and shaded area represent the mean and 95% confidence interval of the predictive distribution. (a) Default θ, (b) learned θ. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)]

6.4. Analysis of sigma point collapse

We find that the marginal likelihood is extremely unstable in regions of θ that experience sigma point collapse. When sigma point collapse occurs, the predictive variances become far too small, making the marginal likelihood much more susceptible to noise. Hence, the marginal likelihood is smooth only near the optima, as seen in Fig. 2. As a diagnostic d for sigma point collapse, we look at the mean determinant |Σ| of the predictive distribution.
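This diagnostic takes only a few lines to compute. The helper below is a hypothetical sketch, not the paper's code: it averages the determinant of the predictive covariance over time, a quantity that plunges toward zero when the sigma points collapse.

```python
import numpy as np

def collapse_diagnostic(pred_covs):
    """Mean determinant of predictive covariances, d = mean over t of |Sigma_t|.

    Sigma point collapse drives the predictive variance toward zero, so
    a value of d many orders of magnitude below the typical state scale
    flags collapse even if it occurred only briefly.
    """
    return float(np.mean([np.linalg.det(S) for S in pred_covs]))
```

Because a single collapsed step shrinks one determinant by many orders of magnitude, the mean (rather than, say, the median) is the appropriate aggregate here: it remains sensitive to rare, short-lived collapses.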

6.5. Computational complexity

The UKF-L, UKF, and EKF have test set computational time O(DT(D² + M)). The GP-UKF and GP-ADF have complexity O(DT(D² + DMN²)), where N is the number of points used in training to learn f and g. This means that if N² ≫ D then UKF-L will have a much smaller computational complexity than the GP-ADF, which also attempts to avoid sigma point collapse. Typically it will take many more than D points to approximate a function in D dimensions, which implies that we will almost always have N² ≫ D. The D³ term in GP-ADF comes from the covariance calculations in the GP. This is usually not problematic, as D is typically small, unlike T. If a large number of training points N is needed to approximate f and g well, the GP-ADF and GP-UKF can become much slower than the UKF.

6.6. Discussion

The learned parameters of the UKF performed significantly better than the default UKF for all error measures and data sets. Likewise, it performed significantly better than all other methods except against the EKF on the pendulum data on MAE, where the two methods are essentially tied. We found that results could be improved further by averaging the predictions of the UKF-L and the EKF.

There exists another set of parameters rarely addressed in the UKF. The Cholesky factorization is typically used for √Σ; however, any matrix square root can be used. This means we can apply a rotation matrix to the Cholesky factor of Σ and still get a valid UKF, which gives us O(D²) extra degrees of freedom (parameters). However, the beauty of the UKF-L is that the number of parameters to learn is three regardless of D.
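These extra degrees of freedom are easy to verify numerically: any rotation of the Cholesky factor is still a valid matrix square root of Σ. A minimal sketch (variable names are ours):

```python
import numpy as np

# Any S with S @ S.T = Sigma places valid sigma points.  Rotate the
# Cholesky factor by a random orthogonal Q and check that the product
# is unchanged, illustrating the O(D^2) extra degrees of freedom.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
L = np.linalg.cholesky(Sigma)
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal matrix
S = L @ Q
assert np.allclose(S @ S.T, Sigma)  # still a valid square root of Sigma
```

Each choice of Q yields a different, equally valid sigma point placement, which is why restricting the learning problem to the three scalars {α, β, κ} keeps the parameter count independent of D.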



7. Conclusions

We have presented an automatic and model based mechanism to learn the parameters of a UKF, {α, β, κ}, in a principled way. The UKF can be reinterpreted as a generative process that performs inference on a slightly different NLDS than the one desired through the specification of f and g. We demonstrated how the UKF can fail arbitrarily badly in very nonlinear problems through sigma point collapse. Learning the parameters can make sigma point collapse less likely to occur. When the UKF learns the correct parameters from data, it can outperform other filters designed to avoid sigma point collapse, such as the GP-ADF, on common benchmark dynamical system problems.

Acknowledgements

We thank Steven Bottone, Zoubin Ghahramani, and Andrew Wilson for advice and feedback on this work.

References

[1] R.E. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME, J. Basic Eng. 82 (1960) 35–45.

[2] P.S. Maybeck, Stochastic Models, Estimation, and Control, Mathematics in Science and Engineering, vol. 141, Academic Press, Inc., 1979.

[3] S.J. Julier, J.K. Uhlmann, A new extension of the Kalman filter to nonlinear systems, in: Proceedings of AeroSense: 11th Symposium on Aerospace/Defense Sensing, Simulation and Controls, Orlando, FL, USA, 1997, pp. 182–193.

[4] I. Arasaratnam, S. Haykin, Cubature Kalman filters, IEEE Trans. Autom. Control 54 (6) (2009) 1254–1269.

[5] S.J. Julier, J.K. Uhlmann, Unscented filtering and nonlinear estimation, Proc. IEEE 92 (3) (2004) 401–422. doi:10.1109/JPROC.2003.823141.

[6] E.A. Wan, R. van der Merwe, The unscented Kalman filter for nonlinear estimation, in: Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control, IEEE, Lake Louise, AB, Canada, 2000, pp. 153–158.

[7] S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics, The MIT Press, Cambridge, MA, USA, 2005.

[8] S.J. Julier, J.K. Uhlmann, H.F. Durrant-Whyte, A new approach for filtering nonlinear systems, in: American Control Conference, vol. 3, IEEE, Seattle, WA, USA, 1995, pp. 1628–1632. doi:10.1109/ACC.1995.529783.

[9] E. Snelson, Z. Ghahramani, Compact approximations to Bayesian predictive distributions, in: Proceedings of the 22nd International Conference on Machine Learning, Omnipress, Bonn, Germany, 2005, pp. 840–847.

[10] M.A. Osborne, R. Garnett, S.J. Roberts, Gaussian processes for global optimization, in: 3rd International Conference on Learning and Intelligent Optimization (LION3), Trento, Italy, 2009.

[11] N. Srinivas, A. Krause, S. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: no regret and experimental design, in: J. Fürnkranz, T. Joachims (Eds.), Proceedings of the 27th International Conference on Machine Learning, Omnipress, Haifa, Israel, 2010, pp. 1015–1022.

[12] C.E. Rasmussen, Z. Ghahramani, Bayesian Monte Carlo, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems, vol. 15, The MIT Press, Cambridge, MA, USA, 2003, pp. 489–496.

[13] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, USA, 1998.

[14] R. Turner, M.P. Deisenroth, C.E. Rasmussen, State-space inference and learning with Gaussian processes, in: 13th International Conference on Artificial Intelligence and Statistics, vol. 9, Sardinia, Italy, 2010, pp. 868–875.

[15] M.P. Deisenroth, M.F. Huber, U.D. Hanebeck, Analytic moment-based Gaussian process filtering, in: Proceedings of the 26th International Conference on Machine Learning, Omnipress, Montreal, QC, Canada, 2009, pp. 225–232.

[16] G. Kitagawa, Monte Carlo filter and smoother for non-Gaussian nonlinear state space models, J. Comput. Graph. Stat. 5 (1) (1996) 1–25.

[17] M.P. Deisenroth, C.E. Rasmussen, J. Peters, Model-based reinforcement learning with continuous states and actions, in: Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008), Bruges, Belgium, 2008, pp. 19–24.

[18] C.M. Bishop, Pattern Recognition and Machine Learning, first ed., Springer, 2007.

Ryan Turner received his Bachelor of Science degree in Computer Engineering from the University of California, Santa Cruz, CA, USA in 2007 (highest honors). He has recently completed his Ph.D. in Machine Learning at the University of Cambridge in the Computational and Biological Learning Lab. His research interests are Gaussian processes, state-space models, and change point detection.

Carl Edward Rasmussen received his Masters degree in Engineering from the Technical University of Denmark and his Ph.D. degree in Computer Science from the University of Toronto in 1996.

Since 1996 he has been a post doc at the Technical University of Denmark, a senior research fellow at the Gatsby Computational Neuroscience Unit at University College London from 2000 to 2002, and a junior research group leader at the Max Planck Institute for Biological Cybernetics in Tübingen, Germany, from 2002 to 2007. He is a reader in the Computational and Biological Learning Lab at the Department of Engineering, University of Cambridge, UK. His main research interests are Bayesian inference and machine learning.