Alternatives to Least-Squares
• Need a method that gives consistent estimates in the presence of coloured noise
• Generalized Least-Squares
• Instrumental Variable Method
• Maximum Likelihood Identification
Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 61
Generalized Least Squares
• Noise model
w(t) = e(t)/C(q−1)
where {e(t)} = N(0, σ) and C(q−1) is monic and of degree n.
• System described as
A(q−1)y(t) = B(q−1)u(t) + [1/C(q−1)]e(t)
Generalized Least Squares
• Defining filtered sequences
ȳ(t) = C(q−1)y(t)
ū(t) = C(q−1)u(t)
• System becomes
A(q−1)ȳ(t) = B(q−1)ū(t) + e(t)
Generalized Least Squares
• If C(q−1) is known, then least-squares gives consistent estimates of A and B, given ȳ and ū
• The problem, however, is that in practice C(q−1) is not known and {ȳ} and {ū} cannot be obtained.
• An iterative method proposed by Clarke (1967) solves this problem
Generalized Least Squares
1. Set C(q−1) = 1
2. Compute {ȳ(t)} and {ū(t)} for t = 1, . . . , N
3. Use the least-squares method to estimate A and B from ȳ and ū
4. Compute the residuals
w(t) = A(q−1)y(t)−B(q−1)u(t)
5. Use the least-squares method to estimate C from
C(q−1)w(t) = ε(t)
i.e.
w(t) = −c1w(t− 1)− c2w(t− 2)− · · ·+ ε(t)
where {ε(t)} is white
6. If converged, then stop, otherwise repeat from step 2.
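Steps 1-6 above can be sketched in Python (a hypothetical numpy illustration, not part of the original notes; `gls` and its arguments are invented names, and the polynomial orders are assumed known):

```python
import numpy as np

def gls(y, u, na, nb, nc, n_iter=5):
    """Iterative generalized least-squares (Clarke, 1967) -- a sketch.

    Alternates between (i) estimating A, B by ordinary least squares on
    data filtered by the current C(q^-1), and (ii) fitting C(q^-1) to
    the residuals w = A y - B u.
    """
    N = len(y)
    c = np.zeros(nc)                                  # step 1: C(q^-1) = 1
    for _ in range(n_iter):
        # step 2: filtered sequences  yf = C(q^-1) y,  uf = C(q^-1) u
        yf, uf = y.astype(float).copy(), u.astype(float).copy()
        for j in range(1, nc + 1):
            yf[j:] += c[j - 1] * y[:-j]
            uf[j:] += c[j - 1] * u[:-j]
        # step 3: least squares for A and B from the filtered data
        n0 = max(na, nb)
        Phi = np.column_stack(
            [-yf[n0 - i:N - i] for i in range(1, na + 1)]
            + [uf[n0 - i:N - i] for i in range(1, nb + 1)])
        theta = np.linalg.lstsq(Phi, yf[n0:], rcond=None)[0]
        a, b = theta[:na], theta[na:]
        # step 4: residuals  w = A(q^-1) y - B(q^-1) u  (unfiltered data)
        w = y.astype(float).copy()
        for i in range(1, na + 1):
            w[i:] += a[i - 1] * y[:-i]
        for i in range(1, nb + 1):
            w[i:] -= b[i - 1] * u[:-i]
        # step 5: fit C from  w(t) = -c1 w(t-1) - ... - cn w(t-n) + eps(t)
        Phic = np.column_stack([-w[nc - j:N - j] for j in range(1, nc + 1)])
        c = np.linalg.lstsq(Phic, w[nc:], rcond=None)[0]
    return a, b, c
```

A fixed iteration count stands in for the convergence test of step 6; in practice one would monitor the loss function or the whiteness of ε.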
Generalized Least Squares
• For convergence, the loss function and/or the whiteness of the residuals can be tested.
• An advantage of GLS is that not only may it give consistent estimates of the deterministic part of the system, but it also gives a representation of the noise that the LS method does not give.
• The consistency of GLS depends on the signal-to-noise ratio, the probability of consistent estimation increasing with the S/N ratio.
• There is, however, no guarantee of obtaining consistent estimates
Instrumental Variable Method (IV) 4
• As seen previously, the LS estimate
θ̂ = [XTX]−1XTY
is unbiased if W is independent of X.
• Assume that a matrix V is available, which is correlated with X but not with W and such that V TX is positive definite, i.e.
E[V TX] is nonsingular
E[V TW ] = 0
4 T. Soderstrom and P.G. Stoica, Instrumental Variable Methods for System Identification, Springer-Verlag, Berlin, 1983
Instrumental Variable Method (IV)
• Then,
V TY = V TXθ + V TW
and θ is estimated by
θ̂ = [V TX]−1V TY
V is called the instrumental variable matrix.
Instrumental Variable Method (IV)
• Ideally, V is the noise-free process output, and the IV estimate θ̂ is consistent.
• There are many possible ways to construct the instrumental variable. For instance, it may be built using an initial least-squares estimate:
A1(q−1)ŷ(t) = B1(q−1)u(t)
and the kth row of V is given by
vTk = [−ŷ(k − 1), . . . ,−ŷ(k − n), u(k − 1), . . . , u(k − n)]
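A minimal numeric sketch of this construction (a hypothetical Python/numpy illustration, not from the notes): ordinary LS is biased by coloured noise, while the IV estimate built from the noise-free auxiliary model output is not.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000
a, b = -0.7, 1.0                        # true parameters of A, B
u = rng.standard_normal(N)
e = rng.standard_normal(N)
v = e.copy()
v[1:] += 0.9 * e[:-1]                   # coloured (MA(1)) noise
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + b * u[t - 1] + 0.5 * v[t]

# Ordinary least squares on regressors [-y(t-1), u(t-1)]: biased here
X = np.column_stack([-y[:-1], u[:-1]])
Y = y[1:]
theta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Instruments from the noise-free auxiliary model A1 yh = B1 u,
# driven by the initial LS estimates (one possible construction)
yh = np.zeros(N)
for t in range(1, N):
    yh[t] = -theta_ls[0] * yh[t - 1] + theta_ls[1] * u[t - 1]
V = np.column_stack([-yh[:-1], u[:-1]])
theta_iv = np.linalg.solve(V.T @ X, V.T @ Y)
```

On such data `theta_iv` is typically much closer to (a, b) than `theta_ls`, whose first component carries an O(1) bias from the correlation between y(t − 1) and the coloured noise.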
Instrumental Variable Method (IV)
• Consistent estimation cannot be guaranteed in general.
• A two-tier method in its off-line version, the IV method is more useful in its recursive form.
• Use of instrumental variable in closed-loop.
– Often the instrumental variable is constructed from the input sequence. This cannot be done in closed-loop, as the input is formed from old outputs, and hence is correlated with the noise w, unless w is white. In that situation, the following choices are available.
Instrumental Variable Method (IV)
• Instrumental variables in closed-loop
– Delayed inputs and outputs. If the noise w is assumed to be a moving average of order n, then choosing v(t) = x(t− d) with d > n gives an instrument uncorrelated with the noise. This, however, will only work with a time-varying regulator.
– Reference signals. Building the instruments from the setpoint will satisfy the noise independence condition. However, the setpoint must be a sufficiently rich signal for the estimates to converge.
– External signal. This in effect relates to the closed-loop identifiability condition (covered a bit later...). A typical external signal is a white noise independent of w(t).
Maximum-likelihood identification
The maximum-likelihood method considers the ARMAX model below, where u is the input, y the output and e is zero-mean white noise with standard deviation σ:
A(q−1)y(t) = B(q−1)u(t− k) + C(q−1)e(t)
where
A(q−1) = 1 + a1q−1 + · · ·+ anq−n
B(q−1) = b1q−1 + · · ·+ bnq−n
C(q−1) = 1 + c1q−1 + · · ·+ cnq−n
The parameters of A, B, C as well as σ are unknown.
Maximum-likelihood identification
Defining
θT = [ a1 · · · an b1 · · · bn c1 · · · cn ]
xT (t) = [ −y(t − 1) · · · −y(t − n) u(t − k − 1) · · · u(t − k − n) e(t − 1) · · · e(t − n) ]
the ARMAX model can be written as
y(t) = xT (t)θ + e(t)
Unfortunately, one cannot use the least-squares method on this model since the sequence e(t) is unknown.
In the case of known parameters, the past values of e(t) can be reconstructed exactly from the sequence:
ε(t) = [A(q−1)y(t)−B(q−1)u(t− k)]/C(q−1)
Maximum-likelihood identification
Defining the performance index below, the maximum-likelihood method is then summarized by the following steps.
V = (1/2) Σ_{t=1}^{N} ε2(t)
• Minimize V with respect to θ, using for instance a Newton-Raphson algorithm. Note that ε is linear in the parameters of A and B but not in those of C. We then have to use some iterative procedure
θi+1 = θi − αi[V ′′(θi)]−1V ′(θi)
• The initial estimate θ0 is usually obtained from a least-squares estimate
• Estimate the noise variance as
σ̂2 = (2/N)V (θ̂)
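The minimization step can be sketched as follows (a hypothetical Python/numpy illustration for a first-order ARMAX model; a Gauss-Newton approximation V ′′ ≈ Σ ψψT with ψ = ∂ε/∂θ and step length αi = 1 stands in for the full Newton-Raphson iteration, and `ml_armax` is an invented name):

```python
import numpy as np

def ml_armax(y, u, theta0, n_iter=20):
    """Gauss-Newton minimization of V = (1/2) sum eps^2(t) for the
    first-order ARMAX model (1 + a q^-1) y = b q^-1 u + (1 + c q^-1) e.

    eps and its gradient psi = d eps / d theta follow the recursions
    eps(t) = y(t) + a y(t-1) - b u(t-1) - c eps(t-1), etc.
    """
    theta = np.asarray(theta0, dtype=float)
    N = len(y)
    for _ in range(n_iter):
        a, b, c = theta
        eps = np.zeros(N)
        psi = np.zeros((N, 3))
        for t in range(1, N):
            eps[t] = y[t] + a * y[t - 1] - b * u[t - 1] - c * eps[t - 1]
            psi[t, 0] = y[t - 1] - c * psi[t - 1, 0]     # d eps / d a
            psi[t, 1] = -u[t - 1] - c * psi[t - 1, 1]    # d eps / d b
            psi[t, 2] = -eps[t - 1] - c * psi[t - 1, 2]  # d eps / d c
        # V' = psi^T eps, V'' ~ psi^T psi (Gauss-Newton step)
        theta = theta - np.linalg.solve(psi.T @ psi, psi.T @ eps)
    sigma2 = 2.0 * (0.5 * np.sum(eps ** 2)) / N  # sigma^2 = (2/N) V(theta)
    return theta, sigma2
```

In practice θ0 would come from a least-squares fit of A and B, with the C-parameters initialized at zero.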
Properties of the Maximum-likelihood Estimate (MLE)
• If the model order is sufficient, the MLE is consistent, i.e. θ̂ → θ as N →∞.
• The MLE is asymptotically normal with mean θ and standard deviation σθ.
• The MLE is asymptotically efficient, i.e. there is no other unbiased estimator giving a smaller σθ.
Properties of the Maximum-likelihood Estimate (MLE)
• The Cramér-Rao inequality says that there is a lower limit on the precision of an unbiased estimate, given by
cov θ̂ ≥ Mθ−1
where Mθ is the Fisher information matrix, Mθ = −E[(log L)θθ]. For the MLE
σ2θ = Mθ−1
i.e.
σ2θ = σ2Vθθ−1
If σ is estimated, then
σ2θ = (2V/N)Vθθ−1
Identification In Practice
1. Specify a model structure
2. Compute the best model in this structure
3. Evaluate the properties of this model
4. Test a new structure, go to step 1
5. Stop when satisfactory model is obtained
MATLAB System Identification Toolbox
• Most used package
• Graphical User Interface
• Automates all the steps
• Easy to use
• Familiarize yourself with it by running the examples
RECURSIVE IDENTIFICATION
• There are many situations when it is preferable to perform the identification on-line, such as in adaptive control.
• Identification methods then need to be implemented in a recursive fashion, i.e. the parameter estimate at time t should be computed as a function of the estimate at time t− 1 and of the incoming information at time t.
• Recursive least-squares
• Recursive instrumental variables
• Recursive extended least-squares and recursive maximum likelihood
Recursive Least-Squares (RLS)
We have seen that, with t observations available, the least-squares estimate is
θ̂(t) = [XT (t)X(t)]−1XT (t)Y (t)
with
Y T (t) = [ y(1) · · · y(t) ]
X(t) = [ xT (1) · · · xT (t) ]T
Assume one additional observation becomes available; the problem is then to find θ̂(t + 1) as a function of θ̂(t), y(t + 1) and u(t + 1).
Recursive Least-Squares (RLS)
Defining X(t + 1) and Y (t + 1) as
X(t + 1) = [ X(t) ; xT (t + 1) ]    Y (t + 1) = [ Y (t) ; y(t + 1) ]
(a new row appended to each), and defining P (t) and P (t + 1) as
P (t) = [XT (t)X(t)]−1    P (t + 1) = [XT (t + 1)X(t + 1)]−1
one can write
P (t + 1) = [XT (t)X(t) + x(t + 1)xT (t + 1)]−1
θ̂(t + 1) = P (t + 1)[XT (t)Y (t) + x(t + 1)y(t + 1)]
Matrix Inversion Lemma
Let A, D and [D−1 + CA−1B] be nonsingular square matrices. Then A + BDC is invertible and
(A + BDC)−1 = A−1 −A−1B(D−1 + CA−1B)−1CA−1
Proof. The simplest way to prove it is by direct multiplication:
(A + BDC)(A−1 −A−1B(D−1 + CA−1B)−1CA−1)
= I + BDCA−1 −B(D−1 + CA−1B)−1CA−1 −BDCA−1B(D−1 + CA−1B)−1CA−1
= I + BDCA−1 −BD(D−1 + CA−1B)(D−1 + CA−1B)−1CA−1
= I + BDCA−1 −BDCA−1
= I
Matrix Inversion Lemma
An alternative form, useful for deriving recursive least-squares, is obtained when B and C are n× 1 and 1× n (i.e. column and row vectors):
(A + BC)−1 = A−1 −A−1BCA−1/(1 + CA−1B)
Now, consider
P (t + 1) = [XT (t)X(t) + x(t + 1)xT (t + 1)]−1
and use the matrix-inversion lemma with
A = XT (t)X(t) B = x(t + 1) C = xT (t + 1)
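Before applying it to P (t + 1), the rank-one form of the lemma can be verified numerically (a quick sketch, not part of the notes):

```python
import numpy as np

# Numerical check of the rank-one matrix inversion lemma
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # almost surely nonsingular
B = rng.standard_normal((n, 1))                  # column vector (n x 1)
C = rng.standard_normal((1, n))                  # row vector (1 x n)
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ C)
rhs = Ainv - (Ainv @ B @ C @ Ainv) / (1.0 + C @ Ainv @ B)
```

Here `1 + C @ Ainv @ B` is a 1 × 1 matrix, so the division is a scalar rescaling of the rank-one correction term.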
Recursive Least-Squares (RLS)
Some simple matrix manipulations then give the recursive least-squares algorithm:
θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]
K(t + 1) = P (t)x(t + 1)/[1 + xT (t + 1)P (t)x(t + 1)]
P (t + 1) = P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[1 + xT (t + 1)P (t)x(t + 1)]
Note that K(t + 1) can also be expressed as
K(t + 1) = P (t + 1)x(t + 1)
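The three update equations above can be transcribed directly (a Python/numpy sketch; `rls_update` is an invented name):

```python
import numpy as np

def rls_update(theta, P, x, y):
    """One RLS step: gain, covariance update, parameter update."""
    x = x.reshape(-1, 1)
    denom = 1.0 + float(x.T @ P @ x)
    K = P @ x / denom                              # K(t+1)
    P = P - (P @ x @ x.T @ P) / denom              # P(t+1)
    theta = theta + (K * (y - float(x.T @ theta))).ravel()
    return theta, P
```

Initialized with θ(0) = 0 and P (0) = αI for large α, the recursion reproduces the batch estimate to within the (negligible) effect of the initialization.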
Recursive Least-Squares (RLS)
• The recursive least-squares algorithm is the exact mathematical equivalent of the batch least-squares.
• Once initialized, no matrix inversion is needed
• Matrices stay the same size all the time
• Computationally very efficient
• P is proportional to the covariance matrix of the estimate, and is thus called the covariance matrix.
• The algorithm has to be initialized with θ̂(0) and P (0). Generally, P (0) is initialized as αI where I is the identity matrix and α is a large positive number. The larger α, the less confidence is put in the initial estimate θ̂(0).
RLS and Kalman Filter
There are some very strong connections between the recursive least-squares algorithm and the Kalman filter. Indeed, the RLS algorithm has the structure of a Kalman filter:
θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]
(new estimate) = (old estimate) + (correction)
where K(t + 1) is the Kalman gain.
RLS
The following MATLAB code is a straightforward implementation of the RLS algorithm:
function [thetaest,P]=rls(y,x,thetaest,P)
% RLS
% y,x: current measurement and regressor
% thetaest, P: parameter estimates and covariance matrix
K= P*x/(1+x'*P*x); % Gain
P= P- (P*x*x'*P)/(1+x'*P*x); % Covariance matrix update
thetaest= thetaest +K*(y-x'*thetaest); % Parameter estimate update
% end
Recursive Extended Least-Squares and Recursive Maximum-Likelihood
Because the prediction error is not linear in the C-parameters, it is not possible to derive an exact recursive maximum-likelihood method as was done for least squares.
The ARMAX model
A(q−1)y(t) = B(q−1)u(t) + C(q−1)e(t)
can be written as
y(t) = xT (t)θ + e(t)
with
θ = [a1, . . . , an, b1, . . . , bn, c1, . . . , cn]T
x(t) = [−y(t− 1), . . . ,−y(t− n), u(t− 1), . . . , u(t− n), e(t− 1), . . . , e(t− n)]T
Recursive Extended Least-Squares and Approximate Maximum-Likelihood
• If e(t) were known, RLS could be used to estimate θ; however, it is unknown and thus has to be estimated.
• This can be done in two ways, using either the prediction error or the residual.
• The first case corresponds to the RELS method, the second to the AML method.
Recursive Extended Least-Squares and Approximate Maximum-Likelihood
• The one-step-ahead prediction error is defined as
ε(t) = y(t)− ŷ(t | t− 1) = y(t)− xT (t)θ̂(t− 1)
with x(t) = [−y(t− 1), . . . , u(t− 1), . . . , ε(t− 1), . . . , ε(t− n)]T
• The residual is defined as
η(t) = y(t)− ŷ(t | t) = y(t)− xT (t)θ̂(t)
with x(t) = [−y(t− 1), . . . , u(t− 1), . . . , η(t− 1), . . . , η(t− n)]T
Recursive Extended Least-Squares and Approximate Maximum-Likelihood
• Sometimes ε(t) and η(t) are also referred to as the a-priori and a-posteriori prediction errors.
• Because it uses the latest estimate θ̂(t), as opposed to θ̂(t− 1) for ε(t), η(t) is a better estimate, especially in transient behaviour.
• Note however that if θ̂(t) converges as t −→∞, then η(t) −→ ε(t).
Recursive Extended Least-Squares and Approximate Maximum-Likelihood
The two schemes are then described by
θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]
K(t + 1) = P (t)x(t + 1)/[1 + xT (t + 1)P (t)x(t + 1)]
P (t + 1) = P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[1 + xT (t + 1)P (t)x(t + 1)]
but differ in their definition of x(t).
Recursive Extended Least-Squares and Approximate Maximum-Likelihood
• The RELS algorithm uses the prediction error. This algorithm is called RELS, Extended Matrix or RML1 in the literature. It has generally good convergence properties, and has been proved consistent for moving-average and first-order autoregressive processes. However, counterexamples to general convergence exist, see for example Ljung (1975).
• The AML algorithm uses the residual. The AML has better convergence properties than the RML, and indeed convergence can be proven under rather unrestrictive conditions.
Recursive Maximum-Likelihood
• Yet another approach.
• The ML can also be interpreted in terms of data filtering. Consider the performance index:
V (t) = (1/2) Σ_{i=1}^{t} ε2(i)
with ε(t) = y(t)− xT (t)θ̂(t− 1)
Recursive Maximum-Likelihood
• Because V is a nonlinear function of C, it has to be approximated by a Taylor series truncated after the second term. The resulting scheme is then:
θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]
K(t + 1) = P (t)xf (t + 1)/[1 + xTf (t + 1)P (t)xf (t + 1)]
P (t + 1) = P (t)− P (t)xf (t + 1)xTf (t + 1)P (t)/[1 + xTf (t + 1)P (t)xf (t + 1)]
where xf denotes the regressor vector filtered through 1/C(q−1).
Properties of AML and RML
Definition 1. A discrete transfer function H is said to be strictly positive real if it is stable and, on the unit circle,
Re H(ejω) > 0 ∀ω, −π < ω ≤ π
This condition can be checked by replacing z by (1 + jw)/(1− jw) and extracting the real part of the resulting expression.
For the convergence of AML, the following theorem is then available.
Properties of AML and RML
Theorem. [Ljung & Soderstrom, 1983] Assume both process and model are described by ARMAX models, with model order ≥ process order. Then, if
1. {u(t)} is sufficiently rich
2. 1/C(q−1)− 1/2 is strictly positive real
θ̂(t) will converge such that
E[ε(t, θ̂)− e(t)]2 = 0
If model and process have the same order, this implies
θ̂(t) −→ θ as t −→∞
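The SPR condition of the theorem can also be checked numerically on a frequency grid, as an alternative to the bilinear-substitution check from the definition (a sketch; `is_spr` is an invented helper):

```python
import numpy as np

def is_spr(c, n_grid=4000):
    """Check Re{1/C(e^{jw}) - 1/2} > 0 over -pi <= w < pi, where
    C(q^-1) = 1 + c[0] q^-1 + ... + c[n-1] q^-n."""
    w = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    q = np.exp(-1j * w)              # q^-1 evaluated on the unit circle
    C = 1.0 + sum(ck * q ** (k + 1) for k, ck in enumerate(c))
    return bool(np.all((1.0 / C - 0.5).real > 0))
```

For a first-order C the condition reduces to |c1| < 1, so it holds for every stable monic C; for higher orders it can genuinely fail for a stable C, e.g. C(q−1) = 1 + 0.7q−1 + 0.5q−2 is stable but not SPR.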
A Unified Algorithm
Looking at all the previous algorithms, it is obvious that they all have the same form, with only different parameters. They can all be represented by a recursive prediction-error method (RPEM):
θ̂(t + 1) = θ̂(t) + K(t + 1)ε(t + 1)
K(t + 1) = P (t)z(t + 1)/[1 + xT (t + 1)P (t)z(t + 1)]
P (t + 1) = P (t)− P (t)z(t + 1)xT (t + 1)P (t)/[1 + xT (t + 1)P (t)z(t + 1)]
Tracking Time-Varying Parameters
All previous methods use the least-squares criterion
V (t) = (1/t) Σ_{i=1}^{t} [y(i)− xT (i)θ]2
and thus identify the average behaviour of the process. When the parameters are time-varying, it is desirable to base the identification on the most recent data rather than on the old data, which are no longer representative of the process. This can be achieved by exponential discounting of old data, using the criterion
V (t) = (1/t) Σ_{i=1}^{t} λt−i[y(i)− xT (i)θ]2
where 0 < λ ≤ 1 is called the forgetting factor.
Tracking Time-Varying Parameters
The new criterion can also be written
V (t) = λV (t− 1) + [y(t)− xT (t)θ]2
Then, it can be shown (Goodwin and Payne, 1977) that the RLS scheme becomes
θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]
K(t + 1) = P (t)x(t + 1)/[λ + xT (t + 1)P (t)x(t + 1)]
P (t + 1) = (1/λ){P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[λ + xT (t + 1)P (t)x(t + 1)]}
In choosing λ, one has to compromise between fast tracking and long-term quality of the estimates. The use of forgetting may give rise to problems.
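The forgetting-factor recursion differs from plain RLS only in the λ in the denominator and the 1/λ rescaling of P (a Python/numpy sketch; `rls_forget` is an invented name):

```python
import numpy as np

def rls_forget(theta, P, x, y, lam=0.95):
    """One RLS step with exponential forgetting factor lam."""
    x = x.reshape(-1, 1)
    denom = lam + float(x.T @ P @ x)
    K = P @ x / denom                              # gain with lambda
    theta = theta + (K * (y - float(x.T @ theta))).ravel()
    P = (P - (P @ x @ x.T @ P) / denom) / lam      # rescaled covariance
    return theta, P
```

Run on data whose true parameters jump mid-stream, this recursion re-converges to the new values within a few "memory lengths" 1/(1 − λ), which plain RLS (λ = 1) cannot do.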
Tracking Time-Varying Parameters
The smaller λ is, the faster the algorithm can track, but the more the estimates will vary, even when the true parameters are time-invariant.
A small λ may also cause blowup of the covariance matrix P, since in the absence of excitation the covariance matrix update equation essentially becomes
P (t + 1) = (1/λ)P (t)
in which case P grows exponentially, leading to wild fluctuations in the parameter estimates.
One way around this is to vary the forgetting factor according to the prediction error ε, as in
λ(t) = 1− kε2(t)
Then, in case of low excitation, ε will be small and λ will be close to 1. In case of large prediction errors, λ will decrease.
Exponential Forgetting and Resetting Algorithm
The following scheme, due to Salgado, Goodwin and Middleton5, is recommended:
ε(t + 1) = y(t + 1)− xT (t + 1)θ̂(t)
θ̂(t + 1) = θ̂(t) + αP (t)x(t + 1)ε(t + 1)/[λ + xT (t + 1)P (t)x(t + 1)]
P (t + 1) = (1/λ){P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[λ + xT (t + 1)P (t)x(t + 1)]}+ βI − γP 2(t)
where I is the identity matrix, and α, β and γ are constants.
5 M.E. Salgado, G.C. Goodwin, and R.H. Middleton, “Exponential Forgetting and Resetting”, International Journal of Control, vol. 47, no. 2, pp. 477–485, 1988.
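A direct transcription of the EFRA equations into Python (a sketch using numpy, with the recommended constants as defaults; `efra_update` is an invented name):

```python
import numpy as np

def efra_update(theta, P, x, y, lam=0.95, alpha=0.5, beta=0.005, gamma=0.005):
    """One EFRA step: scaled-gain RLS with forgetting, plus the
    resetting terms beta*I (lower bound) and -gamma*P^2 (upper bound)."""
    x = x.reshape(-1, 1)
    eps = y - float(x.T @ theta)                  # prediction error
    denom = lam + float(x.T @ P @ x)
    theta = theta + (alpha * (P @ x) * eps / denom).ravel()
    P = (P - (P @ x @ x.T @ P) / denom) / lam \
        + beta * np.eye(len(theta)) - gamma * P @ P
    return theta, P
```

The βI term keeps P from collapsing under persistent excitation, while −γP 2 keeps it from blowing up when excitation is absent.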
Exponential Forgetting and Resetting Algorithm
With the EFRA, the covariance matrix is bounded on both sides:
σminI ≤ P (t) ≤ σmaxI ∀t
where
σmin ≈ β/(α− η)    σmax ≈ η/γ + β/η
with
η = (1− λ)/λ
With α = 0.5, β = γ = 0.005 and λ = 0.95, this gives σmin ≈ 0.01 and σmax ≈ 10.