Classic Estimation — users.isr.ist.utl.pt/~jsm/teaching/ec/2.pdf
TRANSCRIPT
© Jorge Salvador Marques, 2002

Classic Estimation
Summary
• Motivation
• Deterministic Methods
• Classic Probabilistic Methods
• Examples
Challenge

Given the data points shown in the figure, we wish to approximate them by a 2nd-order polynomial

y = c_2 x^2 + c_1 x + c_0

Problem: how to compute c_0, c_1, c_2?

[Figure: noisy data points, x ∈ [-1, 1], y ∈ [0, 3].]

Note: we assume that the x values are accurately known but the y values are corrupted by measurement noise.
Gauss (1777-1855)
Parabolic Fit

The absolute values of the errors must be small.

[Figure: fitted parabola; e_i is the error between each observation y_i and the curve.]

Quadratic cost: E = \sum_i e_i^2 = \langle e, e \rangle
Estimation of the Coefficients

What coefficients minimize E?

Stationarity condition:

\frac{\partial E}{\partial c_p} = 0 \;\Rightarrow\; \sum_i 2 e_i \frac{\partial e_i}{\partial c_p} = -2 \sum_i x_i^p (y_i - c_2 x_i^2 - c_1 x_i - c_0) = 0, \quad p = 0, 1, 2

In matrix form (the normal equations, with n data points):

\begin{bmatrix} \sum_i x_i^4 & \sum_i x_i^3 & \sum_i x_i^2 \\ \sum_i x_i^3 & \sum_i x_i^2 & \sum_i x_i \\ \sum_i x_i^2 & \sum_i x_i & n \end{bmatrix} \begin{bmatrix} c_2 \\ c_1 \\ c_0 \end{bmatrix} = \begin{bmatrix} \sum_i x_i^2 y_i \\ \sum_i x_i y_i \\ \sum_i y_i \end{bmatrix}

The matrix on the left contains the inner products between basis functions; the vector on the right contains the inner products between the observations and the basis functions.
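As a minimal sketch of these normal equations in code (the data, noise level, and true coefficients below are invented for illustration):

import numpy as np

# Invented example: noisy samples of a parabola (an assumption, not the slide's data).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2.0 * x**2 - 0.5 * x + 1.0 + 0.2 * rng.standard_normal(x.size)

# One row [x_i^2, x_i, 1] per data point.
Phi = np.column_stack([x**2, x, np.ones_like(x)])
A = Phi.T @ Phi   # inner products between basis functions
b = Phi.T @ y     # inner products between observations and basis functions
c2, c1, c0 = np.linalg.solve(A, b)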
Results

[Figure: estimated parabola plotted over the data, x ∈ [-1, 1], y ∈ [0, 3].]
Least Squares Method

Given a set of observations (x_i, y_i), we wish to approximate y_i by φ(x_i, θ), θ being an unknown vector of parameters. The error is

e_i = y_i − φ(x_i, θ)

How do we obtain θ?

Least squares estimate:

\hat{\theta} = \arg\min_\theta \sum_i \|e_i\|^2

where ||·|| is the Euclidean norm.

[Figure: the quadratic cost as a function of ||e||.]

Hypothesis: it is assumed that x_i is accurately known.
Linear Model

Model: y_i = φ(x_i)'θ + e_i

LS estimate: \hat{\theta} = A^{-1} b, with

A = \sum_{i=1}^{n} φ(x_i) φ(x_i)', \quad b = \sum_{i=1}^{n} φ(x_i) y_i
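A hedged sketch of this estimate for a generic basis φ (the function name and the row-per-observation data layout are my own assumptions):

import numpy as np

def ls_estimate(Phi, y):
    """theta_hat = A^{-1} b for the model y_i = phi(x_i)' theta + e_i.
    Phi stacks one row phi(x_i)' per observation (assumed layout)."""
    A = Phi.T @ Phi               # A = sum_i phi(x_i) phi(x_i)'
    b = Phi.T @ y                 # b = sum_i phi(x_i) y_i
    return np.linalg.solve(A, b)

In practice np.linalg.lstsq(Phi, y, rcond=None) computes the same estimate with better numerical conditioning, since it avoids forming A explicitly.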
Proof

E = \sum_i (y_i - φ(x_i)'θ)'(y_i - φ(x_i)'θ)

Computing the derivative with respect to θ and using the appropriate properties,

\frac{dE}{dθ} = 0 \;\Rightarrow\; -2 \sum_i φ(x_i)(y_i - φ(x_i)'θ) = 0 \;\Rightarrow\; \sum_i φ(x_i) y_i = \sum_i φ(x_i) φ(x_i)'\, θ

so \hat{\theta} = A^{-1} b is a stationary point.

If the matrix

\frac{d^2 E}{dθ^2} = 2 \sum_i φ(x_i) φ(x_i)'

is positive definite, E is minimized by \hat{\theta}.
Geometric Interpretation

The observations y_1, ..., y_n belong to a vector space E of dimension nm, where m is the dimension of each vector y_i.

The model sequence φ(x_1)'θ, ..., φ(x_n)'θ belongs to a subspace S ⊂ E of lower dimension.

The least squares method computes the coefficients of the orthogonal projection of y onto the subspace S, using the inner product

\langle x, y \rangle = \sum_i x_i' y_i

The projection ŷ is such that the projection error is orthogonal to all the vectors in S:

\langle e, b \rangle = 0, ∀ b ∈ S (orthogonality condition)
Geometric Interpretation

[Figure: y in E, its projection ŷ onto the space of signals generated by the model, and the error e = y − ŷ.]

If the signal space is a linear subspace S (linear model), ŷ is the orthogonal projection of y onto S.

Orthogonality condition: \langle e, b \rangle = 0, with b any vector of S.
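A quick numeric check of the orthogonality condition (random data invented for illustration; since the columns of Phi span S, orthogonality to every vector of S reduces to Phi' e = 0):

import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 3))               # columns span the subspace S
y = rng.standard_normal(50)
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
e = y - Phi @ theta                              # projection error
print(Phi.T @ e)                                 # ~ 0 up to round-off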
Learning vs Inference

The least squares method allows the solution of several problems.

[Diagram: training data → learning (model identification); new input → ? (inference).]

However, it does not provide an uncertainty measure!
Estimation

[Diagram: x → system(θ) → y.]

y depends on x and on a vector θ; x is known or unknown.

General problem: how to estimate x, θ, given y?

Subproblems:
• obtain x from y, θ (inference)
• obtain θ from y, x (learning, identification)

The learning problem is often based on observations obtained at different instants of time, or even in different experiments. The estimation of the unknown variable x is usually performed in each experiment.
Regression

Given (x_i, y_i), we wish to approximate y_i by f(x_i, θ), with θ unknown.

Least squares estimate:

\hat{\theta} = \arg\min_\theta \sum_i \|e_i\|^2, \quad e_i = y_i − f(x_i, θ)

Examples: linear combination of basis functions, neural networks.

[Figure: regression curve through (x, y) data.]

Hypothesis (asymmetric): x is accurately known.
Prediction

Given y = (y_1, ..., y_{t−1}), we wish to predict y_t using a linear combination of past observations:

y_t = a_1 y_{t−1} + ... + a_p y_{t−p} + e_t

The coefficients a_i are estimated by least squares.

[Figure: time series with the next value to be predicted marked "?".]
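A sketch of this least squares fit (the function name and indexing are my own; it assumes a 1-D signal y of length n > p):

import numpy as np

def fit_linear_predictor(y, p):
    """LS coefficients of y_t = a_1 y_{t-1} + ... + a_p y_{t-p} + e_t."""
    n = len(y)
    # The row for target y_t holds [y_{t-1}, ..., y_{t-p}], t = p, ..., n-1.
    Phi = np.column_stack([y[p - k: n - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(Phi, y[p:], rcond=None)
    return a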
Frequency Estimation

Given the model

y_i = A \cos(ω t_i + φ) + e_i

we wish to estimate A, ω, φ.

Available data: (t_i, y_i), i = 1, ..., N.

Idea: choose A, ω, φ minimizing E = \sum_i e_i^2

This is a nonlinear problem.
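Because E is not quadratic in (A, ω, φ), one simple (if crude) strategy is a grid search over the interval used on the next slide; the grid resolution here is an arbitrary choice of this sketch:

import numpy as np

def grid_search_sinusoid(t, y, n_grid=40):
    """Brute-force minimization of E = sum_i e_i^2 over a coarse grid."""
    best, best_E = None, np.inf
    for A in np.linspace(0, 5, n_grid):
        for w in np.linspace(0, 10, n_grid):
            for phi in np.linspace(0, 2 * np.pi, n_grid):
                E = np.sum((y - A * np.cos(w * t + phi)) ** 2)
                if E < best_E:
                    best, best_E = (A, w, phi), E
    return best

A local optimizer (e.g., Gauss-Newton) can then refine the best grid point.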
Estimation of a Sinusoid

[Figure: four panels comparing fits without noise and with noise, using 3 points and 7 points, t ∈ [0, 7].]

Optimization in the interval: (A, ω, φ) ∈ [0, 5] × [0, 10] × [0, 2π]
(white noise with σ = .5)
Restrictions

The least squares method has the following restrictions:

• It does not allow a statistical description of the error: nothing is known about the error we will obtain in the next experiment.
• It is not robust.
• It leads to difficult optimization problems when nonlinear models are used. In general, only local minima of E can be obtained.
Robustness

[Figure: a fitted curve pulled away from the data by spurious observations.]

LS is not robust (why? the quadratic cost lets a single spurious observation dominate E).

Alternatives:
• weighted LS (errors are weighted by confidence degrees)
• robust estimation methods
RANSAC

Procedure (generation of random hypotheses):
• choose a minimal set of data allowing the parameters to be estimated;
• compute the number of observations well approximated by the model (the support);
• repeat the previous steps N times and, at the end, choose the estimate with the largest support.

Refinement: the estimate is refined using the support observations.

[Figure: three candidate hypotheses with support = 2, support = 3, support = 5.]
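A minimal RANSAC sketch for a line model y = m x + c (the model, inlier threshold, and iteration count are assumptions chosen only to illustrate the procedure):

import numpy as np

def ransac_line(x, y, n_iter=100, tol=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    best_inliers = None
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)  # minimal set: 2 points
        if x[i] == x[j]:
            continue                                      # degenerate sample
        m = (y[j] - y[i]) / (x[j] - x[i])
        c = y[i] - m * x[i]
        inliers = np.abs(y - (m * x + c)) < tol           # support of this hypothesis
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refinement: LS fit on the support observations.
    return np.polyfit(x[best_inliers], y[best_inliers], 1)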
Robust Methods

Robust estimator:

\hat{x} = \arg\min_x \sum_i ρ(\|e_i\|)

[Figure: a robust cost ρ that grows more slowly than ||e||^2 for large errors.]

• requires iterative numeric optimization

Example: LMedS (least median of squares)
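LMedS needs a combinatorial search; as a simpler illustration of a robust cost ρ, here is an iteratively reweighted LS sketch with the Huber cost (my own choice of ρ, not the slide's method):

import numpy as np

def irls_huber(Phi, y, delta=1.0, n_iter=20):
    """Robust linear fit: rho is quadratic for small errors, linear for large ones."""
    theta = np.linalg.lstsq(Phi, y, rcond=None)[0]        # LS initialization
    for _ in range(n_iter):
        e = y - Phi @ theta
        # Huber weights: 1 inside [-delta, delta], decaying outside.
        w = np.where(np.abs(e) <= delta, 1.0, delta / np.maximum(np.abs(e), 1e-12))
        # Weighted normal equations: (Phi' W Phi) theta = Phi' W y.
        theta = np.linalg.solve(Phi.T @ (w[:, None] * Phi), Phi.T @ (w * y))
    return theta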
Exercises

1. Given a signal y = (y_1, ..., y_N), determine the coefficients of the linear predictor

\hat{y}_t = a_1 y_{t−1} + ... + a_p y_{t−p}, \quad N \gg p

using the least squares method.

2. Generalize the previous problem to the case in which we want to predict the signal k steps ahead.
Work

Consider two parabolas and observations belonging to each of them; however, we do not know which parabola each observation is close to. Estimate the coefficients of the parabolas using the least squares method and robust methods. Characterize the performance of both methods.
Classic Probabilistic Methods
Classic vs Bayesian Methods

[Diagram: x (unknown) → system → y; y is a realization of a random variable.]

Estimation methods:
• classic: LS, EM
• Bayesian: MAP, MSE

Classic methods consider x as a deterministic variable.
Bayesian methods consider x as a random variable characterized by an a posteriori distribution p(x/y).
Classic Estimation

• Classic estimation methods are not based on general principles.
• An estimator is a map θ(y) from the observation space into the parameter space.
• Each estimator is a random variable which can be statistically characterized. In general, we wish to define estimators with a set of desired properties (e.g., unbiased and with low variance).
Maximum Likelihood Method

Maximum likelihood estimate (ML) (Fisher, 1921):

\hat{x} = \arg\max_x L(x), \quad L(x) = p(y/x) \quad \text{(likelihood function)}

\hat{x} = \arg\max_x l(x), \quad l(x) = \log p(y/x) \quad \text{(log likelihood function)}

If y = y_1, ..., y_N is a set of independent and identically distributed (iid) observations, then

L(x) = \prod_i p(y_i/x), \quad l(x) = \sum_i \log p(y_i/x)
Normal Distribution

Problem: estimate the mean and covariance matrix of a normal distribution N(µ, R) from a set of n independent random variables y_1, ..., y_n.

Log likelihood function:

l(µ, R) = K − \frac{n}{2} \log |R| − \frac{1}{2} \sum_i (y_i − µ)' R^{-1} (y_i − µ)

where the quadratic term is the Mahalanobis distance. Optimizing l we get

\hat{µ} = \frac{1}{n} \sum_{i=1}^{n} y_i, \quad \hat{R} = \frac{1}{n} \sum_{i=1}^{n} (y_i − \hat{µ})(y_i − \hat{µ})'
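These closed-form estimates translate directly into code (the row-per-sample layout is an assumption of this sketch):

import numpy as np

def ml_gaussian(Y):
    """ML estimates of (mu, R) for N(mu, R); Y has one sample per row."""
    mu = Y.mean(axis=0)                 # mu_hat = (1/n) sum_i y_i
    D = Y - mu
    R = (D.T @ D) / Y.shape[0]          # note the 1/n factor (ML estimate)
    return mu, R

Note the 1/n factor: the ML covariance estimate is biased, unlike the 1/(n−1) sample covariance.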
Properties of ML Estimates

• Invariance to nonlinear transformations of the data: if z = f(y), then \hat{z}_{ML} = f(\hat{y}_{ML}).

In the sequel we consider that we observe a sequence of n iid variables such that the density p of each variable belongs to the family of functions adopted for the model, i.e., p = p_{x_0}, where x_0 is the correct value of the parameter x. Then:

• the ML estimator is consistent (\hat{x} converges to x_0 when n tends to infinity);
• it has an asymptotic normal distribution:

\sqrt{n}\,(\hat{x} − x_0) \xrightarrow{d} N(0, J^{-1})

where J is the Fisher information matrix:

J = E\left\{ \frac{dl}{dx} \left(\frac{dl}{dx}\right)' \right\}
Approximation

How does the ML method work if the model is wrong?

Proposition: Let y be a sequence of n iid variables with density p, and {p_x} a family of functions depending on x (the model). If the ML estimates converge to \hat{x} when n tends to infinity, then \hat{x} minimizes the Kullback-Leibler divergence between p and the model:

D[p, p_x] = \int p(y) \log \frac{p(y)}{p_x(y)} \, dy

The divergence is a non-symmetric distance.

[Figure: the density p and the closest model density p_x, at divergence D.]
ML vs LS

When the observations are normally distributed,

p(y/θ) = C e^{−E(θ)}

the maximum likelihood method is equivalent to the least squares method.
Polynomial Approximation

Data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

Model: y_i = φ(x_i)'θ + w_i, \quad φ(x_i) = [x_i^2 \; x_i \; 1]', \quad w_i \sim N(0, σ^2)

Log likelihood function:

l(θ) = C − \frac{1}{2σ^2} \sum_i [y_i − φ(x_i)'θ]^2 = C − E(θ)

where E is the least squares energy. The maximum likelihood estimate is equal to the least squares estimate:

\hat{θ} = A^{-1} b, \quad A = \sum_{i=1}^{n} φ(x_i) φ(x_i)', \quad b = \sum_{i=1}^{n} φ(x_i) y_i

Did we gain anything? Yes: \hat{θ} can now be statistically characterized.
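For instance, a standard characterization (stated here as an addition, under the slide's Gaussian model) is Cov{θ̂} = σ² A^{-1}:

import numpy as np

def theta_hat_covariance(Phi, sigma2):
    """Cov{theta_hat} = sigma^2 * A^{-1} for y_i = phi(x_i)' theta + w_i, w_i ~ N(0, sigma^2)."""
    return sigma2 * np.linalg.inv(Phi.T @ Phi)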
Example – Frequency Estimation

Observations: y_i = \cos(ω t_i) + w_i, \quad w_i \sim N(0, .01), \quad i = 1, ..., N, \quad ω = 1 rad/s

[Figure: the observed signal, and the energy E(ω) and likelihood function over ω ∈ [0, 10] for n = 1, n = 2, n = 3, and all observations; the likelihood concentrates around \hat{ω} = 1 rad/s as more observations are used.]
Binary Detection

A binary symbol θ ∈ {−A, A} was transmitted in an analog channel in base band. At the receiver, the observed signal is

y_i = θ + w_i, \quad i = 1, ..., n

We wish to estimate θ, assuming that the noise is uncorrelated and w_i \sim N(0, σ^2).

The log likelihood function is

l(θ) = C − \frac{1}{2σ^2} \sum_i (y_i − θ)^2 = C_1 + \frac{1}{σ^2}\, θ \sum_i y_i

(the θ² term is the same for both symbols, so it is absorbed into C_1). Maximizing l thus amounts to correlating the observations with the symbol. This is the expression of the matched filter.
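As a sketch, the ML decision reduces to the sign of the correlation Σ_i y_i (the function name is my own):

import numpy as np

def detect_symbol(y, A):
    """ML detection of theta in {-A, +A} from y_i = theta + w_i.
    Both symbols give the same theta^2 term, so maximizing l(theta)
    amounts to maximizing theta * sum(y): the matched filter."""
    return A if y.sum() >= 0 else -A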
Proof

Mean estimation:

\frac{\partial l}{\partial µ} = 0 \;\Rightarrow\; \frac{\partial}{\partial µ} \sum_i (y_i − µ)' R^{-1} (y_i − µ) = −2 \sum_i R^{-1} (y_i − µ) = 0

Multiplying by R, we get \hat{µ} = \frac{1}{n} \sum_i y_i.

Note: see the derivative rules presented in Lecture 1.

Covariance estimation:

\frac{\partial l}{\partial R} = 0 \;\Rightarrow\; −\frac{n}{2} \frac{\partial}{\partial R} \log |R| − \frac{1}{2} \frac{\partial}{\partial R} \sum_i (y_i − µ)' R^{-1} (y_i − µ) = 0

−n R^{-1} + \sum_i R^{-1} (y_i − µ)(y_i − µ)' R^{-1} = 0

Multiplying by R on the left and on the right,

\hat{R} = \frac{1}{n} \sum_i (y_i − µ)(y_i − µ)'
Characterization of Estimators

An estimator is a random variable which depends on x, y.

2nd order properties:

Bias:
B = x − E\{\hat{x}\} \quad (x deterministic)
B = E\{x\} − E\{\hat{x}\} \quad (x random)

Mean: µ = E\{\hat{x}\}

Covariance matrix: R = E\{(\hat{x} − µ)(\hat{x} − µ)'\}

Notes:
• Ideal goal: B = 0 and zero covariance. This goal is impossible in most problems.
• Expected values are computed using p(y/x) when x is deterministic, and with p(x,y) when x is random.
Cramér-Rao Bound

Can the covariance matrix of an estimator be the null matrix?

Cramér-Rao theorem (x deterministic):

Cov\{\hat{x}\} \ge J^{-1}

where J^{-1} is the Cramér-Rao bound and J is the Fisher information matrix

J = E\left\{ \frac{dl}{dx} \left(\frac{dl}{dx}\right)' \right\}, \quad l \text{ the log likelihood function}

Definition: A > B if and only if A − B is positive definite.

When Cov\{\hat{x}\} = J^{-1}, the estimator is called efficient.
Example

In practice it is not always easy to compute the Cramér-Rao bound. One case which can be easily addressed is the estimation of the mean vector of a normal distribution, given N observations y_1, ..., y_N:

Cov\{\hat{µ}\} \ge \frac{1}{N} R

Proof: since p(y/µ) = N(µ, R),

l(µ) = C − \frac{1}{2} \sum_i (y_i − µ)' R^{-1} (y_i − µ), \quad \frac{dl}{dµ} = \sum_i R^{-1} (y_i − µ)

J = E\left\{ \frac{dl}{dµ} \left(\frac{dl}{dµ}\right)' \right\} = \sum_i \sum_j R^{-1} E\{(y_i − µ)(y_j − µ)'\} R^{-1} = R^{-1} (N R) R^{-1} = N R^{-1}

CRB = J^{-1} = \frac{1}{N} R

Note: show that the ML estimator of µ is efficient.
Monte Carlo Method

Monte Carlo method: numeric evaluation of an estimator, based on a large number of estimation experiments and statistical analysis of the results.
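A sketch of such a study for the mean estimator of the previous example (the dimensions, R, and trial counts below are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
R = np.array([[1.0, 0.3],
              [0.3, 0.5]])
N, n_trials = 20, 10000

est = np.empty((n_trials, 2))
for k in range(n_trials):
    Y = rng.multivariate_normal(mu, R, size=N)
    est[k] = Y.mean(axis=0)             # mu_hat for one estimation experiment

print(np.cov(est, rowvar=False))        # empirical covariance of mu_hat
print(R / N)                            # Cramér-Rao bound: the two should match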
Exercises

1. Let x be a binary variable such that P(0) = P_0, where P_0 is an unknown parameter. Determine the ML estimate of P_0 from n independent realizations of x.

2. Determine the ML estimator of the mean and covariance matrix of a normal distribution N(µ, R), given n independent observations (see Appendix).

3. Given n samples of a random signal y described by an autoregressive model y_t = a y_{t−1} + b w_t, where w is a white process with distribution w_t ~ N(0, 1), determine the coefficients a, b by the ML method.

4. Compute the Cramér-Rao bound for the estimation of the parameter of a Rayleigh distribution, knowing N realizations y_1, ..., y_N.
Appendix – Matrices (I)

Operations:

Transpose: C = A', c_{ij} = a_{ji}
Sum: C = A + B, c_{ij} = a_{ij} + b_{ij}
Multiplication: C = AB, c_{ij} = \sum_k a_{ik} b_{kj}
Trace: c = tr(A) = \sum_i a_{ii}
Inversion: C = A^{-1}, A A^{-1} = A^{-1} A = I
Eigenvalues and eigenvectors: A v_i = λ_i v_i

Properties:
• a matrix is non-singular if all its eigenvalues are non-zero;
• the rank of a matrix is equal to the number of eigenvalues different from zero; a non-singular matrix has full rank;
• the eigenvalues of a symmetric matrix are real.
Appendix – Matrices (II)

• the trace of a matrix is equal to the sum of all its eigenvalues;
• the determinant of a matrix is equal to the product of all its eigenvalues;
• the inverse of a matrix has the following properties:

(AB)^{-1} = B^{-1} A^{-1}

Block inversion: with A [n1×n1], B [n1×n2], C [n2×n1], D [n2×n2],

\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} E & F \\ G & H \end{bmatrix}

E = (A − B D^{-1} C)^{-1}
F = −E B D^{-1}
G = −D^{-1} C E
H = D^{-1} + D^{-1} C E B D^{-1}
Derivatives

The derivative of a scalar function f(X) with respect to a matrix X is a matrix:

\left[ \frac{df}{dX} \right]_{ij} = \frac{df}{d[X]_{ij}}

If f = f(X, Y) and Y = y(X), then \frac{df}{dX} = \frac{\partial f}{\partial X} + \left(\frac{dY}{dX}\right)' \frac{\partial f}{\partial Y}

Properties (trace):

\frac{d}{dX} tr\{AX\} = A'
\frac{d}{dX} tr\{AX'\} = A
\frac{d}{dX} tr\{AXB\} = A'B'
\frac{d}{dX} tr\{AX'B\} = BA
\frac{d}{dX} tr\{X'X\} = 2X
\frac{d}{dX} tr\{X'AXB\} = AXB + A'XB'
\frac{d}{dX} tr\{AX^{-1}B\} = −(X^{-1}BAX^{-1})'

Properties (determinant):

\frac{d}{dX} |X| = |X| (X^{-1})'
\frac{d}{dX} \log|X| = (X^{-1})'
\frac{d}{dX} |AXB| = |AXB| (X^{-1})'
\frac{d}{dX} |X|^n = n|X|^n (X^{-1})'

Note: extracted from Sage & Melsa.
Bibliography

Duda, Hart, Stork, Pattern Classification, Wiley, 2001.
J. Marques, Reconhecimento de Padrões. Métodos Estatísticos e Neuronais, IST Press, 1999.