
Statistical foundations of machine learning

INFO-F-422

Gianluca Bontempi

Département d’Informatique

Boulevard de Triomphe - CP 212

http://www.ulb.ac.be/di

Approaches to parametric estimation

There are two main approaches to parametric estimation:

Classical or frequentist: it is based on the idea that sample data are the sole quantifiable form of relevant information and that the parameters are fixed but unknown. It is related to the frequency view of probability.

Bayesian approach: the parameters are supposed to be random variables, having a distribution prior to data observation and a distribution posterior to data observation. This approach assumes that there exists something beyond data (i.e. a subjective degree of belief), and that this belief can be described in probabilistic form.

Classical approach: some history

It dates back to the period 1920-35:

• J. Neyman and E.S. Pearson, stimulated by problems in biology and industry, concentrated on the principles for testing hypotheses.

• R.A. Fisher, who was interested in agricultural issues, gave attention to the estimation problem.

Parametric estimation

Consider a r.v. z. Suppose that

1. we do not completely know the distribution Fz(z), but we can write it in a parametric form

Fz(z) = Fz(z, θ)

where θ ∈ Θ is a parameter,

2. we have access to a set DN of N measurements of z, called sample data.

• Goal of the estimation procedure: to find a value θ̂ of the parameter θ so that the parametrized distribution Fz(z, θ̂) closely matches the distribution of the data.

• We assume that the N observations are the observed values of N i.i.d. random variables zi, each having a distribution identical to Fz(z, θ).

I.I.D. samples

• Consider a set of N i.i.d. random variables zi.

• I.I.D. means Identically and Independently Distributed.

• Identically distributed means that all the observations have been sampled from the same distribution, that is

Prob {zi = z} = Prob {zj = z} for all i, j = 1, . . . , N and z ∈ Z

• Independently distributed means that the fact that we have observed a certain value zi does not influence the probability of observing the value zj, that is

Prob {zj = z|zi = zi} = Prob {zj = z}

Some estimation problems

1. Let DN = {20, 31, 14, 11, 19, . . . } be the times in minutes spent over the last two weeks to go home. How long does it take on average to reach my house from ULB?

2. Consider the model of the traffic in the boulevard. Suppose that the measures of the inter-arrival times are DN = {10, 11, 1, 21, 2, . . . } seconds. What does this imply about the average inter-arrival time?

3. Consider the students of the last year of Computer Science. What is the variance of their grades?

Parametric estimation (II)

Parametric estimation is a mapping from the space of the sample data to the space of parameters Θ. There are two possible outcomes:

1. some specific value of Θ: in this case we have the so-called point estimation.

2. some particular region of Θ: in this case we obtain a confidence interval.

Point estimation

• Consider a random variable z with a parametric distribution Fz(z, θ), θ ∈ Θ.

• The parameter can be written as a function(al) of F

θ = t(F )

This corresponds to the fact that θ is a characteristic of the population described by Fz(·).

• Suppose we have a set of N observations DN = {z1, z2, . . . , zN}.

• Any function of the sample data DN is called a statistic. A point estimate is an example of a statistic.

• A point estimate is a function

θ̂ = g(DN)

of the sample dataset DN.

Methods of constructing estimators

Examples are:

• Plug-in principle

• Maximum likelihood

• Least squares

• Minimum Chi-Squared

Empirical distribution function

Suppose we have observed an i.i.d. random sample of size N from a distribution function Fz(z) of a continuous r.v. z

Fz → {z1, z2, . . . , zN}

where Fz(z) = Prob {z ≤ z}

Let N(z) be the number of samples in DN that do not exceed z. The empirical distribution function is

$$\hat{F}_z(z) = \frac{N(z)}{N} = \frac{\#\{z_i \le z\}}{N}$$

This function is a staircase function with discontinuities at the points zi.

TP R: empirical distribution

• Suppose that our dataset of age observations is made of the following N = 14 samples

DN = {20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29}

• Here is the empirical distribution function F̂z (cumdis.R):

[Figure: the empirical distribution function Fn(x) of these data, a staircase rising from 0 to 1 over the range 20-30.]
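A minimal R sketch of how such a plot can be obtained with the built-in ecdf() function (this is only an assumption of what cumdis.R does; the script itself is not reproduced here):

D <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)   # the N = 14 ages above
Fn <- ecdf(D)                     # empirical distribution function (a step function)
plot(Fn, main = "Empirical Distribution function", xlab = "x", ylab = "Fn(x)")
Fn(24)                            # fraction of observations <= 24, here 8/14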

Plug-in principle to define an estimator

• Consider a r.v. z and a sample dataset DN drawn from the parametric distribution Fz(z, θ).

• How to define an estimate of θ?

• A possible solution is given by the plug-in principle, that is a simple method of estimating parameters from samples.

• The plug-in estimate of a parameter θ is defined to be

$$\hat{\theta} = t(\hat{F}(z))$$

obtained by replacing the distribution function with the empirical distribution in the analytical expression of the parameter.

• The sample average is an example of plug-in estimate:

$$\hat{\mu} = \int z \, d\hat{F}(z) = \frac{1}{N}\sum_{i=1}^{N} z_i$$

Sample average

• Consider a r.v. z ∼ Fz(·) such that

$$\theta = E[z] = \int z \, dF(z)$$

with θ unknown.

• Suppose we have available the sample Fz → DN, made of N observations.

• The plug-in point estimate of θ is given by the sample average

$$\hat{\theta} = \frac{1}{N}\sum_{i=1}^{N} z_i = \hat{\mu}$$

which is indeed a statistic, i.e. a function of the data set.

Sample variance

• Consider a r.v. z ∼ Fz(·) where the mean µ and the variance σ² are unknown.

• Suppose we have available the sample Fz → DN .

• Once we have the sample average µ̂, the plug-in estimate of σ² is given by the sample variance

$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} (z_i - \hat{\mu})^2$$

• Note the presence of N − 1 instead of N in the denominator (it will be explained later).

• Note that the following relation holds for all zi:

$$\frac{1}{N}\sum_{i=1}^{N} (z_i - \hat{\mu})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} z_i^2\right) - \hat{\mu}^2$$
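A one-line numerical check of this identity in R (with arbitrary illustrative data):

z <- c(20, 31, 14, 11, 19)
all.equal(mean((z - mean(z))^2), mean(z^2) - mean(z)^2)   # TRUE: the two sides coincide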

Other plug-in estimators

Skewness estimator:

$$\hat{\gamma} = \frac{\frac{1}{N}\sum_{i=1}^{N}(z_i - \hat{\mu})^3}{\hat{\sigma}^3}$$

Upper critical point estimator:

$$\hat{z}_\alpha = \sup\{z : \hat{F}(z) \le 1 - \alpha\}$$

Sample correlation:

$$\hat{\rho}(x, y) = \frac{\sum_{i=1}^{N}(x_i - \hat{\mu}_x)(y_i - \hat{\mu}_y)}{\sqrt{\sum_{i=1}^{N}(x_i - \hat{\mu}_x)^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \hat{\mu}_y)^2}}$$
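These plug-in estimators are directly computable in R; a small sketch with hypothetical data (the names mu.hat, skew.hat and rho.hat are ours, not from the course scripts):

set.seed(0)
z <- rnorm(200, mean = 1, sd = 2)                  # hypothetical sample
mu.hat   <- mean(z)                                # plug-in estimate of the mean
skew.hat <- mean((z - mu.hat)^3) / sd(z)^3         # skewness estimator above
x <- rnorm(200); y <- x + rnorm(200)
rho.hat  <- cor(x, y)                              # sample correlation
c(mu.hat, skew.hat, rho.hat)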

Sampling distribution

• Given a dataset DN, we have a point estimate

θ̂ = g(DN)

which is a specific value.

• However, it is important to remark that DN is the outcome of the sampling of a r.v. z. As a consequence, DN can be considered as a realization of a random variable DN.

• Applying the transformation g to the random variable DN we obtain another random variable

θ̂ = g(DN)

which is called the point estimator of θ.

• The probability distribution of the r.v. θ̂ is called the sampling distribution.

Sampling distribution (diagram)

[Diagram: datasets DN(1), DN(2), DN(3), . . . drawn from the unknown r.v. distribution are each mapped to an estimate θ̂(1), θ̂(2), θ̂(3), . . . ; the distribution p(θ̂) of these estimates is the sampling distribution.]

See the R file sam_dis.R.
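A minimal simulation sketch of a sampling distribution (an assumption of what sam_dis.R might illustrate, not its actual content):

R <- 10000; N <- 20
mu.hat <- replicate(R, mean(rnorm(N, mean = 0, sd = 1)))   # R independent estimates of the mean
hist(mu.hat, main = "Sampling distribution of the sample average")
c(mean(mu.hat), var(mu.hat))    # close to mu = 0 and to sigma^2/N = 1/20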

Bias and variance

Estimates θ̂ are the first outputs of a data analysis. The next thing we want to know is the accuracy of θ̂. This leads us to the definitions of bias, variance and standard error of an estimator.

Definition 1. An estimator θ̂ of θ is said to be unbiased if and only if

$$E_{D_N}[\hat{\theta}] = \theta$$

Otherwise, it is called biased, with bias

$$\text{Bias}[\hat{\theta}] = E_{D_N}[\hat{\theta}] - \theta$$

Definition 2. The variance of an estimator θ̂ of θ is the variance of its sampling distribution

$$\text{Var}[\hat{\theta}] = E_{D_N}[(\hat{\theta} - E[\hat{\theta}])^2]$$

Some considerations

• An unbiased estimator is an estimator that takes on average the right value.

• Many unbiased estimators may exist for a parameter θ.

• If θ̂ is an unbiased estimator of θ, it may happen that f(θ̂) is a BIASED estimator of f(θ).

• A biased estimator with a known bias (not depending on θ) is equivalent to an unbiased estimator, since we can easily compensate for the bias.

• Given a r.v. z and the set DN, it can be shown that the sample average µ̂ and the sample variance σ̂² are unbiased estimators of the mean E[z] and the variance Var[z], respectively.

• In general σ̂ is not an unbiased estimator of σ even if σ̂² is an unbiased estimator of σ².

Bias and variance of µ̂

• Consider a random variable z ∼ Fz(·).

• Let µ and σ² be the mean and the variance of Fz(·), respectively.

• Suppose we have observed the i.i.d. sample DN ← Fz.

• The following relation holds:

$$E_{D_N}[\hat{\mu}] = E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N} z_i\right] = \frac{\sum_{i=1}^{N} E[z_i]}{N} = \frac{N\mu}{N} = \mu$$

• This means that the sample average estimator is not biased! This holds for whatever distribution Fz(·).

• Since Cov[zi, zj] = 0 for i ≠ j, the variance of the sample average estimator is

$$\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{N}\sum_{i=1}^{N} z_i\right] = \frac{1}{N^2}\text{Var}\left[\sum_{i=1}^{N} z_i\right] = \frac{1}{N^2}\, N\sigma^2 = \frac{\sigma^2}{N}$$

• See R script sam_dis.R.

Bias of σ̂²

Given an i.i.d. DN ← z:

$$E_{D_N}[\hat{\sigma}^2] = E_{D_N}\left[\frac{1}{N-1}\sum_{i=1}^{N}(z_i - \hat{\mu})^2\right] = \frac{N}{N-1}\, E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N}(z_i - \hat{\mu})^2\right] = \frac{N}{N-1}\, E_{D_N}\left[\left(\frac{1}{N}\sum_{i=1}^{N} z_i^2\right) - \hat{\mu}^2\right]$$

Since E[z²] = µ² + σ² and Cov[zi, zj] = 0, the 1st term inside the E[·] is

$$E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N} z_i^2\right] = \frac{1}{N}\, N(\mu^2 + \sigma^2) = \mu^2 + \sigma^2$$

Bias of σ̂² (II)

Since $E[(\sum_{i=1}^{N} z_i)^2] = N^2\mu^2 + N\sigma^2$, the 2nd term is

$$E_{D_N}[\hat{\mu}^2] = \frac{1}{N^2}\, E_{D_N}\left[\left(\sum_{i=1}^{N} z_i\right)^2\right] = \frac{1}{N^2}(N^2\mu^2 + N\sigma^2) = \mu^2 + \sigma^2/N$$

It follows that

$$E_{D_N}[\hat{\sigma}^2] = \frac{N}{N-1}\left((\mu^2 + \sigma^2) - (\mu^2 + \sigma^2/N)\right) = \frac{N}{N-1}\left(\frac{N-1}{N}\sigma^2\right) = \sigma^2$$

• The sample variance (with N − 1 in the denominator) is not biased!

• Question: is µ̂² an unbiased estimator of µ²?
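A Monte Carlo check of these two results (illustrative; not one of the course scripts):

set.seed(1)
N <- 10; sigma2 <- 4
est <- replicate(50000, {
  z <- rnorm(N, mean = 0, sd = sqrt(sigma2))
  c(var(z), mean((z - mean(z))^2))     # N-1 denominator vs N denominator
})
rowMeans(est)    # about 4 (unbiased) and about 4*(N-1)/N = 3.6 (biased)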


Important relations

$$E[(z - \mu)^2] = \sigma^2 = E[z^2 - 2\mu z + \mu^2] = E[z^2] - 2\mu E[z] + \mu^2 = E[z^2] - 2\mu^2 + \mu^2 = E[z^2] - \mu^2$$

For N = 2:

$$E[(z_1 + z_2)^2] = E[z_1^2] + E[z_2^2] + 2E[z_1 z_2] = 2E[z^2] + 2\mu^2 = 4\mu^2 + 2\sigma^2$$

For N = 3:

$$E[(z_1 + z_2 + z_3)^2] = E[z_1^2] + E[z_2^2] + E[z_3^2] + 2E[z_1 z_2] + 2E[z_1 z_3] + 2E[z_2 z_3] = 3E[z^2] + 6\mu^2 = 9\mu^2 + 3\sigma^2$$

In general, for N i.i.d. zi, $E[(z_1 + z_2 + \cdots + z_N)^2] = N^2\mu^2 + N\sigma^2$.

Considerations

• The results so far are independent of the form F(·) of the distribution.

• The variance of µ̂ is 1/N times the variance of z. This is a reason for collecting several samples: the larger N, the smaller Var[µ̂], so a bigger N means a better estimate of µ.

• According to the central limit theorem, under quite general conditions on the distribution Fz, the distribution of µ̂ will be approximately normal as N gets large, which we can write as

µ̂ ∼ N(µ, σ²/N) for N → ∞

• The standard error √Var[µ̂] is a common way of indicating statistical accuracy. Roughly speaking, we expect µ̂ to be less than one standard error away from µ about 68% of the time, and less than two standard errors away from µ about 95% of the time.

Bias/variance decomposition of MSE

When θ̂ is a biased estimator of θ, its accuracy is usually assessed by its mean-square error (MSE) rather than by its variance. The MSE is defined by

$$\text{MSE} = E_{D_N}[(\hat{\theta} - \theta)^2]$$

• The MSE of an unbiased estimator is its variance.

• For a generic estimator it can be shown that

$$\text{MSE} = (E_{D_N}[\hat{\theta}] - \theta)^2 + \text{Var}[\hat{\theta}] = \left[\text{Bias}[\hat{\theta}]\right]^2 + \text{Var}[\hat{\theta}]$$

i.e., the mean-square error is equal to the sum of the variance and the squared bias. This decomposition is typically called the bias-variance decomposition.

• See R script mse_bv.R.
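An illustrative Monte Carlo check of the decomposition (an assumption of the kind of experiment mse_bv.R runs, not its content):

set.seed(2)
theta <- 1; N <- 20
th.hat <- replicate(20000, mean(rnorm(N, mean = theta, sd = 2)))   # sampling distribution of the estimator
bias <- mean(th.hat) - theta
vari <- var(th.hat)
mse  <- mean((th.hat - theta)^2)
c(mse = mse, bias2.plus.var = bias^2 + vari)   # the two values nearly coincide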


Bias/variance decomposition of MSE (II)

$$\begin{aligned}
\text{MSE} &= E_{D_N}[(\hat{\theta} - \theta)^2] = E_{D_N}[(\hat{\theta} - E_{D_N}[\hat{\theta}] + E_{D_N}[\hat{\theta}] - \theta)^2] \\
&= E_{D_N}[(\hat{\theta} - E_{D_N}[\hat{\theta}])^2] + E_{D_N}[(E_{D_N}[\hat{\theta}] - \theta)^2] + E_{D_N}[2(\hat{\theta} - E_{D_N}[\hat{\theta}])(E_{D_N}[\hat{\theta}] - \theta)] \\
&= E_{D_N}[(\hat{\theta} - E_{D_N}[\hat{\theta}])^2] + (E_{D_N}[\hat{\theta}] - \theta)^2 + 2(E_{D_N}[\hat{\theta}] - \theta)(E_{D_N}[\hat{\theta}] - E_{D_N}[\hat{\theta}]) \\
&= (E_{D_N}[\hat{\theta}] - \theta)^2 + \text{Var}[\hat{\theta}] = \left[\text{Bias}[\hat{\theta}]\right]^2 + \text{Var}[\hat{\theta}]
\end{aligned}$$

where the cross term vanishes since $E_{D_N}[\hat{\theta}] - E_{D_N}[\hat{\theta}] = 0$.

TP: example

• Suppose z1, . . . , zN is a random sample of observations from a distribution with mean θ and variance σ².

• Study the unbiasedness of the three estimators of the mean µ:

$$\hat{\theta}_1 = \hat{\mu} = \frac{\sum_{i=1}^{N} z_i}{N} \qquad\qquad \hat{\theta}_2 = \frac{N\hat{\theta}_1}{N+1} \qquad\qquad \hat{\theta}_3 = z_1$$

Efficiency

Suppose we have two unbiased and consistent estimators. How do we choose between them?

Definition 3 (Relative efficiency). Let us consider two unbiased estimators θ̂1 and θ̂2. If

$$\text{Var}[\hat{\theta}_1] < \text{Var}[\hat{\theta}_2]$$

we say that θ̂1 is more efficient than θ̂2.

If the estimators are biased, the comparison is typically done on the basis of the mean square error.

Sampling distributions for Gaussian r.v.

Let z1, . . . , zN be i.i.d. N(µ, σ²) and let us consider the following sample statistics

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} z_i, \qquad SS = \sum_{i=1}^{N}(z_i - \hat{\mu})^2, \qquad \hat{\sigma}^2 = \frac{SS}{N-1}$$

It can be shown that the following relations hold:

1. $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$ and $N(\hat{\mu} - \mu)^2 \sim \sigma^2\chi^2_1$.

2. $z_i - \mu \sim \mathcal{N}(0, \sigma^2)$, so $\sum_{i=1}^{N}(z_i - \mu)^2 \sim \sigma^2\chi^2_N$.

3. $\sum_{i=1}^{N}(z_i - \mu)^2 = SS + N(\hat{\mu} - \mu)^2$.

4. $SS \sim \sigma^2\chi^2_{N-1}$, or equivalently $(N-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{N-1}$. See R script sam_dis2.R.

5. $\sqrt{N}(\hat{\mu} - \mu)/\hat{\sigma} \sim T_{N-1}$.

6. If $E[|z - \mu|^4] = \mu_4$ then $\text{Var}[\hat{\sigma}^2] = \frac{1}{N}\left(\mu_4 - \frac{N-3}{N-1}\sigma^4\right)$.
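An illustrative check of relation 4 by simulation (an assumption of what sam_dis2.R might show, not its content):

set.seed(3)
N <- 8; sigma <- 2
stat <- replicate(20000, (N - 1) * var(rnorm(N, 0, sigma)) / sigma^2)
hist(stat, freq = FALSE, breaks = 50)
curve(dchisq(x, df = N - 1), add = TRUE, col = "red")   # chi-square density with N-1 d.o.f.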

Likelihood

Let us consider

1. a density distribution pz(z, θ) which depends on a parameter θ

2. sample data DN = {z1, z2, . . . , zN} drawn independently from this distribution.

The joint probability density of the sample data is

$$p_{D_N}(D_N, \theta) = \prod_{i=1}^{N} p_z(z_i, \theta) = L_N(\theta)$$

where, for a fixed DN, LN(·) is a function of θ and is called the empirical likelihood of θ given DN.

Maximum likelihood

• The principle of maximum likelihood was first used by Lambert around 1760 and by D. Bernoulli about 13 years later. It was detailed by Fisher in 1920.

• Idea: given an unknown parameter θ and sample data DN, the maximum likelihood estimate θ̂ is the value for which the likelihood LN(θ) has a maximum:

$$\hat{\theta}_{ml} = \arg\max_{\theta \in \Theta} L_N(\theta)$$

• The estimator θ̂ml is called the maximum likelihood estimator (m.l.e.).

• It is usual to consider the log-likelihood lN(θ) since, log(·) being a monotone function, we have

$$\hat{\theta}_{ml} = \arg\max_{\theta \in \Theta} L_N(\theta) = \arg\max_{\theta \in \Theta} \log(L_N(\theta)) = \arg\max_{\theta \in \Theta} l_N(\theta)$$

Example: maximum likelihood

• Let us observe N = 10 realizations of a continuous variable z:

DN = {z1, . . . , z10} = {1.263, −0.326, 1.330, 1.272, 0.415, −1.540, −0.929, −0.295, −0.006, 2.405}

• Suppose that the probabilistic model underlying the data is Gaussian with an unknown mean µ and a known variance σ² = 1.

• The likelihood LN (µ) is a function of (only) the unknown parameter µ.

• By applying the maximum likelihood technique we have

$$\hat{\mu} = \arg\max_{\mu} L(\mu) = \arg\max_{\mu} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{(z_i - \mu)^2}{2\sigma^2}}$$

Example: maximum likelihood (II)

By plotting LN(µ) for µ ∈ [−2, 2] we have

[Figure: the likelihood L plotted against mu over [−2, 2], with a single maximum near 0.36.]

Then the most likely value of µ according to the data is µ̂ ≈ 0.358. Note that in this case µ̂ = (Σ zi)/N.

R script ml_norm.R.
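A sketch of this computation (an assumption of what ml_norm.R does; only the data above are taken from the slide):

DN <- c(1.263, -0.326, 1.330, 1.272, 0.415, -1.540, -0.929, -0.295, -0.006, 2.405)
mu.grid <- seq(-2, 2, by = 0.01)
L <- sapply(mu.grid, function(m) prod(dnorm(DN, mean = m, sd = 1)))   # likelihood on a grid
plot(mu.grid, L, type = "l", xlab = "mu", ylab = "L")
mu.grid[which.max(L)]    # about 0.36, i.e. mean(DN)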

Some considerations

• The likelihood measures the relative abilities of the various parameter values to explain the observed data.

• The principle of m.l. is that the value of the parameter under which the obtained data would have had the highest probability of arising must intuitively be our best estimate of θ.

• M.l. can be considered a measure of how plausible the parameter values are in light of the data.

• The likelihood function is a function of the parameter θ.

• According to the classical approach to estimation, since θ is not a r.v., the likelihood function is NOT the probability function of θ.

• LN(θ) is rather the probability of observing the dataset DN for a given θ.

• In other terms, the likelihood is the probability of the data given the parameter and not the probability of the parameter given the data.

Example: log-likelihood

Consider the previous example. The behaviour of the log-likelihood for this model is

[Figure: log(L) plotted against mu over [−2, 2], ranging from about −40 to −15, with its maximum at the same µ̂ as the likelihood.]

M.l. estimation

If we take a parametric approach, the analytical form of the log-likelihood lN(θ) is known. In many cases the function lN(θ) is well behaved, being continuous with a single maximum away from the extremes of the range of variation of θ.

Then θ̂ is obtained simply as the solution of

$$\frac{\partial l_N(\theta)}{\partial \theta} = 0$$

subject to

$$\left.\frac{\partial^2 l_N(\theta)}{\partial \theta^2}\right|_{\hat{\theta}_{ml}} < 0$$

to ensure that the identified stationary point is a maximum.

Gaussian case: ML estimators

• Let DN be a random sample from the r.v. z ∼ N(µ, σ²).

• The likelihood of the N samples is given by

$$L_N(\mu, \sigma^2) = \prod_{i=1}^{N} p_z(z_i, \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{(z_i - \mu)^2}{2\sigma^2}\right]$$

• The log-likelihood is

$$l_N(\mu, \sigma^2) = \log L_N(\mu, \sigma^2) = \log\left[\prod_{i=1}^{N} p_z(z_i, \mu, \sigma^2)\right] = \sum_{i=1}^{N} \log p_z(z_i, \mu, \sigma^2) = -\frac{\sum_{i=1}^{N}(z_i - \mu)^2}{2\sigma^2} + N \log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)$$

• Note that, for a given σ, maximizing the log-likelihood is equivalent to minimizing the sum of squared differences between the zi and the mean.

Gaussian case: ML estimators (II)

• Taking the derivatives with respect to µ and σ² and setting them equal to zero, we obtain

$$\hat{\mu}_{ml} = \frac{\sum_{i=1}^{N} z_i}{N} = \hat{\mu} \qquad\qquad \hat{\sigma}^2_{ml} = \frac{\sum_{i=1}^{N}(z_i - \hat{\mu}_{ml})^2}{N} \ne \hat{\sigma}^2$$

• Note that the m.l. estimator of the mean coincides with the sample average, but the m.l. estimator of the variance differs from the sample variance by its denominator.

TP

1. Let z ∼ U(0, M) and Fz → DN = {z1, . . . , zN}. Find the maximum likelihood estimator of M.

2. Let z have a Poisson distribution, i.e.

$$p_z(z, \lambda) = \frac{e^{-\lambda}\lambda^z}{z!}$$

If Fz(z, λ) → DN = {z1, . . . , zN}, find the m.l.e. of λ.

M.l. estimation with numerical methods

Computational difficulties may arise if

1. No explicit solution exists for ∂lN(θ)/∂θ = 0. Iterative numerical methods must be used (see Calcul numérique). This is particularly serious for a vector of parameters θ or when there are several relative maxima of lN.

2. lN(θ) may be discontinuous, or have a discontinuous first derivative, or a maximum at an extremal point.

TP: Numerical optimization in R

• Suppose we know the analytical form of a one-dimensional function f(x) : I → R.

• We want to find the value of x ∈ I that minimizes the function.

• If no analytical solution is available, numerical optimization methods can be applied (see course “Calcul numérique”).

• In the R language these methods are already implemented.

• Let f(x) = (x − 1/3)² and I = [0, 1]. The minimum is given by

f <- function(x, a) (x - a)^2                          # function to minimize, with parameter a
xmin <- optimize(f, c(0, 1), tol = 0.0001, a = 1/3)    # search over the interval [0, 1]
xmin                                                   # minimum close to 1/3

TP: Numerical max. of likelihood in R

• Let DN be a random sample from the r.v. z ∼ N(µ, σ²).

• The minus log-likelihood function of the N samples can be written in R by

eml <- function(m, D, var) {
  # minus log-likelihood of a N(m, var) model for the sample D
  N <- length(D)
  Lik <- 1
  for (i in 1:N)
    Lik <- Lik * dnorm(D[i], m, sqrt(var))   # product of the N Gaussian densities
  -log(Lik)
}

• The numerical minimization of −lN(µ, s²) for a given σ = s in the interval I = [−10, 10] can be written in R in this form

xmin <- optimize(eml, c(-10, 10), D = DN, var = s)

• Script emp_ml.R.
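An end-to-end sketch with illustrative values (emp_ml.R itself is not reproduced here):

set.seed(4)
DN <- rnorm(50, mean = 2, sd = 1)              # hypothetical sample
s  <- 1                                        # value passed as var, assumed known
xmin <- optimize(eml, c(-10, 10), D = DN, var = s)
xmin$minimum        # numerical m.l. estimate of mu, close to mean(DN)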

Properties of m.l. estimators

Under the (strong) assumption that the probabilistic model structure is known, the maximum likelihood technique features the following properties:

• θ̂ml is asymptotically unbiased but usually biased in small samples (e.g. σ̂²ml).

• The Cramer-Rao theorem establishes a lower bound to the variance of an estimator.

• θ̂ml is consistent.

• If θ̂ml is the m.l.e. of θ and γ(·) is a monotone function, then γ(θ̂ml) is the m.l.e. of γ(θ).

• If γ(·) is a nonlinear function, then even if θ̂ml is an unbiased estimator of θ, the m.l.e. γ(θ̂ml) of γ(θ) is usually biased.

• θ̂ml is asymptotically normally distributed around θ.

Interval estimation

• Unlike point estimation, which is based on a one-to-one mapping from the space of data to the space of parameters, interval estimation maps DN to an interval of Θ.

• An estimator is a function which, given a dataset DN generated from Fz(z, θ), returns an estimate of θ.

• An interval estimator is a transformation which, given a dataset DN, returns an interval estimate of θ.

• While an estimator is a random variable, an interval estimator is a random interval.

• Let $\underline{\theta}$ and $\bar{\theta}$ be its lower and upper bound, respectively.

• While an interval either contains a certain value or it does not, a random interval has a certain probability of containing a value.

Interval estimation (II)

• Suppose that our interval estimator satisfies

$$\text{Prob}\left\{\underline{\theta} \le \theta \le \bar{\theta}\right\} = 1 - \alpha, \qquad \alpha \in [0, 1]$$

then the random interval $[\underline{\theta}, \bar{\theta}]$ is called a 100(1 − α)% confidence interval of θ.

• Notice that θ is a fixed unknown value and that at each realization DN the interval either does or does not contain the true θ.

• If we repeat the procedure of sampling DN and constructing the confidence interval many times, then our confidence interval will contain the true θ at least 100(1 − α)% of the time (i.e. 95% of the time if α = 0.05).

• While an estimator is characterized by bias and variance, an interval estimator is characterized by its endpoints and confidence.

Upper critical points

Definition 4. The upper critical point of a continuous r.v. x is the smallest number xα such that

$$1 - \alpha = \text{Prob}\{x \le x_\alpha\} = F(x_\alpha) \iff x_\alpha = F^{-1}(1 - \alpha)$$

• We will denote by zα the upper critical points of the standard normal density.

$$1 - \alpha = \text{Prob}\{z \le z_\alpha\} = F(z_\alpha) = \Phi(z_\alpha)$$

• Note that the following relations hold:

$$z_\alpha = \Phi^{-1}(1 - \alpha), \qquad z_{1-\alpha} = -z_\alpha, \qquad 1 - \alpha = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z_\alpha} e^{-z^2/2}\, dz$$

• Here we list the most commonly used values of zα.

α:    0.1     0.075   0.05    0.025   0.01    0.005   0.001   0.0005
zα:   1.282   1.440   1.645   1.960   2.326   2.576   3.090   3.291

TP: Upper critical points in R

File stu_gau.R

> plot(dnorm(x), type="l")
> lines(dt(x, df=10), type="l", col="red")
> lines(dt(x, df=3), type="l", col="green")

> qnorm(0.05, lower.tail=F)
[1] 1.644854
> qt(0.05, lower.tail=F, df=10)
[1] 1.812461

Distribution              R function
Normal standard           qnorm(α, lower.tail=F)
Student with N d.o.f.     qt(α, df=N, lower.tail=F)

Confidence interval of µ

• Consider a random sample DN of a normal r.v. N(µ, σ²) and the estimator µ̂ ∼ N(µ, σ²/N).

• It follows that

$$\frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \sim \mathcal{N}(0, 1), \qquad \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{N}} \sim T_{N-1}$$

and consequently

$$\text{Prob}\left\{-z_{\alpha/2} \le \frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \le z_{\alpha/2}\right\} = 1 - \alpha$$

$$\text{Prob}\left\{\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right\} = 1 - \alpha$$

• $\underline{\theta} = \hat{\mu} - z_{\alpha/2}\,\sigma/\sqrt{N}$ is a lower 1 − α confidence bound for µ.

• $\bar{\theta} = \hat{\mu} + z_{\alpha/2}\,\sigma/\sqrt{N}$ is an upper 1 − α confidence bound for µ.

• By varying α we vary width and confidence of the interval.

Example: Confidence interval

• Let z ∼ N(µ, 0.0001) and DN = {10, 11, 12, 13, 14, 15}.

• We want to estimate the confidence interval of µ with level α = 0.1.

• We have µ̂ = 12.5 and, with σ = 0.01,

$$\delta_\alpha = z_{\alpha/2}\,\sigma/\sqrt{N} = 1.645 \cdot 0.01/\sqrt{6} \approx 0.0067$$

• The 90% confidence interval for the given DN is

$$\{\mu : |\hat{\mu} - \mu| \le \delta_\alpha\} = \{12.5 - 0.0067 \le \mu \le 12.5 + 0.0067\}$$
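A quick R check of this computation (illustrative):

DN <- c(10, 11, 12, 13, 14, 15)
delta <- qnorm(1 - 0.1/2) * 0.01 / sqrt(length(DN))   # z_{alpha/2} * sigma / sqrt(N)
mean(DN) + c(-1, 1) * delta                           # about (12.493, 12.507)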

TP R script confidence.R

• The user sets µ, σ, N, α and a number of iterations Niter.

• The script generates Niter times DN ∼ N(µ, σ²) and computes µ̂.

• The script returns the percentage of times that

$$\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} < \mu < \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}$$

• We can easily check that this percentage converges to 100(1 − α)% for Niter → ∞.

[Figure: the empirical coverage plotted against Niter, fluctuating around the nominal level 1 − α.]
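An illustrative sketch of such a coverage check (an assumption of what confidence.R does, not its content):

mu <- 0; sigma <- 1; N <- 20; alpha <- 0.1; Niter <- 10000
z <- qnorm(1 - alpha/2)
covered <- replicate(Niter, {
  DN <- rnorm(N, mu, sigma)
  abs(mean(DN) - mu) < z * sigma / sqrt(N)     # does the interval contain mu?
})
mean(covered)      # close to 1 - alpha = 0.9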

Example: Confidence region (II)

• Let z ∼ N(µ, σ²), with σ² unknown, and DN = {10, 11, 12, 13, 14, 15}.

• We want to estimate the confidence region of µ with level α = 0.1.

• We have µ̂ = 12.5, σ̂² = 3.5. Since

$$\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2/N}} \sim T_{N-1}$$

where TN−1 is a Student distribution with N − 1 degrees of freedom, we have

$$t_\alpha = t_{\{\alpha/2, N-1\}}\,\hat{\sigma}/\sqrt{N} = 2.015 \cdot 1.87/\sqrt{6} \approx 1.53$$

• The (1 − α) confidence interval of µ is

$$\hat{\mu} - t_\alpha < \mu < \hat{\mu} + t_\alpha$$
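A quick R check of this interval (illustrative):

DN <- c(10, 11, 12, 13, 14, 15)
t.alpha <- qt(1 - 0.1/2, df = length(DN) - 1) * sd(DN) / sqrt(length(DN))
mean(DN) + c(-1, 1) * t.alpha           # about (10.96, 14.04)
t.test(DN, conf.level = 0.9)$conf.int   # the same interval via t.test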
