[7] Point estimation


TRANSCRIPT

Page 1: [7] Point estimation

Point estimation

Page 2: [7] Point estimation

Agenda

• Maximum Likelihood Estimation (MLE)

• Method of Moments (MoM)

• Bayesian estimation

• Maximum a posteriori estimation (MAP)

• Posterior mean

Page 3: [7] Point estimation

Maximum Likelihood Estimation

Page 4: [7] Point estimation

Likelihood function

The likelihood of an unknown parameter θ given your data.

Page 5: [7] Point estimation

• The likelihood function is a function of θ

• It is not a probability density function

• It measures the “support” (i.e. likelihood) provided by the data for each possible value of the parameter.

Page 6: [7] Point estimation

MLE

• Find the parameter θ under which the observed data are “most likely”:

θ̂_MLE = arg max_θ L(θ)

• Maximizing the likelihood function is equivalent to maximizing the log-likelihood function (which is numerically more convenient)

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1

Page 8: [7] Point estimation

Example: coin tossing
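
Below is a minimal Python sketch (not part of the original slides) of MLE for coin tossing: we assume 100 tosses with 62 heads (made-up numbers) and maximize the Bernoulli log-likelihood numerically, then compare with the closed-form answer h/n.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed data for illustration: n tosses, h heads
n, h = 100, 62

def neg_log_likelihood(theta):
    # Bernoulli log-likelihood: h*log(theta) + (n - h)*log(1 - theta)
    return -(h * np.log(theta) + (n - h) * np.log(1 - theta))

# Maximize the log-likelihood by minimizing its negative over (0, 1)
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # numerical MLE
print(h / n)   # closed-form MLE for the Bernoulli parameter
```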

Page 9: [7] Point estimation

Gradient descent

https://iamtrask.github.io/2015/07/27/python-network-part2/
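
As a rough illustration of the idea behind the link above (our own sketch, not the slides' code), the same coin-toss likelihood can be maximized by gradient descent; lr below is the learning rate discussed on the next slide.

```python
import numpy as np

# Gradient descent on the coin-toss negative log-likelihood
n, h = 100, 62   # assumed data, as in the previous sketch
theta = 0.5      # initial guess
lr = 1e-3        # learning rate (step size): too large diverges, too small is slow

for _ in range(5000):
    grad = -h / theta + (n - h) / (1 - theta)      # d/dtheta of -log L(theta)
    theta -= lr * grad
    theta = float(np.clip(theta, 1e-6, 1 - 1e-6))  # keep theta inside (0, 1)

print(theta)  # approaches the MLE h/n = 0.62
```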

Page 10: [7] Point estimation

Learning rate

https://blog.csdn.net/leadai/article/details/78662036

Page 11: [7] Point estimation

Local optimums

https://iamtrask.github.io/2015/07/27/python-network-part2/

Page 12: [7] Point estimation

Method of Moments

Page 13: [7] Point estimation

Expectation

• If X is a discrete random variable with p.m.f. f_X(x), then E[X] = ∑_x x f_X(x)

• If X is a continuous random variable with p.d.f. f_X(x), then E[X] = ∫ x f_X(x) dx

• E[X^k] with k ≥ 1 are called moments

Page 14: [7] Point estimation

Law of large numbers

• If X_1, X_2, ⋯, X_n iid∼ f_X(x), then X̄ = (1/n) ∑_{i=1}^n X_i converges to E[X_i] (in probability) as n → ∞

• The LLN holds for every moment
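
A quick numerical illustration of the LLN (our own sketch, using an assumed Exponential(1) population whose mean is 1):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0, size=n)
    print(n, x.mean())  # the sample mean drifts toward E[X] = 1 as n grows
```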

Page 15: [7] Point estimation

Method of moments

• The values of the moments depend on the values of the unknown parameters θ (moment conditions)

• By the LLN, we can estimate the moments by sample means and then solve the moment conditions for θ

Page 16: [7] Point estimation

Example: normal distribution

Let X_1, X_2, ⋯, X_n iid∼ N(μ, σ²)

• Moment conditions:

(1/n) ∑_{i=1}^n X_i ≈ E[X_1] = μ

(1/n) ∑_{i=1}^n X_i² ≈ E[X_1²] = μ² + σ²

• By solving the above moment conditions, we have

μ̂ = (1/n) ∑_{i=1}^n X_i  and  σ̂² = (1/n) ∑_{i=1}^n X_i² − μ̂²
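
A short sketch of these estimators on simulated data (the true μ = 2 and σ = 3 are our own assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

mu_hat = x.mean()                          # first sample moment
sigma2_hat = (x ** 2).mean() - mu_hat**2   # second sample moment minus mu_hat^2
print(mu_hat, np.sqrt(sigma2_hat))         # close to the assumed 2 and 3
```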

Page 17: [7] Point estimation

Example: linear regression

Let Y_i ∼ N(μ(x_i), σ²) independently, with x_i = [x_{i1}, ⋯, x_{ip}]ᵀ ∈ ℝ^p and μ(x_i) = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}

• Moment conditions:

E[Y − μ(x)] = 0

E[x_j (Y − μ(x))] = 0, j = 1, ⋯, p

E[(Y − μ(x))²] = σ²

Page 18: [7] Point estimation

• Plugging the sample moments into the moment conditions, we obtain

(1/n) ∑_{i=1}^n (Y_i − (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip})) ≈ 0

(1/n) ∑_{i=1}^n x_{ij} (Y_i − (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip})) ≈ 0, j = 1, ⋯, p

(1/n) ∑_{i=1}^n (Y_i − (β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}))² ≈ σ²

Page 19: [7] Point estimation

Solving linear systems

The solution of a linear system Ax = b can be found (if it exists) by

• Gaussian elimination (numpy.linalg.solve)

• Minimizing ∥Ax − b∥² (numpy.linalg.lstsq)
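
For the regression moment conditions above, the first p + 1 equations are linear in β (they are the normal equations), so both routines apply. A sketch on simulated data (all parameter values below are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_solve = np.linalg.solve(X.T @ X, X.T @ y)     # Gaussian elimination on X'X b = X'y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizes ||Xb - y||^2 directly
print(beta_solve, beta_lstsq)                      # both recover beta_true closely
```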

Page 20: [7] Point estimation

Solving nonlinear systems

The solution of a nonlinear system F(x) = 0 can be found (if it exists) by various root-finding algorithms (scipy.optimize.root)

http://www.csie.ntnu.edu.tw/~u91029/RootFinding.html
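
A minimal scipy.optimize.root sketch (the system below is our own toy example, not from the slides):

```python
from scipy.optimize import root

def F(v):
    x, y = v
    return [x**2 + y**2 - 1.0,  # points on the unit circle
            x - y]              # intersected with the line x = y

sol = root(F, x0=[1.0, 0.0])    # the initial guess matters for nonlinear systems
print(sol.x)                    # approximately [0.7071, 0.7071]
```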

Page 21: [7] Point estimation

Newton’s method
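
The slide's figure is not reproduced here; the iteration itself is x_{k+1} = x_k − f(x_k)/f′(x_k). A minimal sketch:

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=100):
    """Newton's method for a scalar root of f."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)  # requires the derivative
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: root of x^2 - 2 from x0 = 1 converges to sqrt(2)
print(newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```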

Page 22: [7] Point estimation

Secant method
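
The secant method replaces the derivative in Newton's update with a finite-difference slope through the last two iterates; a minimal sketch:

```python
def secant(f, x0, x1, tol=1e-10, max_iter=100):
    """Secant method for a scalar root of f; no derivative needed."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)  # secant update
        x0, x1 = x1, x2
        if abs(x1 - x0) < tol:
            break
    return x1

# Same example as above: root of x^2 - 2, no derivative required
print(secant(lambda x: x**2 - 2, x0=1.0, x1=2.0))
```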

Page 23: [7] Point estimation

Pros

• Easy to compute and always works

• MoM is consistent; i.e., θ̂_MoM → θ as n → ∞

Page 24: [7] Point estimation

Cons

• MoM may not be unique: different moment conditions yield different results!

• Not the most efficient estimator (i.e., not the one achieving the minimum mean squared error, MSE)

• Sometimes MoM may be meaningless

Page 25: [7] Point estimation

MoM may be meaningless

• Suppose we observe 3, 5, 6, 18 from U(0, θ)

• Since E[X] = θ/2 for X ∼ U(0, θ), the MoM estimate of θ is

θ̂_MoM = 2X̄ = 2 × (3 + 5 + 6 + 18)/4 = 16

• This estimate is not acceptable since we have already observed an 18.

Page 26: [7] Point estimation

Extensions

• Generalized MoM

  • Generalized moment conditions

  • The number of conditions may exceed the number of parameters

• Method of simulated moments

  • Approximates the theoretical moments when they are not available

Page 27: [7] Point estimation

Bayesian estimation

Page 28: [7] Point estimation

Key concepts

• Since any estimate of θ (derived from some random sample) is random, we can treat θ itself as random and specify its probability distribution as π(θ)

• The distribution π(θ) is usually specified by a data scientist to express one's beliefs. Thus, we call it a “prior”.

Page 29: [7] Point estimation

Posterior

• By Bayes’ theorem, we can derive the “posterior” distribution of θ:

p(θ | X_1, ⋯, X_n) = f(X_1, ⋯, X_n | θ) π(θ) / f(X_1, ⋯, X_n)

= f(X_1, ⋯, X_n | θ) π(θ) / ∫ f(X_1, ⋯, X_n | θ) π(θ) dθ

∝ f(X_1, ⋯, X_n | θ) π(θ)

Page 30: [7] Point estimation

Maximum a posteriori estimation

• The posterior distribution can be interpreted as the conditional probability of θ given the observational data

• Thus, similar to MLE, we may use the mode of the posterior, since it is the most likely value of θ:

θ̂_MAP = arg max_θ f(X_1, ⋯, X_n | θ) π(θ)
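
Continuing the coin-toss example with an assumed Beta(2, 2) prior (our own choice, not from the slides), the MAP estimate can be computed numerically and checked against the known mode of the Beta posterior:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, h = 100, 62    # assumed data, as in the MLE sketch
a, b = 2.0, 2.0   # Beta(a, b) prior hyperparameters (an assumption)

def neg_log_posterior(theta):
    log_lik = h * np.log(theta) + (n - h) * np.log(1 - theta)
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)  # up to a constant
    return -(log_lik + log_prior)

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)                          # numerical MAP
print((h + a - 1) / (n + a + b - 2))  # closed-form mode of the Beta posterior
```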

Page 31: [7] Point estimation

Posterior mean

• The “mean” of the posterior is another frequently used Bayesian estimator:

θ̂ = E[θ | X_1, ⋯, X_n] = ∫ θ p(θ | X_1, ⋯, X_n) dθ

• The expectation is often approximated by Markov chain Monte Carlo (MCMC) methods, since the above integration is usually difficult

Page 32: [7] Point estimation

Example: normal distribution

Let X_1, X_2, ⋯, X_n iid∼ N(μ, σ²) and assume the prior μ ∼ N(μ_0, τ²). Then

p(μ | X_1, ⋯, X_n) ∝ (1/(√(2π) τ)) e^{−(1/2)((μ − μ_0)/τ)²} ∏_{i=1}^n (1/(√(2π) σ)) e^{−(1/2)((X_i − μ)/σ)²}

and the posterior mean is

μ̂ = (nτ²/(nτ² + σ²)) X̄ + (σ²/(nτ² + σ²)) μ_0
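
A small numerical check of this closed form (all parameter values below are assumptions for the demo), using a grid approximation of the posterior instead of MCMC:

```python
import numpy as np

rng = np.random.default_rng(3)
mu0, tau, sigma = 0.0, 2.0, 1.0                # assumed prior and noise parameters
x = rng.normal(loc=1.5, scale=sigma, size=50)  # data from an assumed true mu = 1.5
n, xbar = len(x), x.mean()

# Closed-form posterior mean from the slide
closed = (n * tau**2 * xbar + sigma**2 * mu0) / (n * tau**2 + sigma**2)

# Grid approximation: integrate mu against the unnormalized posterior
grid = np.linspace(-5, 5, 20_001)
log_post = (-0.5 * ((grid - mu0) / tau) ** 2
            - 0.5 * (((x[:, None] - grid) / sigma) ** 2).sum(axis=0))
post = np.exp(log_post - log_post.max())   # stabilize before exponentiating
numeric = (grid * post).sum() / post.sum()
print(closed, numeric)                     # the two should agree closely
```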

Page 33: [7] Point estimation

(My unfair) suggestions

• Use Bayesian estimation when you have a domain expert; otherwise, use MLE

• Use MoM only when you face computational issues, e.g.

  • The posterior (or likelihood function) is not convex

  • Big data

Page 34: [7] Point estimation

Homework: logistic regression

• Breast Cancer Wisconsin (Diagnostic) Data Set (also available in scikit-learn)

• Assume that Y_i ∈ {0, 1} ∼ Bernoulli(p(x_i)) independently, with

p(x_i) = exp[β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}] / (1 + exp[β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}])

Estimate the unknown coefficients θ = [β_0, β_1, ⋯, β_p]ᵀ by either MLE or MoM. Compare your results with the ones provided by scikit-learn (example) with a very large C.

Page 35: [7] Point estimation

• Moment conditions:

E[Y − p(x)] = 0

E[x_j (Y − p(x))] = 0, j = 1, ⋯, p

• Plugging the sample moments into the moment conditions, we obtain

(1/n) ∑_{i=1}^n (Y_i − p(x_i)) = 0

(1/n) ∑_{i=1}^n x_{ij} (Y_i − p(x_i)) = 0, j = 1, ⋯, p
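
One possible MLE solution sketch (standardizing the features and the optimizer choice are our own assumptions, not requirements of the homework):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)        # standardize for numerical stability
Xd = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column

def neg_log_likelihood(beta):
    z = Xd @ beta
    # log L = sum_i [y_i z_i - log(1 + exp(z_i))]; logaddexp avoids overflow
    return -(y * z - np.logaddexp(0.0, z)).sum()

res = minimize(neg_log_likelihood, x0=np.zeros(Xd.shape[1]), method="BFGS")

# scikit-learn with a very large C, i.e. essentially no regularization
clf = LogisticRegression(C=1e10, max_iter=10_000).fit(X, y)
print(res.x[:3])                         # MLE intercept and first two slopes
print(clf.intercept_, clf.coef_[0, :2])  # should roughly match
```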

Page 36: [7] Point estimation

Readings

• Chapters 10.1-10.3 and 12.1-12.2 of “All of Statistics”