

Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


Generative vs Discriminative Approaches

Machine learning: learn a (random) function that maps a variable $X$ (feature) to a variable $Y$ (class) using a (labeled) dataset $M = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$.

Discriminative Approach: learn $P(Y|X)$.
Generative Approach: learn $P(Y, X) = P(Y|X)P(X)$.

[Figure: scatter plot of labeled data from Class 1 and Class 2 in the $(X_1, X_2)$ plane.]


Generative vs Discriminative Approaches

Discriminative Approach: finds a good fit for $P(Y|X)$ without explicitly modeling the generative process.
Example techniques: $k$-nearest neighbors, logistic regression, linear regression, SVMs, perceptrons, etc.
Example problem: two classes; separate the classes.

[Figure: the two-class dataset again in the $(X_1, X_2)$ plane, with Class 1 and Class 2 to be separated.]


Generative vs Discriminative Approaches

Generative Approach: finds a probabilistic model (a joint distribution $P(Y, X)$) that explicitly models the distribution of both the features and the corresponding labels (classes).

Example techniques: Naive Bayes, Hidden Markov Models, etc.

[Figure: left, the class posteriors $P(Y = 0|X)$ and $P(Y = 1|X)$; right, the feature distribution $P(X)$.]


Generative vs Discriminative Approaches

Generative Approach: finds parameters that explain all data.

Makes use of all the data.
Flexible framework; can incorporate many tasks (e.g. classification, regression, survival analysis, generating new data samples similar to the existing dataset, etc.).
Stronger modeling assumptions.

Discriminative Approach: finds parameters that help to predict the relevant data.

Learns to perform better on the given tasks.
Weaker modeling assumptions.
Less immune to overfitting.
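To make the contrast concrete, here is a minimal sketch (assuming NumPy and made-up 1-D data, both hypothetical) of the generative route: fit $P(X|Y)$ and $P(Y)$ by maximum likelihood, then recover $P(Y|X)$ via Bayes' rule. A discriminative method such as logistic regression would instead fit $P(Y|X)$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D training data: class 0 ~ N(-1, 1), class 1 ~ N(2, 1).
X = np.concatenate([rng.normal(-1.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])
Y = np.concatenate([np.zeros(60), np.ones(40)])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Generative approach: MLE of the class prior P(Y) and of each
# class-conditional Gaussian P(X|Y).
priors, mus, sigmas = {}, {}, {}
for c in (0, 1):
    Xc = X[Y == c]
    priors[c] = len(Xc) / len(X)  # MLE of P(Y = c)
    mus[c] = Xc.mean()            # MLE of the class-conditional mean
    sigmas[c] = Xc.std()          # MLE of the class-conditional std (1/n version)

def posterior_class1(x):
    """P(Y = 1 | X = x) via Bayes' rule from the fitted joint model."""
    joint = {c: priors[c] * gaussian_pdf(x, mus[c], sigmas[c]) for c in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

print(posterior_class1(0.5))
```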


Problem Setup

We are given a dataset $D = (X_i)_{i=1}^n$ with $n$ entries.

Example: the $X_i$'s are the annual incomes of $n$ individuals picked randomly from a large population.

Goal: estimate the probability distribution that describes the entire population from which the random samples $(X_i)_{i=1}^n$ are drawn.

[Figure: left, what we observe: random samples drawn from a distribution in the $(X_1, X_2)$ plane; right, what we want to estimate: the underlying probability density.]


Models and Likelihood Functions

Models: Parametric Families of Distributions

Key to make progress: restrict to a parametrized family of distributions!

The formalization becomes as follows: the dataset $D$ comprises independent and identically distributed (iid) samples from a distribution $P_\theta$ with a parameter $\theta$, i.e.

$$X_i \sim P_\theta, \qquad (X_1, X_2, \dots, X_n) \sim P_\theta^{\otimes n}.$$

The distribution $P_\theta$ belongs to the family $\mathcal{P}$, i.e.

$$\mathcal{P} = \{P_\theta : \theta \in \Theta\},$$

where $\Theta$ is a parameter space. Estimating the distribution $P_\theta \in \mathcal{P}$ reduces to estimating the parameter $\theta \in \Theta$!


Models and Likelihood Functions

The Likelihood Function

How is the family of models P related to the dataset D?

The likelihood function $L_n(\theta, D)$ is defined as

$$L_n(\theta, D) = \prod_{i=1}^n P_\theta(X_i).$$

Intuitively, $L_n(\theta, D)$ quantifies how compatible any choice of $\theta$ is with the occurrence of $D$.

Maximum Likelihood Estimator (MLE)

Given a dataset $D$ of size $n$ drawn from a distribution $P_\theta \in \mathcal{P}$, the MLE estimate of $\theta$ is defined as

$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} L_n(\theta, D).$$
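When no closed form is available, the MLE is typically found by numerically minimizing the negative log-likelihood. A minimal sketch, assuming SciPy and an exponential model $P_\theta(x) = \theta e^{-\theta x}$ chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=500)  # hypothetical data, true theta = 0.5

def neg_log_likelihood(params, x):
    # -log L_n(theta, D) for P_theta(x) = theta * exp(-theta * x)
    theta = params[0]
    return -(len(x) * np.log(theta) - theta * np.sum(x))

res = minimize(neg_log_likelihood, x0=np.array([1.0]), args=(data,),
               bounds=[(1e-9, None)])
print(res.x[0], 1.0 / data.mean())  # numerical MLE vs. the known closed form
```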


Why Maximum Likelihood Estimation?

Why is $\hat{\theta}_n^*$ a good estimator for $\theta$?

1. Consistency: the estimate $\hat{\theta}_n^*$ converges to $\theta$ in probability:
$$\hat{\theta}_n^* \xrightarrow{p} \theta.$$

2. Asymptotic normality: we can compute asymptotic confidence intervals:
$$\sqrt{n}\,(\hat{\theta}_n^* - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2(\theta)).$$

3. Asymptotic efficiency: the asymptotic variance of $\hat{\theta}_n^*$ is, in fact, equal to the Cramér-Rao lower bound for the variance of a consistent, asymptotically normally distributed estimator.

4. Invariance under re-parametrization: if $g(\cdot)$ is a continuous and continuously differentiable function, then the MLE of $g(\theta)$ is $g(\hat{\theta}_n^*)$.

See proofs in (Keener, Chapter 8).


The Gaussian Family

The Gaussian Family P and Parameter Space Θ

The dataset $D = (X_1, X_2, \dots, X_n)$ is drawn from a distribution $P_\theta^{\otimes n} = \prod_{i=1}^n P_\theta(X_i)$, where

$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(X - \mu)^2}{2\sigma^2} \right),$$

with $\theta = (\mu, \sigma)$.

The parameter space $\Theta$ is

$$\Theta = \left\{ (\mu, \sigma) : \mu \in \mathbb{R},\ \sigma \in \mathbb{R}^+ \right\}.$$

The family $\mathcal{P}$ is the family of Gaussian distributions given by

$$\mathcal{P} = \left\{ \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X - \mu)^2}{2\sigma^2}} : \mu \in \mathbb{R},\ \sigma \in \mathbb{R}^+ \right\}.$$


The Gaussian Family

The Gaussian Likelihood Function

The likelihood function is given by

$$L(\theta, D) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i - \mu)^2}{2\sigma^2}} = (\sqrt{2\pi}\,\sigma)^{-n}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2}$$

The MLE estimate $\hat{\theta}_n^* = (\hat{\mu}_n^*, \hat{\sigma}_n^*)$ is given by

$$(\hat{\mu}_n^*, \hat{\sigma}_n^*) = \arg\max_{\mu \in \mathbb{R},\, \sigma \in \mathbb{R}^+} (\sqrt{2\pi}\,\sigma)^{-n}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2}$$

It is usually more convenient to work with the log-likelihood function:

$$\log L(\theta, D) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2$$
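A direct transcription of this log-likelihood (a sketch assuming NumPy), handy for checking the closed-form maximizers derived next:

```python
import numpy as np

def gaussian_log_likelihood(mu, sigma, X):
    # log L(theta, D) = -n/2 * log(2*pi*sigma^2) - sum_i (X_i - mu)^2 / (2*sigma^2)
    n = len(X)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((X - mu) ** 2) / (2 * sigma**2)
```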


The MLE Estimator

Finding $\hat{\mu}_n^*$ and $\hat{\sigma}_n^*$ (I)

The $\log(\cdot)$ operation is monotonic, therefore

$$\arg\max_{\theta \in \Theta} \log L(\theta, D) = \arg\max_{\theta \in \Theta} L(\theta, D)$$

We can solve the optimization problem $\arg\max_\theta L(\theta, D)$ by taking the first derivative of $\log L(\theta, D)$ with respect to $\theta$ and equating it to zero (first-order condition), i.e.

$$\partial_\theta \log L(\theta, D) = 0.$$

What properties must $\log L(\theta, D)$ have for the above method to work?


Concavity and log-concavity!


The MLE Estimator

Finding $\hat{\mu}_n^*$ and $\hat{\sigma}_n^*$ (II)

Note that $\theta = (\mu, \sigma)$ is vector-valued; the first-order condition becomes

$$\nabla_\theta \log L(\theta, D) = 0 \;\rightarrow\; \left[ \frac{\partial \log L(\theta, D)}{\partial \mu} \;\; \frac{\partial \log L(\theta, D)}{\partial \sigma} \right]^T = 0$$

By taking the first derivative with respect to $\mu$ and $\sigma$, we have:

$$\frac{\partial}{\partial \mu} \left( -\frac{n \log(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^n (X_i - \mu)^2}{2\sigma^2} \right) = \frac{\sum_{i=1}^n (X_i - \mu)}{\sigma^2}$$

$$\frac{\partial}{\partial \sigma} \left( -\frac{n \log(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^n (X_i - \mu)^2}{2\sigma^2} \right) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^3}$$


The MLE Estimator

Finding $\hat{\mu}_n^*$ and $\hat{\sigma}_n^*$ (III)

The MLE estimators are:

Sample mean:

$$\frac{\sum_{i=1}^n (X_i - \mu)}{\sigma^2} = 0 \;\rightarrow\; \hat{\mu}_n^* = \frac{1}{n} \sum_{i=1}^n X_i$$

Sample variance:

$$-\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^3} = 0 \;\rightarrow\; (\hat{\sigma}_n^*)^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \hat{\mu}_n^*)^2$$
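A quick numerical check of these closed forms (a sketch assuming NumPy; note the MLE uses the $1/n$ variance, not the unbiased $1/(n-1)$ version):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=1000)  # hypothetical data

mu_hat = X.mean()                     # sample mean: the MLE of mu
var_hat = np.mean((X - mu_hat) ** 2)  # 1/n sample variance: the MLE of sigma^2
print(mu_hat, var_hat)                # should be close to 5 and 4
```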


The MLE Estimator

Finding $\hat{\mu}_n^*$ and $\hat{\sigma}_n^*$ (IV)

Exercise: try to derive the MLE estimator when $X$ follows a multivariate Gaussian distribution:

The dataset is $D = (X_1, \dots, X_n)$, where each $X_i$ is $M$-dimensional and $X_i \sim \mathcal{N}(\mu, \Sigma)$.

The parameter space is $\Theta = \{(\mu, \Sigma) : \mu \in \mathbb{R}^M,\ \Sigma \succeq 0\}$.

The multivariate Gaussian distribution is

$$P_\theta(X) = (2\pi)^{-\frac{M}{2}}\, |\Sigma|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right)$$
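For checking a derivation of this exercise numerically: the standard closed-form answer is the sample mean and the $1/n$ sample covariance. A sketch assuming NumPy, with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=2000)  # n x M data matrix

mu_hat = X.mean(axis=0)                     # MLE of mu
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # MLE of Sigma (1/n, not 1/(n-1))
print(mu_hat, Sigma_hat, sep="\n")
```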


Confidence Intervals

What is our confidence in the estimates $\hat{\mu}_n^*$ and $\hat{\sigma}_n^*$?

It depends on the sample size $n$: the larger $n$, the smaller the variances of $\hat{\mu}_n^*$ and $\hat{\sigma}_n^*$.

[Figure: realizations of $\hat{\mu}_n^*$ versus sample size $n$ for $\mu = 5$, $\sigma = \sqrt{3}$; around $n = 50$ the estimates show larger variance (less confidence), around $n = 250$ smaller variance (more confidence).]


Confidence Intervals

Confidence Sets

Point estimators provide no quantification of uncertainty → we need to introduce a measure of confidence in an estimate!

Confidence Sets

A $(1-\alpha)$ confidence set for a parameter $\theta$ is a subset of the parameter space, $\tilde{\Theta}(X_1, \dots, X_n) \subset \Theta$, such that

$$P(\theta \in \tilde{\Theta}) \geq 1 - \alpha.$$

Confidence intervals are one-dimensional confidence sets.

Because of the asymptotic normality of general MLE estimates, we can compute asymptotic confidence intervals.

Normality → compute confidence intervals.

Asymptotic normality → compute asymptotic confidence intervals.


Confidence Intervals

Example: Unknown Mean and Known Variance (I)

Assume we know $\sigma$ and want to estimate $\mu$ from $D$.

The MLE estimate for $\mu$ is $\hat{\mu}_n^* = \frac{1}{n} \sum_{i=1}^n X_i$.

We know that $\frac{\sqrt{n}}{\sigma}(\hat{\mu}_n^* - \mu) \sim \mathcal{N}(0, 1)$ (central limit theorem).

We want to compute the confidence interval $\tilde{\Theta} = [\hat{\mu}_n^* - \tilde{\mu},\ \hat{\mu}_n^* + \tilde{\mu}]$ (symmetric, since the normal distribution is symmetric).

Computing the Confidence Interval for $\mu$

Find $\gamma$ for which $P\left( \frac{\sqrt{n}}{\sigma} (\hat{\mu}_n^* - \mu) \geq \gamma \right) = \frac{\alpha}{2}$, giving $\gamma = Q^{-1}\left(\frac{\alpha}{2}\right)$, where $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^\infty \exp\left( -\frac{u^2}{2} \right) du$.

$$P\left( \frac{\sqrt{n}}{\sigma} |\hat{\mu}_n^* - \mu| \leq Q^{-1}\left(\frac{\alpha}{2}\right) \right) = 1 - \alpha \iff \tilde{\mu} = \frac{\sigma}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)$$
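A sketch of this computation, assuming SciPy (`scipy.stats.norm.isf` is the upper-tail inverse, i.e. $Q^{-1}$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma = 3.0                                     # known standard deviation
X = rng.normal(loc=5.0, scale=sigma, size=250)  # hypothetical data

alpha = 0.05
mu_hat = X.mean()
mu_tilde = sigma / np.sqrt(len(X)) * norm.isf(alpha / 2)  # Q^{-1}(alpha/2) ~ 1.96
print(mu_hat - mu_tilde, mu_hat + mu_tilde)     # the (1 - alpha) interval
```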


Confidence Intervals

Example: Unknown Mean and Known Variance (II)

The 95% confidence interval for the MLE mean estimate is $\tilde{\Theta} = \left[ \hat{\mu}_n^* - 1.96 \frac{\sigma}{\sqrt{n}},\ \hat{\mu}_n^* + 1.96 \frac{\sigma}{\sqrt{n}} \right]$.

[Figure: $\hat{\mu}_n^*$ with its 95% confidence interval versus sample size $n$, for $\mu = 5$, $\sigma = 3$; the interval gets narrower as the sample size increases.]


The Categorical Distribution

The Categorical Family P and the Parameter Space Θ

Each data point takes a value from a finite set: $X_i \in \{1, 2, \dots, r\}$. The probability that $X_i = j$ is given by $p_j \in [0, 1]$.

The parameter of a categorical distribution is $\theta = (p_1, \dots, p_r)$.

The parameter space is the simplex

$$\Theta = \left\{ (p_1, \dots, p_r) \in [0, 1]^r : \sum_{j=1}^r p_j = 1 \right\}.$$

The probability mass function of the dataset $D$ is

$$P_\theta(X_1, \dots, X_n) = \prod_{i=1}^n \prod_{j=1}^r p_j^{\mathbb{1}\{X_i = j\}} = p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}, \qquad n_j = \sum_{i=1}^n \mathbb{1}\{X_i = j\}.$$


The MLE Estimator

Finding $\hat{\theta}_n^*$ (I)

The log-likelihood function is $\log L(\theta, D) = \sum_{j=1}^r n_j \log(p_j)$.

The MLE estimate $\hat{\theta}_n^*$ is

$$\hat{\theta}_n^* = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r n_j \log(p_j) \quad \text{s.t.} \quad \sum_{j=1}^r p_j = 1.$$

Constrained optimization: not as easy as in the Gaussian case!


The MLE Estimator

Finding $\hat{\theta}_n^*$ (II): Method A

The Method of Lagrange Multipliers

Maximize $\sum_{j=1}^r n_j \log(p_j) - \lambda \left( \sum_{j=1}^r p_j - 1 \right)$ via the first-order condition:

$$\nabla_\theta \log L(\theta, D) = 0 \;\rightarrow\; p_{j,n}^* = \lambda^{-1} n_j.$$

Since $\sum_j n_j = n$ and $\sum_j p_{j,n}^* = 1$, we get $\lambda = n$, and

$$\hat{\theta}_n^* = \left[ \frac{n_1}{n} \;\cdots\; \frac{n_r}{n} \right]^T.$$

The MLE $\hat{\theta}_n^*$ is the empirical distribution function, which uniformly converges to the true probability mass function. This matches our expectations regarding the consistency of $\hat{\theta}_n^*$.
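The resulting estimator is just the vector of empirical frequencies. A sketch assuming NumPy, with hypothetical categories coded $1, \dots, r$:

```python
import numpy as np

rng = np.random.default_rng(5)
p_true = np.array([0.2, 0.5, 0.3])              # hypothetical true parameters
X = rng.choice([1, 2, 3], size=1000, p=p_true)  # dataset with r = 3

counts = np.array([(X == j).sum() for j in (1, 2, 3)])  # the n_j
theta_hat = counts / len(X)  # MLE: empirical distribution n_j / n
print(theta_hat)
```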


The MLE Estimator

Finding $\hat{\theta}_n^*$ (III): Method B

An Information-Theoretic Approach

We can reformulate the MLE optimization problem as follows:

$$\hat{\theta}_n^* = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r n_j \log(p_j) = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r \frac{n_j}{n} \log(p_j)$$

$$= \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r q_j \log\left( \frac{p_j}{q_j} \right) + \sum_{j=1}^r q_j \log(q_j), \qquad q_j = \frac{n_j}{n}$$

$$= \arg\max_{p_1, \dots, p_r} \underbrace{\sum_{j=1}^r q_j \log\left( \frac{p_j}{q_j} \right)}_{-D(q\,\|\,p)\ \text{(negative KL divergence)}} = \arg\min_{p} D(q\,\|\,p)$$

Since $D(q\,\|\,p) \geq 0$, the minimum is achieved at $p = q$, and we have $p_{j,n}^* = q_j = \frac{n_j}{n}$.
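A small numerical illustration of the last step (a sketch assuming NumPy): $D(q\,\|\,p)$ is zero at $p = q$ and positive elsewhere.

```python
import numpy as np

def kl(q, p):
    # D(q || p) = sum_j q_j * log(q_j / p_j)
    return np.sum(q * np.log(q / p))

q = np.array([0.6, 0.3, 0.1])            # empirical frequencies n_j / n
print(kl(q, q))                          # 0.0: minimized when p = q
print(kl(q, np.array([0.4, 0.4, 0.2])))  # > 0 for any other p
```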


The MLE Estimator

Confidence Intervals

Unlike the Gaussian case, here the MLE estimators are not normally distributed for any finite $n$.

For large $n$, we can construct asymptotic confidence intervals using asymptotic normality.

For arbitrary $n$, consider the case $r = 2$ (Bernoulli distribution), where $\theta = (p_1, p_2 = 1 - p_1)$. A conservative $(1-\alpha)$ confidence interval for the parameter $p_1$ is

$$\tilde{\Theta} = \left[ \frac{n_1}{n} - \frac{Q^{-1}(\alpha/2)}{2\sqrt{n}},\ \frac{n_1}{n} + \frac{Q^{-1}(\alpha/2)}{2\sqrt{n}} \right]$$

The derivation follows from the central limit theorem and bounding the standard deviation of a Bernoulli random variable by $\sqrt{p_1(1 - p_1)} \leq \frac{1}{2}$.
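A sketch of the conservative interval, assuming SciPy for $Q^{-1}$ and hypothetical data:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
X = rng.binomial(1, 0.3, size=400)  # hypothetical Bernoulli data, true p1 = 0.3

alpha = 0.05
p_hat = X.mean()  # n1 / n
half_width = norm.isf(alpha / 2) / (2 * np.sqrt(len(X)))  # uses sqrt(p(1-p)) <= 1/2
print(p_hat - half_width, p_hat + half_width)
```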


The Generative Model

Heterogeneous Populations

In many applications, the data is sampled from $K$ different populations with different parameters.

Example: Gaussian mixture with 3 components.

[Figure: probability density of a three-component Gaussian mixture over the $(X_1, X_2)$ plane.]


The Generative Model

Gaussian Mixture Models

A $K$-component (univariate) Gaussian mixture model has the following parameters:

The Gaussian parameters $\theta_k = (\mu_k, \sigma_k)$, $1 \leq k \leq K$.

The mixing proportions $\pi_k$, with $\sum_{k=1}^K \pi_k = 1$.

The dataset $D$ has the following structure:

$$D = ((X_1, Z_1), \dots, (X_n, Z_n)).$$

Each variable $X_i$ is drawn from the Gaussian distribution of its component: given $Z_i = k$,

$$X_i \sim \mathcal{N}(\mu_k, \sigma_k^2)$$

The variable $Z_i$ represents the membership of the $i$-th data point in one of the $K$ components, and is drawn from a categorical distribution:

$$Z_i \sim \text{Categorical}(\pi_1, \dots, \pi_K).$$
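This generative process translates directly into sampling code. A sketch assuming NumPy, with made-up (hypothetical) parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
pis = np.array([0.5, 0.3, 0.2])     # mixing proportions, sum to 1
mus = np.array([-3.0, 0.0, 4.0])    # component means
sigmas = np.array([1.0, 0.5, 1.5])  # component standard deviations

n = 1000
Z = rng.choice(len(pis), size=n, p=pis)  # Z_i ~ Categorical(pi_1, ..., pi_K)
X = rng.normal(mus[Z], sigmas[Z])        # X_i | Z_i = k  ~  N(mu_k, sigma_k^2)
```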


MLE for Mixture Models

Complete-Data and Marginal Likelihood Functions

Two possible scenarios:

If $Z_i$ is observed, the complete-data likelihood function is uni-modal:

$$L(\theta, D) = \prod_{i=1}^n \pi_{Z_i} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_{Z_i}}\, e^{-\frac{(X_i - \mu_{Z_i})^2}{2\sigma_{Z_i}^2}}$$

If $Z_i$ is latent, the marginal likelihood function is multi-modal:

$$L(\theta, D) = \prod_{i=1}^n \sum_{k=1}^K \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(X_i - \mu_k)^2}{2\sigma_k^2}}$$

Hard problem → approximate solution using the EM algorithm (next lecture)!
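Evaluating the (multi-modal) marginal log-likelihood for given parameters is straightforward even though maximizing it is hard. A sketch assuming SciPy; `logsumexp` sums the $K$ component terms stably in log space:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_marginal_log_likelihood(X, pis, mus, sigmas):
    # log L = sum_i log sum_k pi_k * N(X_i; mu_k, sigma_k^2)
    log_terms = np.log(pis) + norm.logpdf(X[:, None], mus, sigmas)  # shape n x K
    return logsumexp(log_terms, axis=1).sum()

# Hypothetical usage with the parameters from the sampling sketch above:
X = np.array([0.1, -2.5, 3.0])
print(gmm_marginal_log_likelihood(X, np.array([0.5, 0.3, 0.2]),
                                  np.array([-3.0, 0.0, 4.0]),
                                  np.array([1.0, 0.5, 1.5])))
```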


Maximum Likelihood is a Frequentist Method

A frequentist method...

1. Never uses or gives the probability of a hypothesis (no prior or posterior).

2. Depends on the likelihood for both observed and unobserved data.

3. Does not require a prior.

4. Tends to be less computationally intensive.

Frequentist measures such as p-values and confidence intervals have been widely used in research practice since the 20th century.


Bayesian Methods

On the other hand, a Bayesian method...

1. Assumes a prior: uses probabilities for both hypotheses and data.

2. Depends on the prior and the likelihood of observed data.

3. Requires one to know or construct a subjective prior.

4. May be computationally intensive due to integration over many parameters.

Many recent advances in Bayesian methods! Read about: variational Bayesian methods, Markov chain Monte Carlo methods, etc.


References

1. Robert W. Keener, "Statistical Theory: Notes for a Course in Theoretical Statistics," 2006.

2. Robert W. Keener, "Theoretical Statistics: Topics for a Core Course," 2010.

3. Christopher Bishop, "Pattern Recognition and Machine Learning," 2007.