# Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models



Outline: Overview · MLE · Continuous Distributions · Discrete Distributions · Mixture Models · Frequentism & Bayesianism

Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models

Mihaela van der Schaar

Department of Engineering Science, University of Oxford



## Generative vs Discriminative Approaches

Machine learning: learn a (random) function that maps a variable X (the features) to a variable Y (the class) using a labeled dataset M = {(X_1, Y_1), …, (X_n, Y_n)}.

Discriminative approach: learn P(Y|X). Generative approach: learn P(Y,X) = P(Y|X)P(X).

*(Figure: scatter plot of Class 1 and Class 2 in the (X1, X2) feature plane.)*
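The factorization P(Y,X) = P(Y|X)P(X) can be checked on a tiny toy example: estimate the joint P(Y,X) from counts (the generative object), then recover the conditional P(Y|X) = P(Y,X)/P(X) from it. A minimal sketch with hypothetical data, not from the lecture:

```python
from collections import Counter

# Hypothetical toy dataset of (feature, class) pairs.
data = [("a", 1), ("a", 1), ("a", 0), ("b", 0), ("b", 0), ("b", 1)]
n = len(data)

# Generative object: the joint P(Y, X), estimated from counts.
joint = {pair: c / n for pair, c in Counter(data).items()}

# Marginal P(X).
px = {x: c / n for x, c in Counter(x for x, _ in data).items()}

# Discriminative object: P(Y | X), recovered as P(Y, X) / P(X).
cond = {(x, y): joint[(x, y)] / px[x] for (x, y) in joint}

print(cond[("a", 1)])  # P(Y = 1 | X = "a") = (2/6) / (3/6) = 2/3
```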



## Generative vs Discriminative Approaches

Discriminative approach: finds a good fit for P(Y|X) without explicitly modeling the generative process. Example techniques: k-nearest neighbors, logistic regression, linear regression, SVMs, perceptrons, etc. Example problem: given two classes, separate them.

*(Figure: scatter plot of Class 1 and Class 2 in the (X1, X2) feature plane.)*
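As a concrete instance of one of the listed techniques, here is a minimal 1-nearest-neighbour classifier: it predicts a label directly from labeled points without modeling P(X). The training points are hypothetical, chosen to resemble the two clusters in the figure:

```python
# Minimal 1-nearest-neighbour classifier: predict the label of the
# closest training point under squared Euclidean distance.
def one_nn(train, query):
    """Return the label of the training point closest to query."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(train, key=lambda pl: dist2(pl[0], query))
    return label

# Hypothetical labeled points in the (X1, X2) plane.
train = [((-4.0, 1.0), "Class 1"), ((-3.5, 0.5), "Class 1"),
         ((1.0, 3.0), "Class 2"), ((1.5, 4.0), "Class 2")]

print(one_nn(train, (-3.0, 1.0)))  # -> Class 1
```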


## Generative vs Discriminative Approaches

Generative approach: finds a probabilistic model (a joint distribution P(Y,X)) that explicitly models the distribution of both the features and the corresponding labels (classes). Example techniques: Naive Bayes, Hidden Markov Models, etc.

*(Figure: left, the class posteriors P(Y=0|X) and P(Y=1|X) as functions of X; right, the feature density P(X).)*
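A minimal sketch of one of the listed techniques, Gaussian Naive Bayes in one dimension: estimate a class prior P(Y) and a per-class Gaussian for P(X|Y), then classify via P(Y|X) ∝ P(X|Y)P(Y). The data below are hypothetical:

```python
import math
from statistics import mean, pstdev

def fit(xs, ys):
    """Per class: estimate the prior P(Y) and a Gaussian for P(X | Y)."""
    model = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        model[label] = (len(vals) / len(ys), mean(vals), pstdev(vals))
    return model

def predict(model, x):
    """Pick the class maximizing log P(Y) + log P(X | Y)."""
    def log_joint(prior, mu, sigma):
        # Gaussian log-density, dropping the constant -0.5 * log(2*pi).
        return math.log(prior) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
    return max(model, key=lambda label: log_joint(*model[label]))

xs = [-1.0, 0.0, 1.0, 9.0, 10.0, 11.0]  # hypothetical feature values
ys = [0, 0, 0, 1, 1, 1]                 # class labels
model = fit(xs, ys)
print(predict(model, 2.0), predict(model, 8.5))  # -> 0 1
```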


## Generative vs Discriminative Approaches

Generative approach: finds parameters that explain all of the data.

- Makes use of all the data.
- Flexible framework: can accommodate many tasks (e.g. classification, regression, survival analysis, generating new samples similar to the existing dataset).
- Stronger modeling assumptions.

Discriminative approach: finds parameters that help to predict the relevant data.

- Learns to perform better on the given task.
- Weaker modeling assumptions.
- Less immune to overfitting.


## Problem Setup

We are given a dataset D = (X_i)_{i=1}^n with n entries. Example: the X_i are the annual incomes of n individuals picked at random from a large population.

Goal: estimate the probability distribution that describes the entire population from which the random samples (X_i)_{i=1}^n are drawn.

*(Figure: left, a probability density surface over the (X1, X2) plane; right, random samples scattered in the (X1, X2) plane.)*

What we observe: random samples drawn from a distribution. What we want to estimate: the distribution!


## Models and Likelihood Functions

### Parametric Families of Distributions

The key to making progress: restrict attention to a parametrized family of distributions. The formalization is as follows: the dataset D comprises independent and identically distributed (iid) samples from a distribution P_θ with parameter θ, i.e.

$$X_i \sim P_\theta, \qquad (X_1, X_2, \ldots, X_n) \sim P_\theta^{\otimes n}.$$

The distribution P_θ belongs to the family P, i.e.

$$\mathcal{P} = \{P_\theta : \theta \in \Theta\},$$

where Θ is the parameter space. Estimating the distribution P_θ ∈ P reduces to estimating the parameter θ ∈ Θ.
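To make the abstraction concrete, a parametric family can be represented as a map from θ to a distribution. A minimal sketch for the Bernoulli family (an assumed example, not from the slides), with Θ = [0, 1]:

```python
# The family P = {P_theta : theta in Theta} for the Bernoulli case:
# each theta in [0, 1] picks out exactly one distribution, so estimating
# the distribution reduces to estimating the single number theta.
def bernoulli_pmf(theta):
    """Return the pmf x -> P_theta(x) for x in {0, 1}."""
    return lambda x: theta if x == 1 else 1.0 - theta

p = bernoulli_pmf(0.3)  # one member of the family
print(p(1), p(0))       # P_theta(1) = 0.3, P_theta(0) = 0.7
```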


### The Likelihood Function

How is the family of models P related to the dataset D? The likelihood function L_n(θ, D) is defined as

$$L_n(\theta, D) = \prod_{i=1}^{n} P_\theta(X_i).$$

Intuitively, L_n(θ, D) quantifies how compatible any choice of θ is with the occurrence of D.

### Maximum Likelihood Estimator (MLE)

Given a dataset D of size n drawn from a distribution P_θ ∈ P, the MLE of θ is defined as

$$\hat{\theta}_n^* = \underset{\theta \in \Theta}{\arg\max}\; L_n(\theta, D).$$
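The definition can be exercised on a small example. Assuming the model N(θ, 1) with known unit variance (a choice made here for illustration), the MLE of θ has the closed form "sample mean"; a crude grid search over the log-likelihood recovers it. The samples are hypothetical:

```python
import math
from statistics import fmean

# Hypothetical iid samples, assumed drawn from N(theta, 1).
xs = [1.2, 0.7, 2.1, 1.5, 0.9]

def log_lik(theta):
    # log L_n(theta, D) = sum_i log p_theta(x_i) for the N(theta, 1) density.
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in xs)

# Maximize over a grid approximating Theta = [-3, 3].
grid = [i / 1000 for i in range(-3000, 3001)]
theta_hat = max(grid, key=log_lik)
print(theta_hat, fmean(xs))  # grid maximiser matches the sample mean, 1.28
```

Maximizing the log-likelihood rather than the likelihood itself is standard: the logarithm is monotone, so the argmax is unchanged, and sums are numerically better behaved than long products.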


## Why Maximum Likelihood Estimation?

Why is $\hat{\theta}_n^*$ a good estimator of θ?

1. Consistency: the estimate $\hat{\theta}_n^*$ converges to θ in probability:
   $$\hat{\theta}_n^* \overset{p}{\to} \theta.$$
2. Asymptotic normality: asymptotic confidence intervals can be computed:
   $$\sqrt{n}\,(\hat{\theta}_n^* - \theta) \overset{d}{\to} \mathcal{N}(0, \sigma^2(\theta)).$$
3. Asymptotic efficiency: the asymptotic variance of $\hat{\theta}_n^*$ is, in fact, equal to the Cramér-Rao lower bound for the variance of a consistent, asymptotically normally distributed estimator.
4. Invariance under reparametrization: if g(·) is a continuous and continuously differentiable function, then the MLE of g(θ) is $g(\hat{\theta}_n^*)$.

See proofs in (Keener, Chapter 8).
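Property 1 can be illustrated numerically. Assuming the model X_i ∼ N(θ, 1) (an example chosen here, not from the slides), the MLE of θ is the sample mean, and simulation shows its error tightening around the true θ as n grows:

```python
import random
from statistics import fmean

# Simulate the MLE (sample mean) for X_i ~ N(theta, 1) at growing sample
# sizes; consistency says |theta_hat - theta| -> 0 in probability.
random.seed(0)  # fixed seed for reproducibility
theta = 2.0     # true parameter
errors = []
for n in (10, 1000, 100000):
    xs = [random.gauss(theta, 1.0) for _ in range(n)]
    errors.append(abs(fmean(xs) - theta))
print(errors)  # the error typically shrinks toward 0 as n grows
```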
