

    1/31

    Outline: Overview · MLE · Continuous Distributions · Discrete Distributions · Mixture Models · Frequentism & Bayesianism

    Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models

    Mihaela van der Schaar

    Department of Engineering Science, University of Oxford


    2/31


    Generative vs Discriminative Approaches

    Machine learning: learn a (random) function that maps a variable X (feature) to a variable Y (class) using a (labeled) dataset M = {(X1, Y1), …, (Xn, Yn)}.

    Discriminative Approach: learn P(Y|X).

    Generative Approach: learn P(Y, X) = P(Y|X) P(X).

    [Figure: scatter plot of Class 1 and Class 2 samples in the (X1, X2) feature plane.]


    3/31


    Generative vs Discriminative Approaches

    Discriminative Approach: finds a good fit for P(Y|X) without explicitly modeling the generative process.

    Example techniques: k-nearest neighbors, logistic regression, linear regression, SVMs, perceptrons, etc.

    Example problem: given two classes, find a boundary that separates them (see the sketch after the figure below).

    [Figure: the same two-class scatter plot in the (X1, X2) plane, with Class 1 and Class 2 to be separated.]
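    As an illustrative sketch (not from the slides; the synthetic data and the use of scikit-learn are assumptions), a discriminative model such as logistic regression fits P(Y|X) directly from labeled pairs:

        # Hedged sketch: fit P(Y|X) directly with logistic regression on assumed synthetic data.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        # Two synthetic Gaussian blobs in the (X1, X2) plane, one per class.
        X = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(100, 2)),
                       rng.normal([0.0, 2.0], 1.0, size=(100, 2))])
        y = np.array([0] * 100 + [1] * 100)

        clf = LogisticRegression().fit(X, y)       # models P(Y|X) only, never P(X)
        print(clf.predict_proba([[-1.0, 1.0]]))    # estimated [P(Y=0|x), P(Y=1|x)]

    The model never estimates P(X); it only learns the conditional distribution needed to separate the classes.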


    4/31


    Generative vs Discriminative Approaches

    Generative Approach: finds a probabilistic model (a joint distribution P(Y, X)) that explicitly models the distribution of both the features and the corresponding labels (classes).

    Example techniques: Naive Bayes, Hidden Markov Models, etc.

    [Figure: left, the class posteriors P(Y = 0|X) and P(Y = 1|X) as functions of X; right, the marginal feature density P(X).]
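    As a hedged sketch (the one-dimensional Gaussian class-conditional model below is an assumption, not something specified on the slide), a simple generative classifier models P(X|Y) and P(Y); the joint P(Y, X) then yields both the posterior P(Y|X) via Bayes' rule and the marginal P(X):

        # Hedged sketch: a 1-D Gaussian class-conditional ("naive Bayes"-style) generative model.
        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        x0 = rng.normal(0.0, 1.0, 200)    # observed features with label Y = 0
        x1 = rng.normal(4.0, 2.0, 200)    # observed features with label Y = 1

        # Fitted generative model: class priors P(Y) and per-class Gaussians P(X|Y).
        prior = np.array([len(x0), len(x1)]) / (len(x0) + len(x1))
        mu, sigma = [x0.mean(), x1.mean()], [x0.std(), x1.std()]

        def joint(x, y):                   # P(Y = y, X = x) = P(Y = y) * P(X = x | Y = y)
            return prior[y] * norm.pdf(x, mu[y], sigma[y])

        x = 2.0
        p_x = joint(x, 0) + joint(x, 1)    # marginal P(X = x)
        print(joint(x, 1) / p_x)           # posterior P(Y = 1 | X = x) via Bayes' rule

    Because the joint distribution is modeled, the same fitted model can also generate new samples: draw Y from the prior, then X from the corresponding class-conditional Gaussian.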


    5/31


    Generative vs Discriminative Approaches

    Generative Approach: finds parameters that explain all the data.

    Makes use of all the data.
    Flexible framework: can incorporate many tasks (e.g. classification, regression, survival analysis, generating new data samples similar to the existing dataset, etc.).
    Stronger modeling assumptions.

    Discriminative Approach: finds parameters that help to predict the relevant data.

    Learns to perform better on the given task.
    Weaker modeling assumptions.
    Less immune to overfitting.


    6/31


    Problem Setup

    We are given a dataset D = (X1, …, Xn) with n entries. Example: the Xi are the annual incomes of n individuals picked randomly from a large population.

    Goal: estimate the probability distribution that describes the entire population from which the random samples X1, …, Xn are drawn.

    [Figure: left, a probability density surface over the (X1, X2) plane; right, random samples scattered in the (X1, X2) plane.]

    What we observe: random samples drawn from a distribution.
    What we want to estimate: the distribution itself!
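    As a small hedged sketch (the log-normal "income" model and its parameters are assumptions made purely for illustration), the setup can be mimicked by drawing samples from a known distribution and then pretending we only see the samples:

        # Hedged sketch: we observe only samples; the generating distribution is what we want to recover.
        import numpy as np

        rng = np.random.default_rng(0)
        incomes = rng.lognormal(mean=10.5, sigma=0.6, size=1000)   # assumed "population" model

        # What we observe: the samples (e.g., summarized by a histogram).
        density, bin_edges = np.histogram(incomes, bins=30, density=True)
        print(density[:5])   # a crude, nonparametric view of the unknown distribution

    The next slides ask how to turn such samples into a principled estimate of the underlying distribution.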


    7/31


    Models and Likelihood Functions

    Models: Parametric Families of Distributions

    Key to make progress: restrict to a parametrized family of distributions!

    The formalization is as follows: the dataset D comprises independent and identically distributed (iid) samples from a distribution Pθ with parameter θ, i.e.

    Xi ∼ Pθ for each i, and (X1, X2, …, Xn) ∼ Pθ^⊗n (the n-fold product distribution).

    The distribution Pθ belongs to the family P, i.e.

    P = {Pθ : θ ∈ Θ},

    where Θ is the parameter space. Estimating the distribution Pθ ∈ P reduces to estimating the parameter θ ∈ Θ!
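    For instance, one standard choice (not spelled out on this slide) is the Gaussian family P = {N(μ, σ²) : μ ∈ ℝ, σ² > 0}, with θ = (μ, σ²) and Θ = ℝ × (0, ∞); estimating the whole income distribution then reduces to estimating just the two numbers μ and σ².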


    8/31


    Models and Likelihood Functions

    The Likelihood Function

    How is the family of models P related to the dataset D? The likelihood function Ln(θ, D) is defined as

    Ln(θ, D) = ∏_{i=1}^{n} Pθ(Xi).

    Intuitively, Ln(θ, D) quantifies how compatible any choice of θ is with the observed dataset D.

    Maximum Likelihood Estimator (MLE)

    Given a dataset D of size n drawn from a distribution Pθ ∈ P, the MLE estimate of θ is defined as

    θ̂*_n = argmax_{θ ∈ Θ} Ln(θ, D).
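    As a hedged sketch (the Gaussian family and the specific data below are assumptions, not from the slides), the MLE can be computed either in closed form or by numerically maximizing the log-likelihood log Ln(θ, D) = Σi log Pθ(Xi):

        # Hedged sketch: numerical MLE for an assumed Gaussian family, theta = (mu, sigma).
        import numpy as np
        from scipy.optimize import minimize
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        data = rng.normal(loc=3.0, scale=2.0, size=500)    # iid samples; true theta = (3, 2)

        def neg_log_likelihood(theta):
            mu, log_sigma = theta                          # optimize log(sigma) so that sigma > 0
            return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

        result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
        mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
        print(mu_hat, sigma_hat)                           # numerical MLE
        print(data.mean(), data.std())                     # closed-form Gaussian MLE, for comparison

    Maximizing the log-likelihood is equivalent to maximizing the likelihood itself, since the logarithm is strictly increasing, and it is numerically far better behaved.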


    9/31


    Why Maximum Likelihood Estimation?

    Why is θ̂*_n a good estimator for θ?

    1. Consistency: the estimate θ̂*_n converges to θ in probability:

       θ̂*_n →_p θ.

    2. Asymptotic normality: we can compute asymptotic confidence intervals:

       √n (θ̂*_n − θ) →_d N(0, σ²(θ)).

    3. Asymptotic efficiency: the asymptotic variance of θ̂*_n is, in fact, equal to the Cramér-Rao lower bound for the variance of a consistent, asymptotically normally distributed estimator.

    4. Invariance under re-parametrization: if g(·) is a continuous and continuously differentiable function, then the MLE of g(θ) is g(θ̂*_n).

    See proofs in (Keener, Chapter 8).
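    As a small illustrative simulation (the exponential model and the specific numbers are assumptions, not from the slides), consistency and invariance can be checked empirically: for an Exponential(λ) model the MLE is λ̂ = 1/(sample mean), and by invariance the MLE of the mean g(λ) = 1/λ is simply g(λ̂), the sample mean.

        # Hedged sketch: consistency and invariance of the MLE for an assumed Exponential(lambda) model.
        import numpy as np

        rng = np.random.default_rng(0)
        true_rate = 2.0

        for n in (100, 10_000, 1_000_000):                 # consistency: the error shrinks as n grows
            sample = rng.exponential(scale=1.0 / true_rate, size=n)
            rate_hat = 1.0 / sample.mean()                 # MLE of lambda
            mean_hat = sample.mean()                       # MLE of 1/lambda, by invariance
            print(n, rate_hat, mean_hat)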
