Crash course in probability theory and statistics – part 2

Machine Learning, Wed Apr 16, 2008

Page 1: Crash course in probability theory and statistics – part 2

Machine Learning, Wed Apr 16, 2008

Page 2: Motivation

All models are wrong, but some are useful.

This lecture introduces distributions that have proven useful in constructing models.

Page 3: Densities, statistics and estimators

A probability (density) is any function X ↦ p(X) that satisfies the axioms of probability theory.

A statistic is any function x ↦ f(x) of the observed data.

An estimator is a statistic x ↦ μ̂ used for estimating a parameter μ of the probability density.

Page 4: Estimators

Assume D = {x_1, x_2, ..., x_N} are independent, identically distributed (i.i.d.) outcomes of our experiments (observed data).

Desirable properties of an estimator μ̂ are:

μ̂ → μ for N → ∞ (consistent)

and E[μ̂] = μ (unbiased)
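These two properties can be checked empirically. Below is a minimal sketch (not from the slides; the Bernoulli parameter μ = 0.3, the sample sizes, and the seed are arbitrary choices), using the sample mean as the estimator:

```python
import random

random.seed(0)
mu_true = 0.3  # hypothetical Bernoulli parameter for the simulation

def sample_mean(n):
    # Sample mean of n Bernoulli(mu_true) outcomes.
    return sum(random.random() < mu_true for _ in range(n)) / n

# Unbiasedness: averaged over many repeated small experiments,
# the estimator's expectation matches the true parameter.
estimates = [sample_mean(10) for _ in range(20000)]
avg = sum(estimates) / len(estimates)

# Consistency: a single large experiment gets close to mu as N grows.
big = sample_mean(200000)

assert abs(avg - mu_true) < 0.01
assert abs(big - mu_true) < 0.01
```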

Page 5: Estimators

A general way to get an estimator is the maximum likelihood (ML) or maximum a posteriori (MAP) approach:

ML: μ̂ = argmax_μ p(D|μ)

MAP: μ̂ = argmax_μ p(μ|D) = argmax_μ p(D|μ) p(μ)

...possibly compensating for any bias in these.

Page 6: Bayesian estimation

In a fully Bayesian approach we instead update our distribution based on the observed data:

p(μ|D) = p(D|μ) p(μ) / p(D)

Page 7: Conjugate priors

If the prior distribution belongs to a certain class of functions, p(μ) ∈ C, and the product of likelihood and prior belongs to the same class, p(x|μ) p(μ) ∈ C, then the prior is called a conjugate prior.

Page 8: Conjugate priors

Typical situation: the product of likelihood and conjugate prior, p(x|μ) p(μ), is proportional to a well-known function f with a known normalizing constant, so the posterior can be normalized directly without computing p(x).

Page 9: Conjugate priors

Page 10: Bernoulli and binomial distribution

Bernoulli distribution: a single event with binary outcome, p(x|μ) = μ^x (1−μ)^(1−x) for x ∈ {0, 1}.

Binomial: sum m of N Bernoulli outcomes, Bin(m|N, μ) = (N choose m) μ^m (1−μ)^(N−m).

Mean for Bernoulli: E[x] = μ

Mean for Binomial: E[m] = Nμ

Page 11: Bernoulli and binomial distribution

Bernoulli maximum likelihood for μ:

p(D|μ) = ∏_n μ^(x_n) (1−μ)^(1−x_n)

Sufficient statistic: m = Σ_n x_n, the number of successes.

Page 12: Bernoulli and binomial distribution

Bernoulli maximum likelihood for μ:

ML estimate: μ̂_ML = m/N = (1/N) Σ_n x_n

Page 13: Bernoulli and binomial distribution

Bernoulli maximum likelihood for μ:

ML estimate: μ̂_ML = (1/N) Σ_n x_n, the average of the observations, so E[μ̂_ML] = μ.
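A minimal sketch of the estimate (the data vector is hypothetical): the sufficient statistic m is all we need from the data.

```python
# ML estimate for a Bernoulli parameter: the sufficient statistic is the
# number of successes m, and the estimate is the average m / N.
data = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical observed outcomes

m = sum(data)           # sufficient statistic
mu_ml = m / len(data)   # ML estimate: the sample average

assert mu_ml == 5 / 8
```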

Page 14: Bernoulli and binomial distribution

Similar for the binomial distribution, p(m|N, μ) = Bin(m|N, μ):

ML estimate: μ̂_ML = m/N

Page 15: Beta distribution

Beta distribution: Beta(μ|a, b) = Γ(a+b)/(Γ(a)Γ(b)) μ^(a−1) (1−μ)^(b−1)

Gamma function: Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du

Page 16: Beta distribution

Beta distribution: Beta(μ|a, b) = Γ(a+b)/(Γ(a)Γ(b)) μ^(a−1) (1−μ)^(b−1), with mean E[μ] = a/(a+b).

Page 17: Beta distribution

Beta distribution: Beta(μ|a, b) = 1/B(a, b) μ^(a−1) (1−μ)^(b−1)

Beta function: B(a, b) = Γ(a)Γ(b)/Γ(a+b)

Page 18: Beta distribution

Beta distribution: Beta(μ|a, b) = Γ(a+b)/(Γ(a)Γ(b)) μ^(a−1) (1−μ)^(b−1)

Normalizing constant: the factor Γ(a+b)/(Γ(a)Γ(b)) = 1/B(a, b) ensures ∫_0^1 Beta(μ|a, b) dμ = 1.

Page 19: Beta distribution

Conjugate to Bernoulli/Binomial:

p(μ|m, l, a, b) ∝ μ^(m+a−1) (1−μ)^(l+b−1), where m is the number of successes and l = N − m the number of failures.

Posterior distribution: Beta(μ|m+a, l+b)

Page 20: Beta distribution

Conjugate to Bernoulli/Binomial:

Posterior distribution: p(μ|m, l, a, b) = Γ(m+a+l+b)/(Γ(m+a)Γ(l+b)) μ^(m+a−1) (1−μ)^(l+b−1)
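The conjugate update reduces to adding counts to the prior's parameters. A small sketch, with hypothetical prior parameters and data:

```python
# Beta-Bernoulli conjugate update: a Beta(a, b) prior plus m successes and
# l failures gives a Beta(a + m, b + l) posterior -- only the counts are
# needed, no integration.
a, b = 2.0, 2.0              # hypothetical prior pseudo-counts
data = [1, 1, 0, 1, 0, 1]    # hypothetical observations
m = sum(data)                # successes
l = len(data) - m            # failures

a_post, b_post = a + m, b + l
posterior_mean = a_post / (a_post + b_post)  # mean of Beta(a', b') is a'/(a'+b')

assert (a_post, b_post) == (6.0, 4.0)
assert abs(posterior_mean - 0.6) < 1e-12
```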

Page 21: Multinomial distribution

One out of K classes: x is a bit vector with Σ_k x_k = 1.

Distribution: p(x|μ) = ∏_k μ_k^(x_k)

Likelihood: p(D|μ) = ∏_k μ_k^(m_k)

Sufficient statistic: the class counts m_k = Σ_n x_nk

Page 22: Multinomial distribution

Maximum likelihood estimate: μ̂_k = m_k / N
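As a sketch (the class labels and data are hypothetical), the ML estimate is just normalized class counts:

```python
from collections import Counter

# ML estimate for multinomial parameters: normalized class counts m_k / N.
data = ["a", "b", "a", "c", "a", "b"]  # hypothetical one-of-K observations
counts = Counter(data)                 # sufficient statistic: m_k per class
n = len(data)
mu_ml = {k: counts[k] / n for k in counts}

assert mu_ml["a"] == 3 / 6
assert abs(sum(mu_ml.values()) - 1.0) < 1e-12  # estimates form a distribution
```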

Page 23: Dirichlet distribution

Dirichlet distribution: Dir(μ|α) = Γ(α_0)/(Γ(α_1)···Γ(α_K)) ∏_k μ_k^(α_k−1), where α_0 = Σ_k α_k.

Page 24: Dirichlet distribution

Dirichlet distribution: Dir(μ|α) = Γ(α_0)/(Γ(α_1)···Γ(α_K)) ∏_k μ_k^(α_k−1)

Normalizing constant: Γ(α_0)/(Γ(α_1)···Γ(α_K)) ensures the density integrates to 1 over the simplex Σ_k μ_k = 1.

Page 25: Dirichlet distribution

Conjugate to Multinomial:

p(μ|D, α) ∝ p(D|μ) Dir(μ|α) ∝ ∏_k μ_k^(α_k+m_k−1), so the posterior is Dir(μ|α + m).

Page 26: Gaussian/Normal distribution

Scalar variable: N(x|μ, σ²) = 1/√(2πσ²) exp(−(x−μ)²/(2σ²))

Vector variable: N(x|μ, Σ) = 1/((2π)^(D/2) |Σ|^(1/2)) exp(−(1/2)(x−μ)ᵀ Σ⁻¹ (x−μ)), where Σ is a symmetric, real, D × D matrix.
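As a quick check of the scalar density (a sketch; the parameters and the integration grid are arbitrary choices), the formula integrates to approximately 1:

```python
import math

# Scalar Gaussian density N(x | mu, sigma2).
def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Riemann sum over a wide interval around the mean (+- 20, i.e. many
# standard deviations) should be close to 1.
mu, sigma2 = 1.0, 2.0
step = 0.001
total = sum(normal_pdf(mu + i * step, mu, sigma2) * step
            for i in range(-20000, 20000))

assert abs(total - 1.0) < 1e-3
```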

Page 27: Geometry of a Gaussian

Sufficient statistic: the Gaussian depends on x only through the quadratic form Δ² = (x−μ)ᵀ Σ⁻¹ (x−μ).

There is a linear transformation U (the eigenvectors of Σ as rows) so that U Σ Uᵀ = Λ, where Λ is a diagonal matrix. Then

Δ² = Σ_i y_i²/λ_i, where y = U(x−μ).

Page 28: Geometry of a Gaussian

The Gaussian is constant when Δ² = Σ_i y_i²/λ_i is constant: an ellipse in the U coordinate system, with axes along the eigenvectors of Σ and lengths proportional to √λ_i.

Page 29: Geometry of a Gaussian

Page 30: Parameters of a Gaussian

Parameters are the mean and variance/covariance: E[x] = μ and cov[x] = Σ.

ML estimates are:

μ̂_ML = (1/N) Σ_n x_n

Σ̂_ML = (1/N) Σ_n (x_n − μ̂_ML)(x_n − μ̂_ML)ᵀ

Page 31: Parameters of a Gaussian

ML estimates are:

μ̂_ML = (1/N) Σ_n x_n (unbiased)

Σ̂_ML = (1/N) Σ_n (x_n − μ̂_ML)(x_n − μ̂_ML)ᵀ (biased)

Page 32: Parameters of a Gaussian

ML estimates are:

μ̂_ML = (1/N) Σ_n x_n (unbiased)

Σ̂_ML = (1/N) Σ_n (x_n − μ̂_ML)(x_n − μ̂_ML)ᵀ (biased)

Intuition: the variance estimator is based on the mean estimator, which is fitted to the data and therefore fits better than the real mean, so the variance is underestimated.

Correction: Σ̃ = N/(N−1) Σ̂_ML, which is unbiased.
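The bias and its correction show up clearly in simulation. A sketch, assuming a true N(0, 4) distribution and N = 5 samples per experiment (both arbitrary choices):

```python
import random

random.seed(1)

# ML variance estimator: the mean is fitted to the same data, so
# (1/N) * sum((x - xbar)^2) underestimates the true variance by a
# factor (N-1)/N; multiplying by N/(N-1) corrects this.
def ml_variance(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)

n, true_var = 5, 4.0
trials = [ml_variance([random.gauss(0.0, 2.0) for _ in range(n)])
          for _ in range(40000)]
avg_ml = sum(trials) / len(trials)

# E[ml_variance] is close to (N-1)/N * true_var = 3.2, not 4.0.
assert abs(avg_ml - (n - 1) / n * true_var) < 0.1
# The corrected estimator N/(N-1) * ml_variance is unbiased.
assert abs(avg_ml * n / (n - 1) - true_var) < 0.15
```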

Page 33: Gaussian is its own conjugate

For fixed variance σ², the prior p(μ) = N(μ|μ_0, σ_0²) is conjugate to the likelihood p(D|μ) = ∏_n N(x_n|μ, σ²):

p(μ|D) ∝ p(D|μ) p(μ) is again a Gaussian, with a known normalizing constant.

Page 34: Gaussian is its own conjugate

For fixed variance: p(μ|D) = N(μ|μ_N, σ_N²)

with:

μ_N = σ²/(Nσ_0² + σ²) μ_0 + Nσ_0²/(Nσ_0² + σ²) μ_ML, where μ_ML = (1/N) Σ_n x_n

1/σ_N² = 1/σ_0² + N/σ²
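A small numerical sketch of this update (prior parameters, known variance, and data values are all hypothetical):

```python
# Conjugate update of a Gaussian mean with known variance sigma2:
# prior N(mu0, s0_2), data x_1..x_N  ->  posterior N(mu_n, s_n2) with
#   mu_n   = (sigma2 * mu0 + s0_2 * sum(x)) / (sigma2 + N * s0_2)
#   1/s_n2 = 1/s0_2 + N/sigma2
mu0, s0_2 = 0.0, 1.0       # hypothetical prior mean and variance
sigma2 = 1.0               # known observation variance
data = [2.0, 1.0, 3.0]     # hypothetical observations
n, s = len(data), sum(data)

mu_n = (sigma2 * mu0 + s0_2 * s) / (sigma2 + n * s0_2)
s_n2 = 1.0 / (1.0 / s0_2 + n / sigma2)

assert abs(mu_n - 1.5) < 1e-12   # pulled from prior mean 0.0 toward data mean 2.0
assert abs(s_n2 - 0.25) < 1e-12  # more data -> narrower posterior
```

The posterior mean is a precision-weighted compromise between the prior mean and the sample mean, and the posterior variance shrinks as N grows.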

Page 35: Mixtures of Gaussians

A Gaussian has a single mode (peak) and cannot model multi-modal distributions.

Instead, we can use mixtures: p(x) = Σ_k π_k N(x|μ_k, Σ_k), with Σ_k π_k = 1.

Page 36: Mixtures of Gaussians

In the mixture p(x) = Σ_k π_k N(x|μ_k, Σ_k):

π_k is the probability of selecting the given Gaussian (the mixing proportion), and N(x|μ_k, Σ_k) is the conditional distribution of x given that component.

Page 37: Mixtures of Gaussians

For mixtures there is no closed-form ML solution; numerical methods are needed for parameter estimation.
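While estimation needs numerical methods, evaluating and sampling a mixture is straightforward. A sketch with two hypothetical, equally weighted components at ±3:

```python
import math
import random

random.seed(2)

# Two-component Gaussian mixture: p(x) = pi1*N(x|m1,v1) + pi2*N(x|m2,v2).
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

components = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]  # (weight, mean, variance)

def mixture_pdf(x):
    return sum(w * normal_pdf(x, m, v) for w, m, v in components)

def sample():
    # Pick a component with probability pi_k, then draw from that Gaussian.
    w, m, v = random.choices(components, weights=[c[0] for c in components])[0]
    return random.gauss(m, math.sqrt(v))

# The density is bimodal: modes near +-3 and a dip between them at 0 --
# something no single Gaussian can represent.
assert mixture_pdf(0.0) < mixture_pdf(3.0)
assert mixture_pdf(0.0) < mixture_pdf(-3.0)

# Samples land on both sides of the dip.
xs = [sample() for _ in range(1000)]
assert any(x < -1 for x in xs) and any(x > 1 for x in xs)
```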

Page 38: Summary

● Bernoulli and Binomial, with Beta prior

● Multinomial with Dirichlet prior

● Gaussian with Gaussian prior

● Mixtures of Gaussians