
Page 1: Model Inference and Averaging

Model Inference and Averaging

Prof. Liqing Zhang

Dept. Computer Science & Engineering,

Shanghai Jiaotong University


Page 3: Model Inference and Averaging


Parametric Model

• Assume a parameterized probability density (parametric model) for observations
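For instance, one might assume a Gaussian model for the observations (the symbols here are only illustrative):

  z_i \sim g_\theta(z) = N(\mu, \sigma^2), \qquad \theta = (\mu, \sigma^2), \qquad i = 1, \ldots, N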

Page 4: Model Inference and Averaging


Maximum Likelihood Inference

• Suppose we are trying to measure the true value of some quantity (xT).

– We make repeated measurements of this quantity {x1, x2, … xn}.

– The standard way to estimate xT from our measurements is to calculate the mean value:

  \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i

and set xT = μx.

DOES THIS PROCEDURE MAKE SENSE?

The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.

Page 5: Model Inference and Averaging


The Maximum Likelihood Method

• Statement of the Maximum Likelihood Method
  – Assume we have made N measurements of x: {x1, x2, …, xN}.
  – Assume we know the probability distribution function that describes x: f(x, α).
  – Assume we want to determine the parameter α.
• MLM: pick the α that maximizes the probability of getting the measurements (the xi's) we did!

Page 6: Model Inference and Averaging


The MLM Implementation

• The probability of measuring x1 is f(x1, α) dx.
• The probability of measuring x2 is f(x2, α) dx.
• The probability of measuring xN is f(xN, α) dx.
• If the measurements are independent, the probability of getting the measurements we did is:

  L = f(x_1, \alpha)\,dx \cdot f(x_2, \alpha)\,dx \cdots f(x_N, \alpha)\,dx = \left[\prod_{i=1}^{N} f(x_i, \alpha)\right](dx)^N

• We can drop the (dx)^N term as it is only a proportionality constant:

  L(\alpha) = \prod_{i=1}^{N} f(x_i, \alpha) \quad\text{is called the Likelihood Function.}

Page 7: Model Inference and Averaging


Log Maximum Likelihood Method

• We want to pick the α that maximizes L:

  \frac{\partial L}{\partial \alpha}\bigg|_{\alpha = \alpha^*} = 0

  – Often easier to maximize ln L.
  – L and ln L are both maximum at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:

  \ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)

Page 8: Model Inference and Averaging


Log Maximum Likelihood Method

• The new maximization condition is:

  \frac{\partial \ln L}{\partial \alpha}\bigg|_{\alpha = \alpha^*} = \sum_{i=1}^{N} \frac{\partial \ln f(x_i, \alpha)}{\partial \alpha}\bigg|_{\alpha = \alpha^*} = 0

• α could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations to determine α range from simple linear equations to coupled non-linear equations.

Page 9: Model Inference and Averaging


An Example: Gaussian

• Let f(x, α) be given by a Gaussian distribution function.
• Let α = μ be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of α from our set of N measurements {x1, x2, …, xN}.
• Let's assume that σ is the same for each measurement.

Page 10: Model Inference and Averaging


An Example: Gaussian

• Gaussian PDF:

  f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

• The likelihood function for this problem is:

  L = \prod_{i=1}^{N} f(x_i, \mu) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{N} e^{-\sum_{i=1}^{N}\frac{(x_i - \mu)^2}{2\sigma^2}}

Page 11: Model Inference and Averaging


An Example: Gaussian

• We want to find the μ that maximizes the log likelihood function:

  \ln L = \sum_{i=1}^{N} \ln\!\left[\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}\right] = N \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}

  \frac{\partial \ln L}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{N} 2(x_i - \mu)(-1) = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0

  \Rightarrow\quad \sum_{i=1}^{N} x_i - N\mu = 0 \quad\Rightarrow\quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i

• The maximum likelihood estimate of the mean is just the simple average of the measurements.
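As a quick numerical check of this result (not part of the original slides; the data, σ, and function names below are illustrative), the sketch maximizes the Gaussian log-likelihood with scipy and compares the optimum with the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # N measurements with a common sigma
sigma = 2.0

def neg_log_likelihood(mu):
    # -ln L(mu) for a Gaussian with known, common sigma
    return -np.sum(-0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

mu_hat = minimize_scalar(neg_log_likelihood).x
print(mu_hat, x.mean())   # the two values agree up to numerical tolerance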

Page 12: Model Inference and Averaging


An Example: Gaussian

• If the σi are different for each data point, then μ is just the weighted average:

  \mu = \frac{\sum_{i=1}^{N} x_i/\sigma_i^2}{\sum_{i=1}^{N} 1/\sigma_i^2} \qquad\text{(Weighted Average)}

Page 13: Model Inference and Averaging


An Example: Poisson

• Let f(x, α) be given by a Poisson distribution.
• Let α = μ be the mean of the Poisson.
• We want the best estimate of α from our set of N measurements {x1, x2, …, xN}.
• Poisson PDF:

  f(x, \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}

Page 14: Model Inference and Averaging


An Example: Poisson

• The likelihood function for this problem is:

  L = \prod_{i=1}^{N} f(x_i, \mu) = \frac{e^{-\mu}\mu^{x_1}}{x_1!}\cdot\frac{e^{-\mu}\mu^{x_2}}{x_2!}\cdots\frac{e^{-\mu}\mu^{x_N}}{x_N!} = \frac{e^{-N\mu}\,\mu^{\sum_{i=1}^{N} x_i}}{x_1!\,x_2!\cdots x_N!}

Page 15: Model Inference and Averaging


An Example: Poisson

• Find the μ that maximizes the log likelihood function:

  \ln L = -N\mu + \left(\sum_{i=1}^{N} x_i\right)\ln\mu - \ln(x_1!\,x_2!\cdots x_N!)

  \frac{d \ln L}{d\mu} = -N + \frac{1}{\mu}\sum_{i=1}^{N} x_i = 0 \quad\Rightarrow\quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\text{(Average)}

Page 16: Model Inference and Averaging


General properties of MLM

• For large data samples (large N) the likelihood function L approaches a Gaussian distribution.
• Maximum likelihood estimates are usually consistent.
  – For large N the estimates converge to the true value of the parameters we wish to determine.
• Maximum likelihood estimates are usually unbiased.
  – In most cases the parameter of interest is estimated without systematic error (though some ML estimates, e.g. the Gaussian variance, are slightly biased for small samples).

Page 17: Model Inference and Averaging


General properties of MLM

• Maximum likelihood estimates are efficient: asymptotically, the estimate has the smallest possible variance (it attains the Cramér–Rao bound).
• Maximum likelihood estimates are sufficient: they use all the information in the observations (the xi's).
• The solution from MLM is unique (for well-behaved likelihoods with a single maximum).
• Bad news: we must know the correct probability distribution for the problem at hand!

Page 18: Model Inference and Averaging


Maximum Likelihood

• We maximize the likelihood function

• Log-likelihood function

Page 19: Model Inference and Averaging


Score Function

• Assess the precision of the estimate θ̂ using the likelihood function.
• Assume that L takes its maximum in the interior of the parameter space. Then
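A sketch of the score condition in the standard notation (ℓ(θ; z_i) denotes the log-likelihood contribution of observation z_i, ℓ̇ its derivative in θ):

  \dot\ell(\theta; z_i) = \frac{\partial \ell(\theta; z_i)}{\partial \theta}, \qquad \sum_{i=1}^{N} \dot\ell(\hat\theta; z_i) = 0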

Page 20: Model Inference and Averaging


Likelihood Function

• We maximize the likelihood function.
• We omit the normalization since it only adds a constant factor.
• Think of L as a function of θ with Z fixed.
• Log-likelihood function:
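A sketch of the two displays, assuming observations Z = {z_1, …, z_N} drawn independently from a density g_θ(z):

  L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i), \qquad \ell(\theta; Z) = \sum_{i=1}^{N} \ln g_\theta(z_i)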

Page 21: Model Inference and Averaging


Fisher Information

• The negative sum of second derivatives is the information matrix I(θ).
• Evaluated at θ = θ̂, I(θ̂) is called the observed information; it should be greater than 0 (positive definite).
• The Fisher information (expected information) is:
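A sketch of these quantities in the same notation (ℓ(θ; z_i) as above):

  I(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial \theta\, \partial \theta^{T}}, \qquad i(\theta) = E_\theta\!\left[I(\theta)\right]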

Page 22: Model Inference and Averaging


Sampling Theory

• Basic result of sampling theory:
• The sampling distribution of the max-likelihood estimator approaches the following normal distribution as N → ∞,
• when we sample independently from the true distribution.
• This suggests approximating the sampling distribution of θ̂ with a normal distribution centered at θ̂.
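A sketch of the result, writing θ_0 for the true parameter:

  \hat\theta \;\longrightarrow\; N\!\left(\theta_0,\; i(\theta_0)^{-1}\right) \quad\text{as } N \to \infty,

which suggests approximating the sampling distribution of θ̂ by N(θ̂, i(θ̂)^{-1}) or N(θ̂, I(θ̂)^{-1}).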

Page 23: Model Inference and Averaging


Error Bound

• The corresponding error estimates are obtained from

• The confidence points have the form
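A sketch of the standard formulas under the normal approximation above:

  \widehat{\mathrm{se}}(\hat\theta_j) = \sqrt{\left[I(\hat\theta)^{-1}\right]_{jj}}, \qquad \hat\theta_j \;\pm\; z^{(1-\alpha)}\, \widehat{\mathrm{se}}(\hat\theta_j),

where z^{(1−α)} is the 1−α quantile of the standard normal distribution (e.g. 1.96 gives an approximate 95% interval).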

Page 24: Model Inference and Averaging


Simplified form of the Fisher information

Suppose, in addition, that the operations of integration and differentiation can be swapped for the second derivative of f(x;θ) as well, i.e.,

In this case, it can be shown that the Fisher information equals

The Cramér–Rao bound can then be written as
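In the single-parameter case these three displays read:

  \frac{\partial^2}{\partial\theta^2}\int f(x;\theta)\,dx = \int \frac{\partial^2 f(x;\theta)}{\partial\theta^2}\,dx,

  I(\theta) = -\,E\!\left[\frac{\partial^2}{\partial\theta^2}\ln f(X;\theta)\right],

  \mathrm{var}(\hat\theta) \;\ge\; \frac{1}{I(\theta)}.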

Page 25: Model Inference and Averaging


Single-parameter proof

Let X be a random variable with probability density function f(x;θ). Here T = t(X) is a statistic, which is used as an estimator for ψ(θ). If the expectation of T is denoted by ψ(θ), then, for all θ, we seek a lower bound on var(T). If V is the score, i.e. V = ∂ ln f(X;θ)/∂θ, then the expectation of V, written E(V), is zero. If we consider the covariance cov(V, T) of V and T, we have cov(V, T) = E(VT), because E(V) = 0. Expanding this expression we have

Page 26: Model Inference and Averaging

Using the chain rule

and the definition of expectation gives, after cancelling f(x;θ),

because the integration and differentiation operations commute (second condition).

The Cauchy-Schwarz inequality shows that

Therefore
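A sketch of the omitted displays, following the standard single-parameter argument:

  \mathrm{cov}(V, T) = E(VT) = \int t(x)\,\frac{\partial}{\partial\theta}\ln f(x;\theta)\; f(x;\theta)\,dx = \int t(x)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int t(x)\,f(x;\theta)\,dx = \psi'(\theta),

  [\mathrm{cov}(V, T)]^2 \;\le\; \mathrm{var}(V)\,\mathrm{var}(T) = I(\theta)\,\mathrm{var}(T),

  \mathrm{var}(T) \;\ge\; \frac{[\psi'(\theta)]^2}{I(\theta)}.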


Page 27: Model Inference and Averaging

An Example

• Consider a linear expansion

  \mu(x) = \sum_{j} \beta_j h_j(x)

• The least square error solution

  \hat\beta = (H^T H)^{-1} H^T y

• The covariance of β

  \widehat{\mathrm{cov}}(\hat\beta) = (H^T H)^{-1} \hat\sigma^2, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat\mu(x_i)\right)^2,

  where H is the basis matrix with entries H_{ij} = h_j(x_i).

Page 28: Model Inference and Averaging

An Example

• The confidence region

  Consider the prediction model \hat\mu(x) = \sum_j \hat\beta_j h_j(x). The standard deviation of the prediction is

  \widehat{\mathrm{se}}[\hat\mu(x)] = \left[h(x)^T (H^T H)^{-1} h(x)\right]^{1/2} \hat\sigma,

  and the confidence band is \hat\mu(x) \pm 1.96\, \widehat{\mathrm{se}}[\hat\mu(x)].
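A minimal numerical sketch of this example (the basis functions, data, and variable names below are illustrative, not from the slides):

import numpy as np

# illustrative data and a cubic polynomial basis h_j(x) = x**j
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 3.0, size=50))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=x.size)

H = np.vander(x, N=4, increasing=True)            # N x 4 basis matrix, columns h_j(x_i)
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)      # least squares solution (H^T H)^{-1} H^T y
mu_hat = H @ beta_hat
sigma2_hat = np.sum((y - mu_hat) ** 2) / x.size   # noise variance estimate

# pointwise standard error and approximate 95% band
HtH_inv = np.linalg.inv(H.T @ H)
se = np.sqrt(np.einsum('ij,jk,ik->i', H, HtH_inv, H) * sigma2_hat)
lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se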

Page 29: Model Inference and Averaging


Bayesian Methods

• Given a sampling model Pr(Z | θ) and a prior Pr(θ) for the parameters, estimate the posterior probability Pr(θ | Z).
• By drawing samples from it, or by estimating its mean or mode.
• Differences to mere counting (frequentist approach):
  – Prior: allows for uncertainties present before seeing the data.
  – Posterior: allows for uncertainties present after seeing the data.
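The posterior is given by Bayes' rule in this notation:

  \Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta')\,\Pr(\theta')\,d\theta'}.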

Page 30: Model Inference and Averaging


Bayesian Methods

• The posterior distribution also affords a predictive distribution for seeing future values Z^new.
• In contrast, the max-likelihood approach would predict future data on the basis of Pr(Z^new | θ̂), not accounting for the uncertainty in estimating the parameters.
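A sketch of the two predictive rules being contrasted:

  \Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta \qquad\text{versus}\qquad \Pr(z^{new} \mid \hat\theta).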

Page 31: Model Inference and Averaging

An Example

• Consider a linear expansion

  \mu(x) = \sum_{j} \beta_j h_j(x)

• The least square error solution

  \hat\beta = (H^T H)^{-1} H^T y

Page 32: Model Inference and Averaging

• The posterior distribution for β is also Gaussian, with mean and covariance as sketched below.
• The corresponding posterior values for μ(x) follow from the expansion μ(x) = Σ_j β_j h_j(x).
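A sketch under one common choice, a Gaussian prior β ~ N(0, τΣ) with known noise variance σ² (this particular prior is an assumption here, following Hastie et al.):

  E(\beta \mid Z) = \left(H^T H + \tfrac{\sigma^2}{\tau}\,\Sigma^{-1}\right)^{-1} H^T y, \qquad \mathrm{cov}(\beta \mid Z) = \left(H^T H + \tfrac{\sigma^2}{\tau}\,\Sigma^{-1}\right)^{-1}\sigma^2,

with the posterior for the regression function given by E(μ(x) | Z) = h(x)^T E(β | Z).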



Page 34: Model Inference and Averaging


Bootstrap vs Bayesian

• The bootstrap mean is an approximate posterior average

• Simple example:
  – Single observation z drawn from a normal distribution.
  – Assume a normal prior for θ.
  – Resulting posterior distribution.
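A sketch of this example, assuming unit noise variance: z | θ ~ N(θ, 1) and prior θ ~ N(0, τ) give the posterior

  \theta \mid z \;\sim\; N\!\left(\frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau}\right),

and as the prior becomes noninformative (τ → ∞) this tends to N(z, 1), the same distribution the parametric bootstrap samples from.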

Page 35: Model Inference and Averaging


Bootstrap vs Bayesian

• Three ingredients make this work:
  – The choice of a noninformative prior for θ.
  – The dependence of the log-likelihood ℓ(θ; Z) on Z only through the max-likelihood estimate θ̂; thus ℓ(θ; Z) = ℓ(θ; θ̂).
  – The symmetry of ℓ(θ; θ̂) in θ and θ̂.

Page 36: Model Inference and Averaging

Bootstrap vs Bayesian

• The bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter.
• But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.
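A tiny simulation sketch of the correspondence for a normal mean (it assumes z_i ~ N(θ, 1) with known unit variance and a flat prior; all names are illustrative):

import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(loc=1.0, scale=1.0, size=30)          # observations z_i ~ N(theta, 1)

# nonparametric bootstrap distribution of the sample mean
boot = np.array([rng.choice(z, size=z.size, replace=True).mean() for _ in range(5000)])

# Bayesian posterior for theta under a flat (noninformative) prior: theta | z ~ N(mean(z), 1/n)
post = rng.normal(loc=z.mean(), scale=1.0 / np.sqrt(z.size), size=5000)

print(boot.mean(), post.mean())   # centres agree (roughly the sample mean)
print(boot.std(), post.std())     # spreads agree (roughly 1/sqrt(n))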


Page 37: Model Inference and Averaging


The EM Algorithm

• The EM algorithm for two-component Gaussian mixtures:
  – Take initial guesses for the parameters \hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2, \hat\pi.
  – Expectation Step: Compute the responsibilities.
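A sketch of the responsibilities (with φ_θ the normal density, as in Algorithm 8.1 of Hastie et al.):

  \hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1-\hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, \ldots, N,

where θ̂_k = (μ̂_k, σ̂_k²).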

Page 38: Model Inference and Averaging


The EM Algorithm

– Maximization Step: Compute the weighted means and variances

– Iterate 2 and 3 until convergence
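A compact Python sketch of this two-component EM loop (the data and initial guesses are illustrative; the updates follow the weighted-mean and weighted-variance form described above):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(4.0, 0.7, 40)])  # mixed sample

# initial guesses for mu1, sigma1, mu2, sigma2, pi
mu1, mu2 = y.min(), y.max()
sig1 = sig2 = y.std()
pi = 0.5

for _ in range(200):
    # E-step: responsibility of component 2 for each observation
    p1 = (1 - pi) * norm.pdf(y, mu1, sig1)
    p2 = pi * norm.pdf(y, mu2, sig2)
    gamma = p2 / (p1 + p2)

    # M-step: weighted means, variances, and mixing probability
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    sig1 = np.sqrt(np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma))
    sig2 = np.sqrt(np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma))
    pi = gamma.mean()

print(mu1, sig1, mu2, sig2, pi)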

Page 39: Model Inference and Averaging


The EM Algorithm in General

• Also known as the Baum–Welch algorithm (in the hidden Markov model literature).
• Applicable to problems for which maximizing the log-likelihood is difficult, but is simplified by enlarging the sample with unobserved (latent) data (data augmentation).

Page 40: Model Inference and Averaging


The EM Algorithm in General

Page 41: Model Inference and Averaging


The EM Algorithm in General

• Start with initial parameters θ̂^(0).
• Expectation Step: at the j-th step compute Q(θ', θ̂^(j)) as a function of the dummy argument θ'.
• Maximization Step: Determine the new parameters θ̂^(j+1) by maximizing Q(θ', θ̂^(j)) over θ'.
• Iterate steps 2 and 3 (the E and M steps) until convergence.
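A sketch of the quantity being computed and maximized, with T the complete (augmented) data and ℓ_0 its log-likelihood:

  Q(\theta', \hat\theta^{(j)}) = E\!\left[\ell_0(\theta'; T) \,\middle|\, Z, \hat\theta^{(j)}\right], \qquad \hat\theta^{(j+1)} = \arg\max_{\theta'}\, Q(\theta', \hat\theta^{(j)}).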

Page 42: Model Inference and Averaging


Model Averaging and Stacking

• Given predictions f̂_1(x), f̂_2(x), …, f̂_M(x).
• Under square-error loss, seek weights w = (w_1, w_2, …, w_M) such that (see the sketch below)
• Here the input x is fixed and the N observations in Z are distributed according to P.
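A sketch of the criterion:

  \hat w = \arg\min_{w} \; E_P\!\left[\,Y - \sum_{m=1}^{M} w_m \hat f_m(x)\right]^2.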

Page 43: Model Inference and Averaging


Model Averaging and Stacking

• The solution is the population linear regression of Y on F̂(x) ≡ [f̂_1(x), f̂_2(x), …, f̂_M(x)]^T, namely (see below).
• Now the full regression has smaller error than any single model.
• The population linear regression is not available in practice, so we replace it by a linear regression over the training set; this leads to stacking (see the sketch below).
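A sketch of these displays (writing F̂(x) as above, and f̂_m^{-i} for model m fit with the i-th observation left out):

  \hat w = E_P\!\left[\hat F(x)\hat F(x)^T\right]^{-1} E_P\!\left[\hat F(x)\, Y\right],

  E_P\!\left[\,Y - \sum_{m} \hat w_m \hat f_m(x)\right]^2 \;\le\; E_P\!\left[\,Y - \hat f_m(x)\right]^2 \quad\text{for each } m,

  \hat w^{\,\mathrm{st}} = \arg\min_{w} \sum_{i=1}^{N}\left[\,y_i - \sum_{m=1}^{M} w_m \hat f_m^{-i}(x_i)\right]^2 \qquad\text{(stacking weights)}.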