
Bayesian Learning & Estimation Theory

Maximum likelihood estimation

L = Πn P(xn | θ), and the maximum likelihood (ML) estimate is θML = argmaxθ L

• Example: For a Gaussian likelihood P(x | θ) = N(x | μ, σ²), the ML estimates of μ and σ² are the sample mean and the sample variance
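As a quick numerical check of this example (a minimal sketch assuming NumPy; the data values are made up), the ML estimates are just the sample mean and the biased sample variance:

```python
import numpy as np

# Made-up one-dimensional data set; any real-valued sample works here
x = np.array([2.1, 1.7, 3.0, 2.4, 1.9])

# Maximizing L = prod_n N(x_n | mu, sigma^2) over mu and sigma^2 gives
# the sample mean and the biased (divide-by-N) sample variance
mu_ml = x.mean()
sigma2_ml = np.var(x)          # same as ((x - mu_ml) ** 2).mean()

print(mu_ml, sigma2_ml)
```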

Objective of regression: Minimize error

E(w) = ½ Σn ( tn − y(xn, w) )²

A probabilistic view of linear regression

• Gaussian likelihood: p(t | x, w) = Πn N( tn | y(xn, w), β⁻¹ ), where the precision β = 1/σ²

• Compare to the error function: E(w) = ½ Σn ( tn − y(xn, w) )²

• Since argminw E(w) = argmaxw p(t | x, w), regression is equivalent to ML estimation of w
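A small sketch of this equivalence (assuming NumPy; the toy data, the degree M, and the noise level are made-up choices): the least-squares solution is the ML estimate of w, and the ML noise precision β is the inverse of the mean squared residual.

```python
import numpy as np

# Toy data: noisy samples of a smooth target (values are illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Polynomial model y(x, w) = w0 + w1*x + ... + wM*x^M
M = 3
Phi = np.vander(x, M + 1, increasing=True)       # design matrix: columns 1, x, ..., x^M

# Least squares: minimizes E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2,
# which is also the ML estimate of w under Gaussian noise
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise precision beta = 1/sigma^2
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
```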

Bayesian learning

• View the data D and the parameter θ as random variables (for regression, D = (x, t) and θ = w)

• The data induces a distribution over the parameter:

P(θ | D) = P(D, θ) / P(D) ∝ P(D, θ)

• Substituting P(D, θ) = P(D | θ) P(θ), we obtain Bayes’ theorem:

P(θ | D) ∝ P(D | θ) P(θ), i.e. Posterior ∝ Likelihood × Prior

Bayesian prediction

• Predictions (eg, predict t from x using data D) are mediated through the parameter:

P(prediction | D) = ∫ P(prediction | θ) P(θ | D) dθ

• Maximum a posteriori (MAP) estimation:

θMAP = argmaxθ P(θ | D)

P(prediction | D) ≈ P(prediction | θMAP)

– Accurate when P(θ | D) is concentrated around θMAP
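The ideas on this slide and the previous one can be checked numerically on a grid. Below is a minimal sketch (assuming NumPy; the 1-D Gaussian-mean model, the data, and the prior width are all made-up choices): the posterior is formed as likelihood × prior, and the full Bayesian predictive is compared with the MAP plug-in approximation.

```python
import numpy as np

# Grid over a 1-D parameter theta: the unknown mean of a Gaussian with unit variance
theta = np.linspace(-5.0, 5.0, 1001)
D = np.array([0.8, 1.2, 0.5])                    # made-up observations

prior = np.exp(-0.5 * theta**2 / 4.0)            # N(theta | 0, 2^2), unnormalized
lik = np.prod(np.exp(-0.5 * (D[:, None] - theta)**2), axis=0)   # prod_n N(x_n | theta, 1)
posterior = lik * prior                          # P(theta|D) is proportional to P(D|theta) P(theta)
posterior /= np.trapz(posterior, theta)          # normalizing divides by P(D)

# Bayesian prediction for a new point vs. the MAP plug-in approximation
x_new = 1.0
theta_map = theta[np.argmax(posterior)]
pred = np.exp(-0.5 * (x_new - theta)**2) / np.sqrt(2 * np.pi)   # P(x_new | theta)

p_full = np.trapz(pred * posterior, theta)       # integral of P(x_new|theta) P(theta|D) dtheta
p_map = np.exp(-0.5 * (x_new - theta_map)**2) / np.sqrt(2 * np.pi)
# p_map is close to p_full when P(theta|D) is concentrated around theta_MAP
```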

A probabilistic view of regularized regression

• E(w) = ½ Σn ( tn − y(xn, w) )² + (λ/2) Σm wm²

• Prior: w’s are IID Gaussian

p(w) = Πm (1/√(2πλ⁻¹)) exp{ −λ wm² / 2 }, i.e. each wm ~ N(0, λ⁻¹)

• Since argminw E(w) = argmaxw p(t|x, w) p(w), regularized regression is equivalent to MAP estimation of w

– The two terms of E(w) correspond, up to constants, to −ln p(t|x, w) and −ln p(w)
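A minimal sketch of the resulting MAP estimator (assuming NumPy; the data, degree, and λ below are illustrative): minimizing the regularized error has the closed form w = (ΦᵀΦ + λI)⁻¹ Φᵀt, i.e. ridge regression.

```python
import numpy as np

# Toy data (illustrative values)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

M, lam = 9, 1e-3                                 # high-order polynomial, small regularizer
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E(w) = 1/2*sum_n (t_n - y(x_n,w))^2 + lam/2 * sum_m w_m^2
# gives w = (Phi^T Phi + lam*I)^(-1) Phi^T t -- the MAP estimate under
# the IID Gaussian prior on the weights
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
```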

Bayesian linear regression

• Likelihood: p(t | x, w) = Πn N( tn | y(xn, w), β⁻¹ )

– β specifies the precision of the data noise

• Prior: p(w) = Πm=0..M N( wm | 0, α⁻¹ )

– α specifies the precision of the weights

• Posterior: p(w | D) ∝ p(t | x, w) p(w)

– This is an M+1 dimensional Gaussian density

• Prediction: p(t | x, D) = ∫ p(t | x, w) p(w | D) dw

– The posterior and the prediction are computed using linear algebra (see textbook)
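A sketch of that linear algebra (assuming NumPy and the standard closed-form expressions for a Gaussian prior and Gaussian noise; the toy data is made up, while M = 9, α = 5×10⁻³, β = 11.1 follow the example on the later slides): the posterior over w is Gaussian with covariance S = (αI + βΦᵀΦ)⁻¹ and mean m = βSΦᵀt, and the predictive distribution at a new input is Gaussian as well.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior mean m and covariance S over the weights w."""
    S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ t
    return m, S

def predictive(phi_x, m, S, beta):
    """Predictive mean and variance at an input with feature vector phi_x."""
    return phi_x @ m, 1.0 / beta + phi_x @ S @ phi_x

# Example usage with a polynomial basis (toy data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

M, alpha, beta = 9, 5e-3, 11.1
Phi = np.vander(x, M + 1, increasing=True)
m, S = posterior(Phi, t, alpha, beta)
mean, var = predictive(np.vander(np.array([0.5]), M + 1, increasing=True)[0], m, S, beta)
```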

Example: y(x) = w0 + w1x

[Figure: prior, likelihood, posterior, and y(x) curves sampled from the posterior, shown as data arrive: no data, 1st point, 2nd point, …, 20th point.]

Example: y(x) = w0 + w1x + … + wMxM

• M = 9, α = 5×10⁻³: gives a reasonable range of functions

• β = 11.1: known precision of the noise

Mean and one std dev of the predictive distribution

Example: y(x) = w0 + w1φ1(x) + … + wMφM(x)

Gaussian basis functions: φj(x) = exp{ −(x − μj)² / (2s²) }
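A short sketch of a Gaussian-basis design matrix (assuming NumPy; the centres μj and the width s are illustrative choices), which can be plugged into the Bayesian linear regression sketch above in place of the polynomial basis:

```python
import numpy as np

def gaussian_design_matrix(x, centres, s):
    """Rows are [1, phi_1(x_n), ..., phi_M(x_n)] with phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / s ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])

x = np.linspace(0, 1, 10)
Phi = gaussian_design_matrix(x, centres=np.linspace(0, 1, 9), s=0.1)
```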

How are we doing on the pass sequence?

• Least squares regression…

[Plot: hand-labeled horizontal coordinate, t, with the least-squares fit shown as a red line.]

The red line doesn’t reveal different levels of uncertainty in predictions

Cross validation reduced the training data, so the red line isn’t as accurate as it should be

Choosing a particular M and w seems wrong – we should hedge our bets

How are we doing on the pass sequence?


Bayesian regression

Estimation theory

• Provided with a predictive distribution p(t|x), how do we estimate a single value for t?

– Example: In the pass sequence, Cupid must aim at and hit the man in the white shirt, without hitting the man in the striped shirt

• Define L(t, t*) as the loss incurred by estimating t* when the true value is t

• Assuming p(t|x) is correct, the expected loss is

E[L] = ∫ L(t, t*) p(t|x) dt

• The minimum loss estimate is found by minimizing E[L] w.r.t. t*

Squared loss

• A common choice: L(t, t*) = ( t − t* )²

E[L] = ∫ ( t − t* )² p(t|x) dt

– Not appropriate for Cupid’s problem

• To minimize E[L], set its derivative to zero:

dE[L]/dt* = −2 ∫ ( t − t* ) p(t|x) dt = 0

⇒ −∫ t p(t|x) dt + t* = 0

• Minimum mean squared error (MMSE) estimate:

t* = E[t | x] = ∫ t p(t|x) dt

For regression: t* = y(x,w)
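A minimal numerical sketch (assuming NumPy; the predictive density below is an arbitrary made-up bimodal example, loosely in the spirit of Cupid’s two targets): the MMSE estimate is the mean of p(t|x), which here lands between the two modes, illustrating why squared loss is not appropriate for that problem.

```python
import numpy as np

# A made-up bimodal predictive density p(t|x) on a grid
t = np.linspace(0, 10, 1001)
p = np.exp(-0.5 * (t - 6.0) ** 2) + 0.5 * np.exp(-0.5 * (t - 2.0) ** 2 / 0.25)
p /= np.trapz(p, t)                       # normalize

t_mmse = np.trapz(t * p, t)               # t* = E[t|x] minimizes the expected squared loss
```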

Other loss functions

[Plots of the squared loss and the absolute loss.]

Absolute loss

L = |t* − t1| + |t* − t2| + |t* − t3| + |t* − t4| + |t* − t5| + |t* − t6| + |t* − t7|

• Consider moving t* to the left by ε

– L decreases by 6ε and increases by ε

– Changes in L are balanced when t* = t4

• The median of t under p(t|x) minimizes absolute loss

• Important: The median is invariant to monotonic transformations of t

[Figure: samples t1 … t7 and the estimate t* marked on the t axis.]

[Figure: the mean and the median of p(t|x).]
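A quick numerical check of the seven-sample argument (assuming NumPy; the sample values t1 … t7 are made up): a grid search over t* finds the minimum of the summed absolute loss at the sample median, t4.

```python
import numpy as np

t_samples = np.array([1.0, 2.0, 3.5, 4.0, 5.5, 7.0, 9.0])     # made-up t1 ... t7

def abs_loss(t_star):
    """Summed absolute loss for a candidate estimate t*."""
    return np.sum(np.abs(t_star - t_samples))

grid = np.linspace(0.0, 10.0, 2001)
t_best = grid[np.argmin([abs_loss(g) for g in grid])]
print(t_best, np.median(t_samples))       # both equal t4 = 4.0
```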

D-dimensional estimation

• Suppose t is D-dimensional, t = (t1, …, tD)

– Example: 2-dimensional tracking

• Approach 1: Minimum marginal loss estimation

– Find each td* that minimizes ∫ L(td, td*) p(td|x) dtd

• Approach 2: Minimum joint loss estimation

– Define a joint loss L(t, t*)

– Find t* that minimizes ∫ L(t, t*) p(t|x) dt
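A sketch contrasting the two approaches (assuming NumPy; the correlated 2-D samples stand in for draws from p(t|x), and absolute/Euclidean losses are one illustrative choice): approach 1 takes per-dimension medians, while approach 2 searches for the point with the smallest expected joint loss.

```python
import numpy as np

# Made-up correlated 2-D samples standing in for draws from p(t|x)
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])

# Approach 1: minimum marginal loss -- each dimension handled separately
# (the per-dimension median minimizes the marginal absolute loss)
t_marginal = np.median(samples, axis=0)

# Approach 2: minimum joint loss -- pick the candidate minimizing the expected
# joint loss; here L(t, t*) = ||t - t*|| (Euclidean distance), with the search
# restricted to the samples themselves for simplicity
def expected_joint_loss(t_star):
    return np.mean(np.linalg.norm(samples - t_star, axis=1))

t_joint = min(samples, key=expected_joint_loss)
```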

Questions?

[Plot: hand-labeled horizontal coordinate, t, versus feature, x.]

Compute 1st moment: x = 224

How are we doing on the pass sequence?

• Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data

• Can we track the man in the white shirt?

[Plot: fraction of pixels in each column with intensity > 0.9, over horizontal location 0 to 320, with t = 290 marked.]

Man in white shirt is occluded.


Not very well.

[Plot: hand-labeled horizontal coordinate, t, versus feature, x.]

Regression fails to identify that there really are two classes of solution
