
Bayesian Learning & Estimation Theory

Maximum likelihood estimation

L = Πn P(xn | θ), and the maximum likelihood (ML) estimate is θML = argmaxθ L

• Example: For a Gaussian likelihood P(x | θ) = N(x | μ, σ²), the ML estimates of μ and σ² are the sample mean and the sample variance
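As a quick numerical check of this example (a minimal sketch assuming NumPy; the data values are made up), the ML estimates are just the sample mean and the biased sample variance:

```python
import numpy as np

# Made-up one-dimensional data set; any real-valued sample works here
x = np.array([2.1, 1.7, 3.0, 2.4, 1.9])

# Maximizing L = prod_n N(x_n | mu, sigma^2) over mu and sigma^2 gives
# the sample mean and the biased (divide-by-N) sample variance
mu_ml = x.mean()
sigma2_ml = np.var(x)          # same as ((x - mu_ml) ** 2).mean()

print(mu_ml, sigma2_ml)
```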

Objective of regression: Minimize error

E(w) = ½ Σn ( tn − y(xn, w) )²

A probabilistic view of linear regression

• Gaussian likelihood: p(t | x, w) = Πn N( tn | y(xn, w), β⁻¹ ), where the precision β = 1/σ²

• Compare to the error function: E(w) = ½ Σn ( tn − y(xn, w) )²

• Since argminw E(w) = argmaxw p(t | x, w), regression is equivalent to ML estimation of w
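A small sketch of this equivalence (assuming NumPy; the toy data, the degree M, and the noise level are made-up choices): the least-squares solution is the ML estimate of w, and the ML noise precision β is the inverse of the mean squared residual.

```python
import numpy as np

# Toy data: noisy samples of a smooth target (values are illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Polynomial model y(x, w) = w0 + w1*x + ... + wM*x^M
M = 3
Phi = np.vander(x, M + 1, increasing=True)       # design matrix: columns 1, x, ..., x^M

# Least squares: minimizes E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2,
# which is also the ML estimate of w under Gaussian noise
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise precision beta = 1/sigma^2
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
```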

Bayesian learning

• View the data D and the parameter θ as random variables (for regression, D = (x, t) and θ = w)

• The data induces a distribution over the parameter:

P(θ | D) = P(D, θ) / P(D) ∝ P(D, θ)

• Substituting P(D, θ) = P(D | θ) P(θ), we obtain Bayes’ theorem:

P(θ | D) ∝ P(D | θ) P(θ), i.e. Posterior ∝ Likelihood × Prior

Bayesian prediction

• Predictions (eg, predict t from x using data D) are mediated through the parameter:

P(prediction | D) = ∫ P(prediction | θ) P(θ | D) dθ

• Maximum a posteriori (MAP) estimation:

θMAP = argmaxθ P(θ | D)

P(prediction | D) ≈ P(prediction | θMAP)

– Accurate when P(θ | D) is concentrated around θMAP
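The ideas on this slide and the previous one can be checked numerically on a grid. Below is a minimal sketch (assuming NumPy; the 1-D Gaussian-mean model, the data, and the prior width are all made-up choices): the posterior is formed as likelihood × prior, and the full Bayesian predictive is compared with the MAP plug-in approximation.

```python
import numpy as np

# Grid over a 1-D parameter theta: the unknown mean of a Gaussian with unit variance
theta = np.linspace(-5.0, 5.0, 1001)
D = np.array([0.8, 1.2, 0.5])                    # made-up observations

prior = np.exp(-0.5 * theta**2 / 4.0)            # N(theta | 0, 2^2), unnormalized
lik = np.prod(np.exp(-0.5 * (D[:, None] - theta)**2), axis=0)   # prod_n N(x_n | theta, 1)
posterior = lik * prior                          # P(theta|D) is proportional to P(D|theta) P(theta)
posterior /= np.trapz(posterior, theta)          # normalizing divides by P(D)

# Bayesian prediction for a new point vs. the MAP plug-in approximation
x_new = 1.0
theta_map = theta[np.argmax(posterior)]
pred = np.exp(-0.5 * (x_new - theta)**2) / np.sqrt(2 * np.pi)   # P(x_new | theta)

p_full = np.trapz(pred * posterior, theta)       # integral of P(x_new|theta) P(theta|D) dtheta
p_map = np.exp(-0.5 * (x_new - theta_map)**2) / np.sqrt(2 * np.pi)
# p_map is close to p_full when P(theta|D) is concentrated around theta_MAP
```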

A probabilistic view of regularized regression

• E(w) = ½ Σn ( tn − y(xn, w) )² + (λ/2) Σm wm²

• Prior: w’s are IID Gaussian

p(w) = Πm (1/√(2πλ⁻¹)) exp{ −λ wm² / 2 }, i.e. each wm ~ N(0, λ⁻¹)

• Since argminw E(w) = argmaxw p(t|x, w) p(w), regularized regression is equivalent to MAP estimation of w

– The two terms of E(w) correspond, up to constants, to −ln p(t|x, w) and −ln p(w)
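A minimal sketch of the resulting MAP estimator (assuming NumPy; the data, degree, and λ below are illustrative): minimizing the regularized error has the closed form w = (ΦᵀΦ + λI)⁻¹ Φᵀt, i.e. ridge regression.

```python
import numpy as np

# Toy data (illustrative values)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

M, lam = 9, 1e-3                                 # high-order polynomial, small regularizer
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E(w) = 1/2*sum_n (t_n - y(x_n,w))^2 + lam/2 * sum_m w_m^2
# gives w = (Phi^T Phi + lam*I)^(-1) Phi^T t -- the MAP estimate under
# the IID Gaussian prior on the weights
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
```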

Bayesian linear regression

• Likelihood: p(t | x, w) = Πn N( tn | y(xn, w), β⁻¹ )

– β specifies the precision of the data noise

• Prior: p(w) = Πm=0..M N( wm | 0, α⁻¹ )

– α specifies the precision of the weights

• Posterior: p(w | D) ∝ p(t | x, w) p(w)

– This is an M+1 dimensional Gaussian density

• Prediction: p(t | x, D) = ∫ p(t | x, w) p(w | D) dw

– The posterior and the prediction are computed using linear algebra (see textbook)
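A sketch of that linear algebra (assuming NumPy and the standard closed-form expressions for a Gaussian prior and Gaussian noise; the toy data is made up, while M = 9, α = 5×10⁻³, β = 11.1 follow the example on the later slides): the posterior over w is Gaussian with covariance S = (αI + βΦᵀΦ)⁻¹ and mean m = βSΦᵀt, and the predictive distribution at a new input is Gaussian as well.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior mean m and covariance S over the weights w."""
    S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ t
    return m, S

def predictive(phi_x, m, S, beta):
    """Predictive mean and variance at an input with feature vector phi_x."""
    return phi_x @ m, 1.0 / beta + phi_x @ S @ phi_x

# Example usage with a polynomial basis (toy data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

M, alpha, beta = 9, 5e-3, 11.1
Phi = np.vander(x, M + 1, increasing=True)
m, S = posterior(Phi, t, alpha, beta)
mean, var = predictive(np.vander(np.array([0.5]), M + 1, increasing=True)[0], m, S, beta)
```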

Example: y(x) = w0 + w1x

[Figure: prior, likelihood, posterior, and y(x) curves sampled from the posterior, shown as data arrive: no data, 1st point, 2nd point, …, 20th point.]

Example: y(x) = w0 + w1x + … + wMxM

• M = 9, α = 5×10⁻³: gives a reasonable range of functions

• β = 11.1: known precision of the noise

Mean and one std dev of the predictive distribution

Example: y(x) = w0 + w1φ1(x) + … + wMφM(x)

Gaussian basis functions: φj(x) = exp{ −(x − μj)² / (2s²) }
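A short sketch of a Gaussian-basis design matrix (assuming NumPy; the centres μj and the width s are illustrative choices), which can be plugged into the Bayesian linear regression sketch above in place of the polynomial basis:

```python
import numpy as np

def gaussian_design_matrix(x, centres, s):
    """Rows are [1, phi_1(x_n), ..., phi_M(x_n)] with phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / s ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])

x = np.linspace(0, 1, 10)
Phi = gaussian_design_matrix(x, centres=np.linspace(0, 1, 9), s=0.1)
```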

How are we doing on the pass sequence?

• Least squares regression…

[Plot: hand-labeled horizontal coordinate, t, with the least-squares fit shown as a red line.]

The red line doesn’t reveal different levels of uncertainty in predictions

Cross validation reduced the training data, so the red line isn’t as accurate as it should be

Choosing a particular M and w seems wrong – we should hedge our bets

How are we doing on the pass sequence?


Bayesian regression

Estimation theory

• Provided with a predictive distribution p(t|x), how do we estimate a single value for t?

– Example: In the pass sequence, Cupid must aim at and hit the man in the white shirt, without hitting the man in the striped shirt

• Define L(t, t*) as the loss incurred by estimating t* when the true value is t

• Assuming p(t|x) is correct, the expected loss is

E[L] = ∫ L(t, t*) p(t|x) dt

• The minimum loss estimate is found by minimizing E[L] w.r.t. t*

Squared loss

• A common choice: L(t, t*) = ( t − t* )²

E[L] = ∫ ( t − t* )² p(t|x) dt

– Not appropriate for Cupid’s problem

• To minimize E[L], set its derivative to zero:

dE[L]/dt* = −2 ∫ ( t − t* ) p(t|x) dt = 0

⇒ −∫ t p(t|x) dt + t* = 0

• Minimum mean squared error (MMSE) estimate:

t* = E[t | x] = ∫ t p(t|x) dt

For regression: t* = y(x,w)
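A minimal numerical sketch (assuming NumPy; the predictive density below is an arbitrary made-up bimodal example, loosely in the spirit of Cupid’s two targets): the MMSE estimate is the mean of p(t|x), which here lands between the two modes, illustrating why squared loss is not appropriate for that problem.

```python
import numpy as np

# A made-up bimodal predictive density p(t|x) on a grid
t = np.linspace(0, 10, 1001)
p = np.exp(-0.5 * (t - 6.0) ** 2) + 0.5 * np.exp(-0.5 * (t - 2.0) ** 2 / 0.25)
p /= np.trapz(p, t)                       # normalize

t_mmse = np.trapz(t * p, t)               # t* = E[t|x] minimizes the expected squared loss
```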

Other loss functions

[Plots of the squared loss and the absolute loss.]

Absolute loss

L = |t* − t1| + |t* − t2| + |t* − t3| + |t* − t4| + |t* − t5| + |t* − t6| + |t* − t7|

• Consider moving t* to the left by ε

– L decreases by 6ε and increases by ε

– Changes in L are balanced when t* = t4

• The median of t under p(t|x) minimizes absolute loss

• Important: The median is invariant to monotonic transformations of t

[Figure: samples t1 … t7 and the estimate t* marked on the t axis.]

[Figure: the mean and the median of p(t|x).]
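A quick numerical check of the seven-sample argument (assuming NumPy; the sample values t1 … t7 are made up): a grid search over t* finds the minimum of the summed absolute loss at the sample median, t4.

```python
import numpy as np

t_samples = np.array([1.0, 2.0, 3.5, 4.0, 5.5, 7.0, 9.0])     # made-up t1 ... t7

def abs_loss(t_star):
    """Summed absolute loss for a candidate estimate t*."""
    return np.sum(np.abs(t_star - t_samples))

grid = np.linspace(0.0, 10.0, 2001)
t_best = grid[np.argmin([abs_loss(g) for g in grid])]
print(t_best, np.median(t_samples))       # both equal t4 = 4.0
```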

D-dimensional estimation

• Suppose t is D-dimensional, t = (t1, …, tD)

– Example: 2-dimensional tracking

• Approach 1: Minimum marginal loss estimation

– Find each td* that minimizes ∫ L(td, td*) p(td|x) dtd

• Approach 2: Minimum joint loss estimation

– Define a joint loss L(t, t*)

– Find t* that minimizes ∫ L(t, t*) p(t|x) dt
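A sketch contrasting the two approaches (assuming NumPy; the correlated 2-D samples stand in for draws from p(t|x), and absolute/Euclidean losses are one illustrative choice): approach 1 takes per-dimension medians, while approach 2 searches for the point with the smallest expected joint loss.

```python
import numpy as np

# Made-up correlated 2-D samples standing in for draws from p(t|x)
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])

# Approach 1: minimum marginal loss -- each dimension handled separately
# (the per-dimension median minimizes the marginal absolute loss)
t_marginal = np.median(samples, axis=0)

# Approach 2: minimum joint loss -- pick the candidate minimizing the expected
# joint loss; here L(t, t*) = ||t - t*|| (Euclidean distance), with the search
# restricted to the samples themselves for simplicity
def expected_joint_loss(t_star):
    return np.mean(np.linalg.norm(samples - t_star, axis=1))

t_joint = min(samples, key=expected_joint_loss)
```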

Questions?

[Plot: hand-labeled horizontal coordinate, t, versus feature, x.]

Compute 1st moment: x = 224

How are we doing on the pass sequence?

• Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data

• Can we track the man in the white shirt?

[Plot: fraction of pixels in each column with intensity > 0.9, over horizontal location 0 to 320, with t = 290 marked.]

Man in white shirt is occluded.


Not very well.

[Plot: hand-labeled horizontal coordinate, t, versus feature, x.]

Regression fails to identify that there really are two classes of solution
