
Page 1:

Chapter 15: Likelihood, Bayesian, and Decision Theory

AMS 572 Group Members: Yen-hsiu Chen, Valencia Joseph, Lola Ojo, Andrea Roberson, Dave Roelfs, Saskya Sauer, Olivia Shy, Ping Tung

Page 2:

Introduction

Maximum likelihood, Bayesian, and decision theory methods have proven themselves useful and necessary in the sciences, such as physics, as well as in research in general.

They provide a practical way to begin and carry out an analysis or experiment.

"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."

- R.A. Fisher

Page 3:

15.1 Maximum Likelihood Estimation

Page 4:

15.1.1 Likelihood Function

Objective: estimating the unknown parameter θ of a population distribution based on a random sample $x_1, \ldots, x_n$ from that distribution.

Previous chapters: intuitive estimates, e.g., the sample mean for the population mean.

To improve estimation, R. A. Fisher (1890-1962) proposed maximum likelihood estimation (MLE) in 1912-1922.

Page 5:

Ronald Aylmer Fisher (1890-1962)

The greatest of Darwin's successors.

Known for:
1912: Maximum likelihood
1922: F-test
1925: Analysis of variance (Statistical Methods for Research Workers)

Notable prizes: Royal Medal (1938), Copley Medal (1955)

Source: http://www-history.mcs.st-andrews.ac.uk/history/PictDisplay/Fisher.html

Page 6:

Joint p.d.f. vs. Likelihood Function

Identical quantities, different interpretations.

Joint p.d.f. of $X_1, \ldots, X_n$: a function of $x_1, \ldots, x_n$ for given θ, with a probability interpretation:

$$f(x_1, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) f(x_2 \mid \theta) \cdots f(x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Likelihood function of θ: a function of θ for given $x_1, \ldots, x_n$, with no probability interpretation:

$$L(\theta \mid x_1, \ldots, x_n) = f(x_1, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Page 7:

Example: Normal Distribution

Suppose $x_1, \ldots, x_n$ is a random sample from a normal distribution with p.d.f.

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$

with parameters $(\mu, \sigma^2)$. The likelihood function is

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\right] = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}$$
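As an aside, a minimal numerical sketch of this likelihood (Python, with hypothetical data; the closed-form MLEs $\bar{x}$ and $\hat{\sigma}^2 = \sum(x_i-\bar{x})^2/n$, derived later in the chapter, serve as a check):

```python
import numpy as np

def normal_log_likelihood(mu, sigma2, x):
    """Log of L(mu, sigma^2) for an i.i.d. normal sample x."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) \
           - np.sum((x - mu) ** 2) / (2 * sigma2)

x = np.array([4.2, 5.1, 3.8, 4.9, 5.5])   # hypothetical sample
mu_hat = x.mean()                          # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()    # MLE of sigma^2 (divisor n)
print(normal_log_likelihood(mu_hat, sigma2_hat, x))
# Any other (mu, sigma2) gives a smaller log-likelihood:
print(normal_log_likelihood(mu_hat + 0.5, sigma2_hat, x))
```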

Page 8:

15.1.2 Calculation of Maximum Likelihood Estimators (MLE)

The MLE of an unknown parameter θ is the value $\hat{\theta} = \hat{\theta}(x_1, \ldots, x_n)$ which maximizes the likelihood function $L(\theta \mid x_1, \ldots, x_n)$.

Example of MLE: 2 independent Bernoulli trials with success probability θ, where θ is known to be either 1/4 or 1/3, so the parameter space is Θ = {1/4, 1/3}.

Using the binomial distribution, the probabilities of observing x = 0, 1, 2 successes can be calculated.

Page 9:

Example of MLE: Probability of Observing x Successes

θ (parameter space Θ) | x = 0 | x = 1 | x = 2
1/4                   | 9/16  | 6/16  | 1/16
1/3                   | 4/9   | 4/9   | 1/9

• When x = 0, the MLE is $\hat{\theta} = 1/4$.
• When x = 1 or 2, the MLE is $\hat{\theta} = 1/3$.
• The MLE $\hat{\theta}$ is chosen to maximize $L(\theta \mid x)$ for the observed x.
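A short sketch of this argmax calculation (hypothetical Python illustration; `theta_space` mirrors the slide's two-point parameter space):

```python
from math import comb

theta_space = [1/4, 1/3]     # parameter space Theta from the slide
n = 2                        # two independent Bernoulli trials

for x in range(n + 1):       # observed number of successes x = 0, 1, 2
    # Binomial likelihood L(theta | x) for each candidate theta
    L = {t: comb(n, x) * t**x * (1 - t)**(n - x) for t in theta_space}
    mle = max(L, key=L.get)  # the MLE maximizes L over Theta
    print(f"x={x}: L(1/4)={L[1/4]:.4f}, L(1/3)={L[1/3]:.4f}, MLE={mle:.4f}")
```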

Page 10:

15.1.3 Properties of MLEs

Objective: optimality properties in large samples.

Fisher information (continuous case):

$$I(\theta) = \int_{-\infty}^{\infty} \left[\frac{d \ln f(x \mid \theta)}{d\theta}\right]^2 f(x \mid \theta)\, dx = E\left\{\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]^2\right\}$$

Alternative forms of the Fisher information:

(1) $$I(\theta) = E\left\{\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]^2\right\} = \mathrm{Var}\left\{\frac{d \ln f(X \mid \theta)}{d\theta}\right\}$$

(2) $$I(\theta) = -\int_{-\infty}^{\infty} \left[\frac{d^2 \ln f(x \mid \theta)}{d\theta^2}\right] f(x \mid \theta)\, dx = -E\left\{\frac{d^2 \ln f(X \mid \theta)}{d\theta^2}\right\}$$

Page 11:

Proof of (1): since $\int_{-\infty}^{\infty} f(x \mid \theta)\, dx = 1$, differentiating both sides with respect to θ gives

$$\int_{-\infty}^{\infty} \frac{d f(x \mid \theta)}{d\theta}\, dx = 0 \quad\Longrightarrow\quad \int_{-\infty}^{\infty} \frac{d f(x \mid \theta)}{d\theta} \frac{1}{f(x \mid \theta)}\, f(x \mid \theta)\, dx = 0$$

so that

$$\int_{-\infty}^{\infty} \frac{d \ln f(x \mid \theta)}{d\theta}\, f(x \mid \theta)\, dx = E\left\{\frac{d \ln f(X \mid \theta)}{d\theta}\right\} = 0$$

Since the score has mean zero, its variance equals $E\{[d \ln f(X \mid \theta)/d\theta]^2\}$, which gives (1):

$$I(\theta) = E\left\{\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]^2\right\} = \mathrm{Var}\left\{\frac{d \ln f(X \mid \theta)}{d\theta}\right\}$$

Page 12:

Proof of (2): differentiating $\int_{-\infty}^{\infty} \frac{d \ln f(x \mid \theta)}{d\theta}\, f(x \mid \theta)\, dx = 0$ once more with respect to θ,

$$\int_{-\infty}^{\infty} \frac{d^2 \ln f(x \mid \theta)}{d\theta^2}\, f(x \mid \theta)\, dx + \int_{-\infty}^{\infty} \frac{d \ln f(x \mid \theta)}{d\theta}\, \frac{d f(x \mid \theta)}{d\theta}\, dx = 0$$

Writing $\frac{d f(x \mid \theta)}{d\theta} = \frac{d \ln f(x \mid \theta)}{d\theta}\, f(x \mid \theta)$ in the second integral,

$$\int_{-\infty}^{\infty} \frac{d^2 \ln f(x \mid \theta)}{d\theta^2}\, f(x \mid \theta)\, dx + \int_{-\infty}^{\infty} \left[\frac{d \ln f(x \mid \theta)}{d\theta}\right]^2 f(x \mid \theta)\, dx = 0$$

which gives

$$I(\theta) = -\int_{-\infty}^{\infty} \left[\frac{d^2 \ln f(x \mid \theta)}{d\theta^2}\right] f(x \mid \theta)\, dx = -E\left\{\frac{d^2 \ln f(X \mid \theta)}{d\theta^2}\right\}$$
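The two forms of I(θ) can be checked by simulation. A sketch under an assumed model (a normal with known σ, for which I(μ) = 1/σ² is a standard result):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5                     # assumed parameter values
x = rng.normal(mu, sigma, size=200_000)

# Score: d/dmu ln f(x|mu) = (x - mu) / sigma^2
score = (x - mu) / sigma**2
# Second derivative: d^2/dmu^2 ln f(x|mu) = -1/sigma^2 (constant here)
second = -np.ones_like(x) / sigma**2

print(np.mean(score))      # ~ 0, as shown on the previous slide
print(np.mean(score**2))   # ~ I(mu) = 1/sigma^2 ~ 0.444, via form (1)
print(-np.mean(second))    # same value via form (2)
```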

Page 13:

MLE (continued)

Define the Fisher information for an i.i.d. sample: let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from p.d.f. $f(x \mid \theta)$. Then

$$I_n(\theta) = -E\left\{\frac{d^2 \ln f(X_1, X_2, \ldots, X_n \mid \theta)}{d\theta^2}\right\} = -E\left\{\frac{d^2}{d\theta^2}\left[\ln f(X_1 \mid \theta) + \cdots + \ln f(X_n \mid \theta)\right]\right\}$$

$$= -E\left\{\frac{d^2 \ln f(X_1 \mid \theta)}{d\theta^2}\right\} - \cdots - E\left\{\frac{d^2 \ln f(X_n \mid \theta)}{d\theta^2}\right\} = I(\theta) + \cdots + I(\theta) = nI(\theta)$$

Page 14:

MLE (continued)

Generalization of the Fisher information for a k-dimensional vector parameter: if the p.d.f. of an r.v. X is $f(x \mid \theta)$, where $\theta = (\theta_1, \ldots, \theta_k)$, the information matrix of θ, $I(\theta)$, has entries

$$I_{ij}(\theta) = E\left\{\left[\frac{\partial \ln f(x \mid \theta)}{\partial \theta_i}\right]\left[\frac{\partial \ln f(x \mid \theta)}{\partial \theta_j}\right]\right\} = -E\left\{\frac{\partial^2 \ln f(x \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right\}$$

Page 15:

MLE (continued)

Cramér-Rao Lower Bound: let $X_1, X_2, \ldots, X_n$ be a random sample from p.d.f. $f(x \mid \theta)$, and let $\hat{\theta}$ be any estimator of θ with $E(\hat{\theta}) = \theta + B(\theta)$, where $B(\theta)$ is the bias of $\hat{\theta}$. If $B(\theta)$ is differentiable in θ and certain regularity conditions hold, then

$$\mathrm{Var}(\hat{\theta}) \geq \frac{\left[1 + B'(\theta)\right]^2}{nI(\theta)} \qquad \text{(Cramér-Rao inequality)}$$

The ratio of the lower bound to the variance of any estimator of θ is called the efficiency of the estimator. An estimator with efficiency 1 is called an efficient estimator.
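For instance, for a normal mean, $\bar{X}$ is unbiased (B(θ) = 0) with variance σ²/n = 1/(nI(μ)), so its efficiency is 1. A small simulation sketch (parameter values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 2.0, 25      # assumed values
I_mu = 1 / sigma**2              # Fisher information for a normal mean
crlb = 1 / (n * I_mu)            # Cramer-Rao bound, unbiased case B(theta) = 0

xbars = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
print(xbars.var())               # ~ sigma^2 / n = 0.16
print(crlb)                      # the bound; efficiency of x-bar ~ 1
```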

Page 16:

15.1.4 Large Sample Inference Based on the MLEs

For large sample inference on an unknown parameter θ, use $\mathrm{Var}(\hat{\theta}) \approx \dfrac{1}{nI(\hat{\theta})}$, estimating the information by

$$\hat{I}(\hat{\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \left[\frac{d^2 \ln f(X_i \mid \theta)}{d\theta^2}\right]_{\theta = \hat{\theta}}$$

An approximate 100(1-α)% CI for θ is

$$\hat{\theta} - z_{\alpha/2}\sqrt{\frac{1}{n\hat{I}(\hat{\theta})}} \;\leq\; \theta \;\leq\; \hat{\theta} + z_{\alpha/2}\sqrt{\frac{1}{n\hat{I}(\hat{\theta})}}$$
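As a concrete sketch (Bernoulli model with hypothetical counts; here I(p) = 1/[p(1-p)], so the interval reduces to the familiar Wald CI):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical data: 37 successes in n = 100 Bernoulli trials
n, successes = 100, 37
p_hat = successes / n                  # MLE of p
I_hat = 1 / (p_hat * (1 - p_hat))      # Fisher information per observation
se = sqrt(1 / (n * I_hat))             # = sqrt(p_hat * (1 - p_hat) / n)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)
print(p_hat - z * se, p_hat + z * se)  # approximate 95% CI for p
```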

Page 17:

Delta Method for Approximating the Variance of an Estimator

Delta method: to estimate a nonlinear function h(θ), suppose that $E(\hat{\theta}) \approx \theta$, $\mathrm{Var}(\hat{\theta})$ is available, and h is a known function of θ. Expand $h(\hat{\theta})$ around θ using a first-order Taylor series:

$$h(\hat{\theta}) \approx h(\theta) + (\hat{\theta} - \theta)\, h'(\theta)$$

Using $E(\hat{\theta} - \theta) \approx 0$, it follows that

$$\mathrm{Var}\left[h(\hat{\theta})\right] \approx \left[h'(\theta)\right]^2 \mathrm{Var}(\hat{\theta})$$
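A simulation sketch of the delta method for one choice of h (the log-odds h(p) = ln[p/(1-p)] of a binomial proportion; parameter values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.3, 400                    # assumed values

# h(p) = log odds, so h'(p) = 1 / (p (1 - p));
# delta method: Var[h(p_hat)] ~ [h'(p)]^2 * Var(p_hat)
delta_var = (1 / (p * (1 - p)))**2 * p * (1 - p) / n

p_hat = rng.binomial(n, p, size=100_000) / n
h_hat = np.log(p_hat / (1 - p_hat))
print(h_hat.var(), delta_var)      # close for large n
```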

Page 18:

15.2 Likelihood Ratio Tests

Page 19:

15.2 Likelihood Ratio Tests

The last section presented an inference for point estimation based on likelihood theory. In this section, we present a corresponding inference for testing hypotheses.

Let $f(x; \theta)$ be a probability density function, where θ is a real-valued parameter taking values in an interval Θ that could be the whole real line. We call Θ the parameter space. An alternative hypothesis $H_1$ will restrict the parameter θ to some subset $\Theta_1$ of the parameter space Θ. The null hypothesis $H_0$ is then the complement of $\Theta_1$ with respect to Θ.

Page 20:

Consider the two-sided hypothesis $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$, where $\theta_0$ is a specified value.

We will test $H_0$ versus $H_1$ on the basis of the random sample $X_1, X_2, \ldots, X_n$ from $f(x; \theta)$. If the null hypothesis holds, we would expect the likelihood

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$

to be relatively large when evaluated at the prevailing value θ. Consider the ratio of two likelihood functions, namely

$$\lambda = \frac{L(\theta_0)}{L(\hat{\theta})}$$

Note that $\lambda \leq 1$, but if $H_0$ is true, λ should be close to 1, while if $H_1$ is true, λ should be smaller. For a specified significance level α, we have the decision rule: reject $H_0$ in favor of $H_1$ if $\lambda \leq c$, where c is such that $\alpha = P_{\theta_0}[\lambda \leq c]$. This test is called the likelihood ratio test.

Page 21:

Example 1

Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a normal distribution with known variance. Obtain the likelihood ratio for testing $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$.

$$L(\mu \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}}$$

$$\ln L(\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum(x_i-\mu)^2}{2\sigma^2}$$

$$\frac{\partial \ln L(\mu)}{\partial \mu} = \frac{\sum(x_i-\mu)}{\sigma^2} = 0 \quad\Longrightarrow\quad \hat{\mu} = \bar{x}$$

So $\hat{\mu} = \bar{x}$ is a maximum since $\dfrac{\partial^2 \ln L(\mu)}{\partial \mu^2} = -\dfrac{n}{\sigma^2} < 0$. Thus $\hat{\mu} = \bar{x}$ is the MLE of μ.

Page 22:

Example 1 (continued)

$$\lambda = \frac{L(\mu_0)}{L(\hat{\mu})} = \frac{(2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum(x_i-\mu_0)^2}{2\sigma^2}}}{(2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum(x_i-\bar{x})^2}{2\sigma^2}}} = e^{-\frac{\sum(x_i-\mu_0)^2 - \sum(x_i-\bar{x})^2}{2\sigma^2}}$$

Writing $x_i - \mu_0 = (x_i - \bar{x}) + (\bar{x} - \mu_0)$ and expanding (the cross term $2\sum(x_i-\bar{x})(\bar{x}-\mu_0)$ vanishes),

$$\lambda = e^{-\frac{n(\bar{x}-\mu_0)^2}{2\sigma^2}} = e^{-z_0^2/2}, \qquad \text{where } z_0 = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$$

Thus $\lambda \leq c$ is equivalent to $e^{-z_0^2/2} \leq c$, or $|z_0| \geq c^*$.

So $\alpha = P\left(|z_0| \geq c^*\right)$, and thus $c^* = z_{\alpha/2}$.
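Example 1 shows that the LRT with known σ is the usual two-sided z-test. A sketch on simulated data (all parameter values are assumptions):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data with known sigma; test H0: mu = mu0
rng = np.random.default_rng(3)
sigma, mu0, n = 2.0, 5.0, 30
x = rng.normal(5.8, sigma, size=n)        # true mean differs from mu0 here

xbar = x.mean()
z0 = (xbar - mu0) / (sigma / np.sqrt(n))  # z statistic from the slide
lam = np.exp(-z0**2 / 2)                  # the likelihood ratio lambda

alpha = 0.05
print("lambda =", lam)
print("reject H0:", abs(z0) >= norm.ppf(1 - alpha / 2))
```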

Page 23:

Example 2

Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean θ > 0.

a. Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ is based upon the statistic $Y = \sum X_i$. Obtain the null distribution of Y.

$$L(\theta) = \prod_{i=1}^{n} \frac{e^{-\theta}\theta^{x_i}}{x_i!} = \frac{e^{-n\theta}\theta^{\sum x_i}}{\prod x_i!}$$

$$\ln L(\theta) = -n\theta + \sum x_i \ln\theta - \sum \ln x_i!$$

$$\frac{\partial \ln L(\theta)}{\partial \theta} = \frac{\sum x_i}{\theta} - n = 0 \quad\Longrightarrow\quad \hat{\theta} = \bar{x}$$

So $\hat{\theta} = \bar{x}$ is a maximum since

$$\left.\frac{\partial^2 \ln L(\theta)}{\partial \theta^2}\right|_{\theta=\hat{\theta}} = -\frac{\sum x_i}{\hat{\theta}^2} = -\frac{n}{\hat{\theta}} < 0$$

Thus $\hat{\theta} = \bar{x}$ is the MLE of θ.

Page 24:

Example 2 (continued)

The likelihood ratio test statistic is:

$$\lambda = \frac{L(\theta_0)}{L(\hat{\theta})} = \frac{e^{-n\theta_0}\,\theta_0^{\sum x_i} / \prod x_i!}{e^{-n\hat{\theta}}\,\hat{\theta}^{\sum x_i} / \prod x_i!} = e^{n(\hat{\theta}-\theta_0)}\left(\frac{\theta_0}{\hat{\theta}}\right)^{\sum x_i} = e^{\sum x_i - n\theta_0}\left(\frac{n\theta_0}{\sum x_i}\right)^{\sum x_i}$$

And it is a function of $Y = \sum X_i$. Under $H_0$,

$$X_1, X_2, \ldots, X_n \sim \text{Poisson}(\theta_0) \quad\Longrightarrow\quad Y \sim \text{Poisson}(n\theta_0)$$

Page 25:

Example 2 (continued)

b. For $\theta_0 = 2$ and n = 5, find the significance level of the test that rejects $H_0$ if $y \leq 4$ or $y \geq 17$.

The null distribution of Y is Poisson(10).

$$\alpha = P_{H_0}(Y \leq 4) + P_{H_0}(Y \geq 17) = P_{H_0}(Y \leq 4) + 1 - P_{H_0}(Y \leq 16) = .029 + 1 - .973 = .056$$
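This tail-probability arithmetic is easy to reproduce; a sketch using scipy's Poisson CDF:

```python
from scipy.stats import poisson

n, theta0 = 5, 2
mean0 = n * theta0            # Y ~ Poisson(10) under H0

# alpha = P(Y <= 4) + P(Y >= 17) = P(Y <= 4) + 1 - P(Y <= 16)
alpha = poisson.cdf(4, mean0) + (1 - poisson.cdf(16, mean0))
print(round(alpha, 3))        # 0.056, matching the slide
```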

Page 26:

Composite Null Hypothesis

The likelihood ratio approach has to be modified slightly when the null hypothesis is composite. When testing the null hypothesis $H_0: \mu = \mu_0$ concerning a normal mean when $\sigma^2$ is unknown, the parameter space is

$$\Theta = \{(\mu, \sigma^2): -\infty < \mu < \infty,\ 0 < \sigma^2 < \infty\}$$

which is a subset of $R^2$. The null hypothesis is composite, with

$$\Theta_0 = \{(\mu, \sigma^2): \mu = \mu_0,\ 0 < \sigma^2 < \infty\}$$

Since the null hypothesis is composite, it isn't certain which value of the parameter(s) prevails even under $H_0$. So we take the maximum of the likelihood over $\Theta_0$. The generalized likelihood ratio test statistic is defined as

$$\lambda = \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)} = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})}$$

Page 27:

Example 3

Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a normal distribution with unknown mean and variance. Obtain the likelihood ratio test statistic for testing $H_0: \sigma = \sigma_0$ versus $H_1: \sigma \neq \sigma_0$.

Here $\theta = (\mu, \sigma_0^2)$ under the null, with $\Theta_0 = \{(\mu, \sigma^2): -\infty < \mu < \infty,\ \sigma^2 = \sigma_0^2\}$, and

$$L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}}$$

In Example 1, we found the unrestricted MLE $\hat{\mu} = \bar{x}$. Since

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 \leq \sum_{i=1}^{n}(x_i - \mu)^2 \quad\text{for all } \mu, \qquad\text{so}\qquad L(\bar{x}, \sigma^2 \mid x) \geq L(\mu, \sigma^2 \mid x)$$

we only need to find the value of $\sigma^2$ maximizing $L(\bar{x}, \sigma^2 \mid x)$.

Page 28:

Example 3 (continued)

$$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum(x_i-\mu)^2}{2\sigma^2}$$

$$\frac{\partial \ln L(\bar{x}, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum(x_i-\bar{x})^2}{2\sigma^4} = 0 \quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{\sum(x_i-\bar{x})^2}{n}$$

So $\hat{\sigma}^2$ is a maximum since

$$\left.\frac{\partial^2 \ln L(\bar{x}, \sigma^2)}{\partial(\sigma^2)^2}\right|_{\sigma^2=\hat{\sigma}^2} = \frac{n}{2\hat{\sigma}^4} - \frac{\sum(x_i-\bar{x})^2}{\hat{\sigma}^6} = -\frac{n}{2\hat{\sigma}^4} < 0$$

Thus $\hat{\mu} = \bar{x}$ is the MLE of μ, and $\hat{\sigma}^2 = \frac{\sum(x_i-\bar{x})^2}{n}$ is the MLE of $\sigma^2$. We can also write

$$\hat{\sigma}^2 = \frac{\sum(x_i-\bar{x})^2}{n} = \frac{(n-1)s^2}{n}$$

Page 29:

Example 3 (continued)

$$L(\hat{\theta}) = (2\pi\hat{\sigma}^2)^{-n/2}\, e^{-\frac{\sum(x_i-\bar{x})^2}{2\hat{\sigma}^2}} = \left[\frac{2\pi(n-1)s^2}{n}\right]^{-n/2} e^{-n/2}$$

$$L(\hat{\theta}_0) = (2\pi\sigma_0^2)^{-n/2}\, e^{-\frac{(n-1)s^2}{2\sigma_0^2}}$$

$$\lambda = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})} = \left[\frac{(n-1)s^2}{n\sigma_0^2}\right]^{n/2} e^{n/2}\, e^{-\frac{(n-1)s^2}{2\sigma_0^2}}$$

Page 30:

Example 3 (continued)

Rejection region: $\lambda \leq c$, such that $\alpha = P_{H_0}[\lambda \leq c]$.

Define $u = \dfrac{(n-1)s^2}{\sigma_0^2} \sim \chi^2(n-1)$ under $H_0$, so

$$\lambda = k\, u^{n/2}\, e^{-u/2}, \qquad \text{where } k = (e/n)^{n/2}$$

Let $h(u) = u^{n/2} e^{-u/2}$. Then

$$h'(u) = \frac{n}{2} u^{n/2-1} e^{-u/2} - \frac{1}{2} u^{n/2} e^{-u/2} = \frac{1}{2} u^{n/2-1} e^{-u/2}(n-u) = 0 \quad\Longrightarrow\quad u = n$$

so h increases for u < n and decreases for u > n. Hence $\lambda \leq c$ implies $u \leq c_1$ or $u \geq c_2$, where $c_1$ and $c_2$ are such that

$$P_{H_0}\left(c_1 < \chi^2_{n-1} < c_2\right) = 1 - \alpha$$
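A sketch of the critical values. Note the exact LRT cut-offs would equate h(c1) = h(c2); the equal-tailed chi-square quantiles used below are the common textbook approximation:

```python
from scipy.stats import chi2

n, alpha = 20, 0.05          # assumed sample size and level
df = n - 1

# Equal-tailed choice of c1 < c2 with P(c1 < chi2_{n-1} < c2) = 1 - alpha
c1 = chi2.ppf(alpha / 2, df)
c2 = chi2.ppf(1 - alpha / 2, df)
print(c1, c2)  # reject H0 if u = (n-1) s^2 / sigma0^2 falls outside (c1, c2)
```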

Page 31:

15.3 Bayesian Inference

Thomas Bayes (c. 1702 - April 17, 1761) was a Presbyterian minister and a mathematician born in London who developed a special case of Bayes' theorem, which was published and studied after his death.

Bayesian inference refers to a statistical inference where new facts are presented and used to draw updated conclusions from a prior belief. The term 'Bayesian' stems from the well-known Bayes' theorem, which was first derived by Reverend Thomas Bayes.

Source: www.wikipedia.com

Bayes' Theorem (review):

$$f(A \mid B) = \frac{f(A \cap B)}{f(B)} = \frac{f(B \mid A)\, f(A)}{f(B)} \qquad (15.1)$$

since $f(A \cap B) = f(B \cap A) = f(B \mid A)\, f(A)$.

Page 32:

Some Key Terms in Bayesian Inference... in plain English

• prior distribution - the probability distribution of an uncertain quantity θ that expresses previous knowledge of θ (for example, from past experience), in the absence of the current evidence.

• posterior distribution - the distribution that takes the evidence into account; it is the conditional distribution of θ given the data. The posterior probability is computed from the prior and the likelihood function using Bayes' theorem.

• posterior mean - the mean of the posterior distribution.

• posterior variance - the variance of the posterior distribution.

• conjugate priors - a family of prior probability distributions whose key property is that the posterior probability distribution also belongs to the same family as the prior.

Page 33:

15.3.1 Bayesian Estimation

So far we've learned that the Bayesian approach treats θ as a random variable, and the data are then used to update the prior distribution to obtain the posterior distribution of θ. Now let's move on to how we can estimate parameters using this approach (using the text's notation).

Let θ be an unknown parameter to be estimated from a random sample $x_1, x_2, \ldots, x_n$ from a distribution with p.d.f./p.m.f. $f(x \mid \theta)$. Let π(θ) be the prior distribution of θ, and let $\pi^*(\theta \mid x_1, x_2, \ldots, x_n)$ be the posterior distribution. Note that $\pi^*(\theta \mid x_1, x_2, \ldots, x_n)$ is the conditional distribution of θ given the observed data $x_1, x_2, \ldots, x_n$.

If we apply Bayes' theorem (Eq. 15.1), the posterior distribution becomes

$$\pi^*(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\, \pi(\theta)}{f^*(x_1, x_2, \ldots, x_n)} \qquad (15.2)$$

where $f^*(x_1, x_2, \ldots, x_n) = \int f(x_1, x_2, \ldots, x_n \mid \theta)\, \pi(\theta)\, d\theta$ is the marginal p.d.f. of $X_1, X_2, \ldots, X_n$.

Page 34:

Bayesian Estimation (continued)

As seen in equation 15.2, the posterior distribution represents what is known about θ after observing the data $X = x_1, x_2, \ldots, x_n$. From earlier chapters, we know that the likelihood of θ is $f(X \mid \theta)$. So, to get a better idea of the posterior distribution, we note that:

posterior distribution ∝ likelihood × prior distribution, i.e., $\pi^*(\theta \mid X) \propto f(X \mid \theta)\, \pi(\theta)$

For a detailed practical example of deriving the posterior mean and using Bayesian estimation, visit: http://www.stat.berkeley.edu/users/rice/Stat135/Bayes.pdf

Page 35:

Example 15.26

Let x be the number of successes from n i.i.d. Bernoulli trials with unknown success probability p = θ. Show that the beta distribution is a conjugate prior on θ.

Goal:

$$\pi^*(\theta) = \pi(\theta \mid x) = \frac{f(x, \theta)}{f(x)} = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}$$

where $f(x \mid \theta)\, \pi(\theta) = f(x, \theta)$ and $f(x) = \int_{-\infty}^{\infty} f(x, \theta)\, d\theta = \int_{-\infty}^{\infty} f(x \mid \theta)\, \pi(\theta)\, d\theta$.

Page 36:

Example 15.26 (continued)

X has a binomial distribution with parameters n and p = θ:

$$f(x \mid \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}, \qquad x = 0, 1, \ldots, n$$

The prior distribution of θ is the beta distribution, for $0 \leq \theta \leq 1$:

$$\pi(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1}(1-\theta)^{b-1}$$

$$f(x, \theta) = f(x \mid \theta)\, \pi(\theta) = \binom{n}{x} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{x+a-1}(1-\theta)^{n-x+b-1}$$

$$f(x) = \int_0^1 f(x, \theta)\, d\theta = \binom{n}{x} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \cdot \frac{\Gamma(a+x)\,\Gamma(n+b-x)}{\Gamma(n+a+b)}$$

Page 37:

Example 15.26 (continued)

$$\pi^*(\theta) = \pi(\theta \mid x) = \frac{f(x, \theta)}{f(x)} = \frac{\Gamma(n+a+b)}{\Gamma(x+a)\,\Gamma(n-x+b)}\, \theta^{x+a-1}(1-\theta)^{n-x+b-1}$$

It is a beta distribution with parameters (x+a) and (n-x+b)!
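A sketch of this conjugate update (prior parameters and data are hypothetical):

```python
from scipy.stats import beta

a, b = 2, 3        # assumed prior Beta(a, b); m = a + b = 5 prior observations
n, x = 10, 6       # assumed data: x successes in n Bernoulli trials

posterior = beta(a + x, b + n - x)   # Beta(x + a, n - x + b), per the slide
print(posterior.mean())              # posterior mean (a + x)/(m + n) = 8/15
print(posterior.interval(0.95))      # a 95% posterior credible interval
```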

Page 38:

Notes:

1. The parameters a and b of the prior distribution may be interpreted as prior successes and prior failures, with m = a + b being the total number of prior observations. After actually observing x successes and n - x failures in n i.i.d. Bernoulli trials, these parameters are updated to a + x and b + n - x, respectively.

2. The prior and posterior means are, respectively,

$$\frac{a}{m} \qquad\text{and}\qquad \frac{a+x}{m+n}$$

Page 39:

15.3.2 Bayesian Testing

Test $H_0: \theta = \theta_0$ versus $H_a: \theta = \theta_a$.

Assumption:

$$\pi_0^* = \pi^*(\theta_0) = P(\theta = \theta_0 \mid x), \qquad \pi_a^* = \pi^*(\theta_a) = P(\theta = \theta_a \mid x), \qquad \pi_0^* + \pi_a^* = 1$$

If $\dfrac{\pi_a^*}{\pi_0^*} > k$, we reject $H_0$ in favor of $H_a$, where k > 0 is a suitably chosen critical constant.
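A sketch for two simple hypotheses about a binomial p (the equal prior weights, the data, and k = 1 are all assumptions; k = 1 rejects whenever Ha is more probable a posteriori):

```python
from scipy.stats import binom

theta0, theta_a = 0.5, 0.7          # the two simple hypotheses (assumed)
prior0 = prior_a = 0.5              # assumed equal prior probabilities
n, x = 20, 15                       # hypothetical data

m0 = binom.pmf(x, n, theta0) * prior0
ma = binom.pmf(x, n, theta_a) * prior_a
pi0, pia = m0 / (m0 + ma), ma / (m0 + ma)   # posterior probabilities

k = 1.0                             # assumed critical constant
print(pia / pi0, "reject H0:", pia / pi0 > k)
```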

Page 40:

Abraham Wald (1902-1950) was the founder of statistical decision theory. His goal was to provide a unified theoretical framework for diverse problems, i.e., point estimation, confidence interval estimation, and hypothesis testing.

Source: http://www-history.mcs.st-andrews.ac.uk/history/PictDisplay/Wald.html

Page 41:

Statistical Decision Problem

The goal is to choose a decision d from a set of possible decisions D, based on a sample outcome (data) x.

• Decision space: D.
• Sample space: the set X of all sample outcomes x.
• Decision rule: δ is a function δ(x) which assigns to every sample outcome x ∈ X a decision d ∈ D.

Page 42:

Continued... Denote by X the r.v. corresponding to x, and the probability distribution of X by $f(x \mid \theta)$. This distribution depends on an unknown parameter θ belonging to a parameter space Θ.

Suppose one chooses a decision d when the true parameter is θ; then a loss L(d, θ), given by the loss function, is incurred.

A decision rule is assessed by evaluating its expected loss, called the risk function:

$$R(\delta, \theta) = E\left[L(\delta(X), \theta)\right] = \int_{\mathcal{X}} L(\delta(x), \theta)\, f(x \mid \theta)\, dx$$

Page 43:

Example

Calculate and compare the risk functions, for the squared error loss, of two estimators of the success probability p from n i.i.d. Bernoulli trials. The first is the usual sample proportion of successes and the second is the Bayes estimator from Example 15.26:

$$\hat{p}_1 = \frac{X}{n} \qquad\text{and}\qquad \hat{p}_2 = \frac{a + X}{m + n}$$
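A sketch comparing the two risk functions (prior parameters are assumed). For squared error loss, risk = variance + bias²; $\hat{p}_1$ is unbiased, while $\hat{p}_2$ trades bias for lower variance near the prior mean a/m:

```python
import numpy as np

n = 20
a, b = 2, 2                  # assumed prior parameters
m = a + b

p = np.linspace(0.01, 0.99, 5)
# Risk of p1 = X/n: variance only, since it is unbiased
risk1 = p * (1 - p) / n
# Risk of p2 = (a + X)/(m + n): variance + bias^2,
# with Var = n p (1-p) / (m+n)^2 and bias = (a - m p) / (m+n)
risk2 = (n * p * (1 - p) + (a - m * p)**2) / (m + n)**2
print(np.round(risk1, 4))
print(np.round(risk2, 4))    # the Bayes estimator wins near p = a/m
```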

Page 44:

Von Neumann (1928): Minimax

Source: http://jeff560.tripod.com/

Page 45:

How Minimax Works

• Focuses on risk avoidance
• Can be applied to both zero-sum and non-zero-sum games
• Can be applied to multi-stage games
• Can be applied to multi-person games

Page 46:

Classic Example: The Prisoner's Dilemma

Each player evaluates his/her alternatives, attempting to minimize his/her own risk. From a common-sense standpoint, a sub-optimal equilibrium results.

                        | Prisoner B stays silent                           | Prisoner B betrays
Prisoner A stays silent | Both serve six months                             | Prisoner A serves ten years; Prisoner B goes free
Prisoner A betrays      | Prisoner A goes free; Prisoner B serves ten years | Both serve two years

Page 47:

Classic Example: With Probabilities

Two-player game with simultaneous moves, where the probabilities with which player 2 acts are known to both players. Payoffs to player 1:

Player 1 \ Player 2 | Action A [P(A)=p] | Action B [P(B)=q] | Action C [P(C)=r] | Action D [P(D)=1-p-q-r]
Action A            | -1                | 1                 | -2                | 4
Action B            | -2                | 7                 | 1                 | 1
Action C            | 0                 | -1                | 0                 | 3
Action D            | 1                 | 0                 | 2                 | 3

When disregarding the probabilities when playing the game, (D, B) is the equilibrium point under minimax. With probabilities p = q = r = 1/4, player 1 will choose B. This is...
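A sketch reproducing both choices from the payoff table (the Bayes player maximizes expected payoff under the stated probabilities; minimax guards against the worst column):

```python
import numpy as np

# Player 1's payoffs from the slide (rows: actions A-D; columns: player 2's A-D)
payoff = np.array([[-1,  1, -2, 4],
                   [-2,  7,  1, 1],
                   [ 0, -1,  0, 3],
                   [ 1,  0,  2, 3]])
actions = ["A", "B", "C", "D"]

# Minimax (maximin for payoffs): pick the row with the best worst case
worst = payoff.min(axis=1)
print("minimax choice:", actions[int(worst.argmax())])   # D

# Bayes: maximize expected payoff under p = q = r = 1/4
probs = np.array([0.25, 0.25, 0.25, 0.25])
expected = payoff @ probs
print("Bayes choice:", actions[int(expected.argmax())])  # B
```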

Page 48:

...how Bayes works

View $\{(p_i, q_i, r_i)\}$ as $\theta_i$, where i = 1 in the previous example. Letting i run over 1, ..., n, we get a much better idea of what Bayes meant by "states of nature" and how the probabilities of each state enter into one's strategy.

Page 49:

Conclusion

We covered three theoretical approaches in our presentation:

• Likelihood provides statistical justification for many of the methods used in statistics; the MLE is the method used to make inferences about the parameters of the underlying probability distribution of a given data set.
• Bayesian theory: probabilities are associated with individual events or statements rather than with sequences of events.
• Decision theory: describes and rationalizes the process of decision making, that is, making a choice among several possible alternatives.

Sources: http://www.answers.com/maximum%20likelihood, http://www.answers.com/bayesian%20theory, http://www.answers.com/decision%20theory

Page 50:

The End. Any questions for the group?