STA 250/MTH 342 Intro to Mathematical Statistics
Lecture 3
-
HW2 due on Thursday 9/11 at the beginning of class.
-
Last class
Bayes Theorem and the ideal approach to inference.
Example: A political poll (a binomial experiment).
-
Political poll example revisited
An organization randomly selected 100 Democrats and interviewed them about whether they support the incumbent governor. Out of the 100 polled, 40 support the governor and 60 are against. What can we say about the actual support rate of the governor?
Our model (or assumptions/hypotheses) is:
(1) Given $\theta$, $X$ is Binomial(100, $\theta$).
(2) $\theta$ is Uniform(0,1) a priori, representing our knowledge before data is observed.
Based on these two pieces, Bayes Theorem allows us to use the observed data to update our knowledge about the parameter.
(3) $\theta$ is Beta(41, 61) a posteriori, representing our updated knowledge after data is observed.
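As a quick numerical sketch (not part of the original handout, and assuming scipy is available), this update can be reproduced directly; the Uniform(0,1) prior is the special case Beta(1,1):

```python
# Sketch: Beta-Binomial update for the poll (Beta(1,1) = Uniform(0,1)).
from scipy.stats import beta

a0, b0 = 1, 1                          # uniform prior
n, x = 100, 40                         # 100 polled, 40 support
posterior = beta(a0 + x, b0 + n - x)   # Beta(41, 61)

print(posterior.mean())                # 41/102, about 0.402
```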
-
Our analysis up to this point uses Uniform(0,1) to represent our prior knowledge about $\theta$.
More flexible choices for $\pi(\theta)$ can allow richer prior knowledge about the parameter to be incorporated into the analysis.
-
A richer class of priors for Binomial experiments
A flexible and convenient choice of the prior distribution for $\theta$ is the Beta($\alpha$, $\beta$) family of distributions:
$$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \quad \text{for } 0 < \theta < 1.$$
-
[Figure: density curves of the Beta($\alpha$, $\beta$) distribution for various values of $\alpha$ and $\beta$. Source: Wikipedia, http://upload.wikimedia.org/wikipedia/commons/f/f3/Beta_distribution_pdf.svg]
-
Bayesian inference using a Beta($\alpha$, $\beta$) prior
$$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = C\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1} \quad \text{for } 0 < \theta < 1.$$
-
A trick to get the posterior density
The part that depends on $\theta$,
$$\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1} \quad \text{for } 0 < \theta < 1,$$
is the kernel of a Beta($x+\alpha$, $n-x+\beta$) density. Since the posterior density is proportional to this kernel and must integrate to 1, the posterior is exactly Beta($x+\alpha$, $n-x+\beta$); there is no need to compute the normalizing constant directly.
-
A quick sum-up
To make inference on a Binomial experiment using a Beta prior, we can proceed as follows:
1. We model the distribution of the number of successes $X$, conditional on the success probability $\theta$, as Binomial($n$, $\theta$).
2. We then model the prior distribution $\pi(\theta)$ as a Beta($\alpha$, $\beta$) density. The values of $\alpha$ and $\beta$ are chosen to represent our prior knowledge about the parameter.
3. After observing $X = x$, Bayes Theorem tells us that the posterior distribution of $\theta$ is Beta($x + \alpha$, $n - x + \beta$).
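The recipe is short enough to wrap in a helper function. The sketch below is illustrative (the name beta_binomial_posterior is mine, not from the handout, and scipy is assumed):

```python
# Sketch: the three-step Beta-Binomial recipe as one helper.
from scipy.stats import beta

def beta_binomial_posterior(a, b, n, x):
    """Posterior of theta after x successes in n trials, starting from Beta(a, b)."""
    return beta(a + x, b + n - x)

post = beta_binomial_posterior(1, 1, n=100, x=40)   # the poll: Beta(41, 61)
```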
-
The three steps in all Bayesian statistical analysis
This example illustrates the general process of all Bayesian statistical analysis:
1. Model the distribution of the data $X$ conditional on a set of parameters $\theta$.
2. Choose a prior distribution $\pi(\theta)$ for the parameters, and
3. Apply Bayes Theorem to find the posterior distribution $\pi(\theta \mid x)$.
This posterior distribution is a probabilistic summary of our knowledge about the parameters given the data. It allows us to make probabilistic statements about the parameters such as the following.
Given the data, the probability for $\theta$ to be in (0.7, 0.8) is . . . .
Given the data, the expected value of $\theta$ is . . . .
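For the poll's Beta(41, 61) posterior, for example, both statements can be filled in numerically (a sketch assuming scipy):

```python
# Sketch: evaluating the two posterior statements for Beta(41, 61).
from scipy.stats import beta

posterior = beta(41, 61)
prob_07_08 = posterior.cdf(0.8) - posterior.cdf(0.7)  # P(0.7 < theta < 0.8 | data)
post_mean = posterior.mean()                          # E(theta | data) = 41/102

print(prob_07_08, post_mean)
```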
-
Choosing a prior distribution to represent prior knowledge
Suppose that before the experiment is carried out, based on historical background, we think that the actual proportion of supporters is about 0.5, give or take 0.1.
Here by "give or take" I mean that we are willing to assume that the prior standard deviation of $\theta$ is about 0.1. (Note that this is a probabilistic statement about $\theta$!)
Which $\alpha$ and $\beta$ values should we choose so that our prior Beta($\alpha$, $\beta$) will represent this knowledge?
-
We can choose the values for $\alpha$ and $\beta$ so that
$$\bar\theta = \frac{\alpha}{\alpha+\beta} = .5$$
and
$$\sigma^2 = \frac{\bar\theta(1-\bar\theta)}{\alpha+\beta+1} = .1^2.$$
Solving these two equations we get
$$\alpha = \beta = 12.$$
Therefore we choose the Beta(12, 12) distribution as our prior distribution for $\theta$.
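The same moment-matching works for any prior mean $m$ and standard deviation $s$: the variance equation gives $\alpha + \beta = m(1-m)/s^2 - 1$, and the mean equation splits this total into $\alpha = m(\alpha+\beta)$ and $\beta = (1-m)(\alpha+\beta)$. A small sketch (the helper name is mine):

```python
# Sketch: choose Beta(a, b) to match a prior mean m and prior sd s.
def beta_from_mean_sd(m, s):
    total = m * (1 - m) / s**2 - 1    # a + b, from the variance equation
    return m * total, (1 - m) * total

print(beta_from_mean_sd(0.5, 0.1))    # (12.0, 12.0)
```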
-
From prior to posterior
[Figure: two density curves over $\theta \in (0, 1)$: the prior $f(\theta) = $ Beta(12, 12) and the posterior $f(\theta \mid x) = $ Beta(52, 72).]
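A sketch for reproducing this figure (assuming numpy, scipy, and matplotlib are available):

```python
# Sketch: prior Beta(12,12) vs. posterior Beta(52,72) on the same axes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0, 1, 500)
plt.plot(theta, beta(12, 12).pdf(theta), label="f(theta) = Beta(12,12)")
plt.plot(theta, beta(52, 72).pdf(theta), label="f(theta|x) = Beta(52,72)")
plt.xlabel("theta")
plt.legend()
plt.show()
```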
-
Summarizing the effect of data on the distribution of $\theta$
The center of the distribution is shifted.
The spread of the distribution is also changed.
Let us quantify these two changes.
-
Before observing data, the prior distribution is Beta($\alpha$, $\beta$). So
$$E(\theta) = \frac{\alpha}{\alpha+\beta} = \bar\theta, \qquad \mathrm{Var}(\theta) = \frac{\bar\theta(1-\bar\theta)}{\alpha+\beta+1}.$$
After observing the data, $X = x$, the posterior distribution is Beta($\alpha + x$, $\beta + n - x$), and so
$$E(\theta \mid X = x) = \frac{\alpha + x}{\alpha+\beta+n} = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n} = \bar\theta_{|x}$$
and
$$\mathrm{Var}(\theta \mid X = x) = \frac{\bar\theta_{|x}(1-\bar\theta_{|x})}{\alpha+\beta+n+1}.$$
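Plugging in the running example ($\alpha = \beta = 12$, $n = 100$, $x = 40$) gives a quick check of these formulas (a sketch):

```python
# Sketch: prior vs. posterior mean and variance for the Beta(12,12) poll prior.
a, b, n, x = 12, 12, 100, 40

prior_mean = a / (a + b)                                   # 0.5
prior_var  = prior_mean * (1 - prior_mean) / (a + b + 1)   # 0.01, i.e. sd 0.1

post_mean = (a + x) / (a + b + n)                          # 52/124, about 0.419
post_var  = post_mean * (1 - post_mean) / (a + b + n + 1)  # about 0.0019
```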
-
An interpretation of the posterior mean
$$E(\theta \mid X = x) = \frac{\alpha+\beta}{\alpha+\beta+n}\,\bar\theta + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}$$
This is a weighted average of the prior mean $\bar\theta$ and the observed average $x/n$.
This is a compromise between our prior expectation (when there is no data) and the observed average.
When we have a lot of data, i.e. $n$ is very large compared to $\alpha+\beta$, the observed average $x/n$ dominates: $E(\theta \mid X = x) \approx x/n$.
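A sketch of this effect, holding the observed proportion at 0.4 while $n$ grows (with the Beta(12, 12) prior):

```python
# Sketch: the posterior mean moves from the prior mean 0.5 toward x/n = 0.4.
a, b = 12, 12
for n in (10, 100, 1000, 10000):
    x = int(0.4 * n)
    print(n, (a + x) / (a + b + n))   # 0.4706, 0.4194, 0.4023, 0.4002
```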
-
An interpretation of the posterior mean
$$E(\theta \mid X = x) = \frac{\alpha+\beta}{\alpha+\beta+n}\,\bar\theta + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}$$
When $\alpha+\beta$ is much larger than $n$, the prior expectation dominates: $E(\theta \mid X = x) \approx \bar\theta$.
$\alpha+\beta$ is the "prior sample size": it measures the amount of information we have about $\theta$.
$n$ is the observed sample size: it measures the amount of information we have in the data.
-
As for the variance
Before the data are observed,
$$\mathrm{Var}(\theta) = \frac{\bar\theta(1-\bar\theta)}{\alpha+\beta+1}.$$
For the same $\bar\theta$, the larger $\alpha+\beta$ is, the smaller $\mathrm{Var}(\theta)$ is: we are more certain about the value of $\theta$.
After the data are observed,
$$\mathrm{Var}(\theta \mid X = x) = \frac{\bar\theta_{|x}(1-\bar\theta_{|x})}{\alpha+\beta+n+1}.$$
For the same $\bar\theta_{|x}$, the larger the total amount of information $\alpha+\beta+n$ is, the smaller $\mathrm{Var}(\theta \mid X = x)$ is.
-
The general applicability of Bayes Theorem
Recall the two pieces we need to apply Bayes Theorem:
1. The distribution of the data conditional on the state of nature, $f(x \mid \theta)$.
2. The prior distribution $\pi(\theta)$ on the state of nature.
Bayes Theorem provides a probabilistic recipe for inference through learning the posterior distribution of the state of nature given the data, $\pi(\theta \mid x)$.
We have seen a particular example where $\pi(\theta)$ is Beta($\alpha$, $\beta$) and $f(x \mid \theta)$ is Binomial.
This scheme for inference applies generally to all choices of $f(x \mid \theta)$ and $\pi(\theta)$.
Let us now look at another example.
-
Example: Measuring air quality with an imperfect device
Suppose we want to measure the amount (in terms of density) of a certain pollutant in the air.
We have a device that is accurate on average.
However, in any single reading it may be off by an unpredictable amount, due to factors such as temperature, humidity, and the way the device is held.
In formal terms, let
$\theta$ be the actual amount of the pollutant in the air, and
$X$ be a reading of the device, which given $\theta$ is a random quantity that is close to but not exactly $\theta$.
Goal of inference: What is $\theta$?
-
The Bayesian procedure
Let us carry out a Bayesian inference just as we did for the political
poll example. Again, we need to specify the two pieces:
1. The conditional distribution of the data given the parameter (or state of nature).
2. A prior distribution $\pi(\theta)$ for the state of nature that summarizes our knowledge about $\theta$ before observing data.
-
What may be a good model for the reading of the device given $\theta$?
We can model the error made by the device on each reading as normally distributed with standard deviation $\sigma$. For simplicity, let us assume that this $\sigma$ is known. The user's manual will typically give the technical specs of the device, including its precision.
For example, $\sigma = 2$. Under this model, the conditional density for $X = x$ given $\theta$ is
$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}} \quad \text{for } -\infty < x < \infty.$$
-
Choosing the prior distribution for $\theta$
We can model our prior knowledge of the actual amount $\theta$ also as a Normal($\mu$, $\tau^2$) distribution.
The mean of the prior, $\mu$, can be chosen as the historical average amount, e.g. $\mu = 100$.
We can choose $\tau$ so that the historical records of the amount fall in the range $\mu \pm \tau$ about 2/3 of the time.
For example, if historically the amount of the pollutant falls in the range $100 \pm 10$ about 2/3 of the time, then we may choose $\mu = 100$ and $\tau = 10$.
-
Now that we have the two pieces, inference again is an application of
Bayes theorem.
$$\begin{aligned}
\pi(\theta \mid x) &\propto \pi(\theta)\, f(x \mid \theta) \\
&= \frac{1}{\sqrt{2\pi}\,\tau}\, e^{-\frac{(\theta-\mu)^2}{2\tau^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}} \\
&= \frac{1}{2\pi\sigma\tau}\, e^{-\left[\frac{(\theta-\mu)^2}{2\tau^2} + \frac{(x-\theta)^2}{2\sigma^2}\right]} \\
&\propto e^{-\frac{1}{2}\left[\frac{(\theta-\mu)^2}{\tau^2} + \frac{(x-\theta)^2}{\sigma^2}\right]}.
\end{aligned}$$
What is this distribution?
-
Note that
$$\begin{aligned}
\frac{(\theta-\mu)^2}{\tau^2} + \frac{(x-\theta)^2}{\sigma^2}
&= \theta^2\left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right) - 2\theta\left(\frac{\mu}{\tau^2}+\frac{x}{\sigma^2}\right) + \frac{\mu^2}{\tau^2} + \frac{x^2}{\sigma^2} \\
&= \left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\left[\theta^2 - 2\theta\,\frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2} + \left(\frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2}\right)^2\right] + \text{Const} \\
&= \left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\left(\theta - \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2}\right)^2 + \text{Const} \\
&= \frac{(\theta-\tilde\mu)^2}{\tilde\tau^2} + \text{Const},
\end{aligned}$$
where
$$\tilde\mu = \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2} \quad \text{and} \quad \tilde\tau^2 = \frac{1}{1/\tau^2 + 1/\sigma^2}.$$
-
Thus
$$\pi(\theta \mid x) \propto e^{-\frac{(\theta-\tilde\mu)^2}{2\tilde\tau^2}} \quad \text{for } -\infty < \theta < \infty,$$
which we recognize as the kernel of a normal density. That is, the posterior distribution of $\theta$ given $X = x$ is Normal($\tilde\mu$, $\tilde\tau^2$).
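The two posterior parameters translate directly into code (a sketch; the helper name normal_posterior is mine):

```python
# Sketch: Normal-Normal posterior from one reading x; prior N(mu, tau^2), noise sd sigma.
def normal_posterior(mu, tau, x, sigma):
    precision = 1 / tau**2 + 1 / sigma**2               # = 1 / tau_tilde^2
    mu_tilde = (mu / tau**2 + x / sigma**2) / precision
    return mu_tilde, 1 / precision                      # (mu_tilde, tau_tilde^2)
```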
-
Note that the posterior expectation is again a weighted average of the prior expectation $\mu$ and the observed data $x$:
$$E(\theta \mid X = x) = \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}\,\mu + \frac{1/\sigma^2}{1/\tau^2 + 1/\sigma^2}\,x.$$
The weights are determined by the relative sizes of $1/\tau^2$ (which measures the amount of information about $\theta$ in the prior) and $1/\sigma^2$ (which measures the amount of information about $\theta$ in the data).
For fixed $\sigma^2$, as $\tau^2 \to 0$, we have $E(\theta \mid X = x) \to \mu$ and $\mathrm{Var}(\theta \mid X = x) \to 0$. We are a priori certain that $\theta = \mu$.
For fixed $\tau^2$, as $\sigma^2 \to 0$, we have $E(\theta \mid X = x) \to x$ and $\mathrm{Var}(\theta \mid X = x) \to 0$. We have a perfect device.
-
The posterior variance
$$\mathrm{Var}(\theta \mid X = x) = \tilde\tau^2 = \frac{1}{1/\tau^2 + 1/\sigma^2}$$
is smaller than both $\tau^2$ and $\sigma^2$. We can think of $1/\tilde\tau^2$ as a measure of the information we have about $\theta$. The total information is the sum of the prior information and the information from the data:
$$\frac{1}{\tilde\tau^2} = \frac{1}{\tau^2} + \frac{1}{\sigma^2}.$$
-
For our ongoing example
Suppose we observe $X = 105$ and we know that $\sigma = 2$. Then
$$E(\theta \mid X = 105) = \frac{1/10^2}{1/10^2 + 1/2^2}\cdot 100 + \frac{1/2^2}{1/10^2 + 1/2^2}\cdot 105 \approx 104.8$$
$$\mathrm{Var}(\theta \mid X = 105) = \frac{1}{1/10^2 + 1/2^2} \approx 3.85.$$
If instead $\tau = 1$ (so we are a lot more certain a priori),
$$E(\theta \mid X = 105) = \frac{1/1^2}{1/1^2 + 1/2^2}\cdot 100 + \frac{1/2^2}{1/1^2 + 1/2^2}\cdot 105 = 101$$
$$\mathrm{Var}(\theta \mid X = 105) = \frac{1}{1/1^2 + 1/2^2} = 0.8.$$
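Both scenarios are easy to verify with the normal_posterior sketch from above:

```python
# Sketch: checking the two scenarios numerically.
print(normal_posterior(mu=100, tau=10, x=105, sigma=2))  # (104.8..., 3.85...)
print(normal_posterior(mu=100, tau=1,  x=105, sigma=2))  # (101.0, 0.8)
```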
-
Bayes Theorem for multiple independent observations
Now suppose that instead of taking one reading from the device, we take $n$ independent readings $X_1, X_2, \ldots, X_n$. That is, conditional on $\theta$, the $X_i$'s are independent identically distributed (i.i.d.) $N(\theta, \sigma^2)$ random variables. That is,
$$f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{\sum_{i=1}^n (x_i-\theta)^2}{2\sigma^2}}.$$
Then Bayes Theorem applies just as before:
$$\pi(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)}{\int f(x_1, x_2, \ldots, x_n \mid u)\,\pi(u)\,du} \propto f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta) = \pi(\theta)\prod_{i=1}^n f(x_i \mid \theta).$$
-
Exercise
If we still use $N(\mu, \tau^2)$ as the prior for the parameter $\theta$, show that given $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, the posterior distribution for $\theta$ is $N(\tilde\mu, \tilde\tau^2)$ with
$$\tilde\mu = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\,\mu + \frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\,\bar{x},$$
where
$$\bar{x} = \frac{\sum_{i=1}^n x_i}{n}$$
is the sample average, and
$$\tilde\tau^2 = \frac{1}{1/\tau^2 + n/\sigma^2}.$$
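As a numerical companion to the exercise (a sketch, not its derivation): with $n$ i.i.d. readings, the data precision becomes $n/\sigma^2$ and the data enter only through the sample average $\bar x$:

```python
# Sketch: Normal-Normal posterior from n i.i.d. readings; prior N(mu, tau^2).
def normal_posterior_n(mu, tau, xs, sigma):
    n = len(xs)
    xbar = sum(xs) / n                         # sample average
    precision = 1 / tau**2 + n / sigma**2      # = 1 / tau_tilde^2
    mu_tilde = (mu / tau**2 + n * xbar / sigma**2) / precision
    return mu_tilde, 1 / precision

print(normal_posterior_n(100, 10, [105, 103, 107], 2))
```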
-
Next ...
Conjugate models.
Point estimation.