STA 250/MTH 342 Intro to Mathematical Statistics
Lecture 3
-
HW2 due on Thursday 9/11 at the beginning of class.
-
Last class
Bayes Theorem and the ideal approach to inference.
Example: A political poll (a binomial experiment).
-
Political poll example revisited
An organization randomly selected 100 Democrats and interviewed them about whether they support the incumbent governor. Out of the 100 polled, 40 support the governor and 60 are against. What can we say about the actual support rate of the governor?
Our model (or assumptions/hypotheses) is:
(1) Given $\theta$, $X$ is Binomial(100, $\theta$).
(2) $\theta$ is Uniform(0,1) a priori, representing our knowledge before data is observed.
Based on these two pieces, Bayes Theorem allows us to use the observed data to update our knowledge about the parameter.
(3) $\theta$ is Beta(41, 61) a posteriori, representing our updated knowledge after data is observed.
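As a quick numerical sketch (not part of the original handout, and assuming scipy is available), this update can be reproduced directly; the Uniform(0,1) prior is the special case Beta(1,1):

```python
# Sketch: Beta-Binomial update for the poll (Beta(1,1) = Uniform(0,1)).
from scipy.stats import beta

a0, b0 = 1, 1                          # uniform prior
n, x = 100, 40                         # 100 polled, 40 support
posterior = beta(a0 + x, b0 + n - x)   # Beta(41, 61)

print(posterior.mean())                # 41/102, about 0.402
```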
-
Our analysis up to this point uses Uniform(0,1) to represent our prior knowledge about $\theta$.
More flexible choices for $\pi(\theta)$ can allow richer prior knowledge about the parameter to be incorporated into the analysis.
-
A richer class of priors for Binomial experiments
A flexible and convenient choice of the prior distribution for $\theta$ is the Beta($\alpha$, $\beta$) family of distributions:
$$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \quad \text{for } 0 < \theta < 1.$$
-
[Figure: density curves of the Beta($\alpha$, $\beta$) distribution for various values of $\alpha$ and $\beta$. Source: Wikipedia, http://upload.wikimedia.org/wikipedia/commons/f/f3/Beta_distribution_pdf.svg]
-
Bayesian inference using a Beta($\alpha$, $\beta$) prior
$$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = C\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1} \quad \text{for } 0 < \theta < 1.$$
-
A trick to get the posterior density
The part that depends on $\theta$,
$$\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1} \quad \text{for } 0 < \theta < 1,$$
is the kernel of a Beta($x+\alpha$, $n-x+\beta$) density. Since the posterior density is proportional to this kernel and must integrate to 1, the posterior is exactly Beta($x+\alpha$, $n-x+\beta$); there is no need to compute the normalizing constant directly.
-
A quick sum-up
To make inference on a Binomial experiment using a Beta prior, we can proceed as follows:
1. We model the distribution of the number of successes $X$, conditional on the success probability $\theta$, as Binomial($n$, $\theta$).
2. We then model the prior distribution $\pi(\theta)$ as a Beta($\alpha$, $\beta$) density. The values of $\alpha$ and $\beta$ are chosen to represent our prior knowledge about the parameter.
3. After observing $X = x$, Bayes Theorem tells us that the posterior distribution of $\theta$ is Beta($x + \alpha$, $n - x + \beta$).
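The recipe is short enough to wrap in a helper function. The sketch below is illustrative (the name beta_binomial_posterior is mine, not from the handout, and scipy is assumed):

```python
# Sketch: the three-step Beta-Binomial recipe as one helper.
from scipy.stats import beta

def beta_binomial_posterior(a, b, n, x):
    """Posterior of theta after x successes in n trials, starting from Beta(a, b)."""
    return beta(a + x, b + n - x)

post = beta_binomial_posterior(1, 1, n=100, x=40)   # the poll: Beta(41, 61)
```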
-
The three steps in all Bayesian statistical analysis
This example illustrates the general process of all Bayesian statistical analysis:
1. Model the distribution of the data $X$ conditional on a set of parameters $\theta$.
2. Choose a prior distribution $\pi(\theta)$ for the parameters, and
3. Apply Bayes Theorem to find the posterior distribution $\pi(\theta \mid x)$.
This posterior distribution is a probabilistic summary of our knowledge about the parameters given the data. It allows us to make probabilistic statements about the parameters such as the following.
Given the data, the probability for $\theta$ to be in (0.7, 0.8) is . . . .
Given the data, the expected value of $\theta$ is . . . .
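For the poll's Beta(41, 61) posterior, for example, both statements can be filled in numerically (a sketch assuming scipy):

```python
# Sketch: evaluating the two posterior statements for Beta(41, 61).
from scipy.stats import beta

posterior = beta(41, 61)
prob_07_08 = posterior.cdf(0.8) - posterior.cdf(0.7)  # P(0.7 < theta < 0.8 | data)
post_mean = posterior.mean()                          # E(theta | data) = 41/102

print(prob_07_08, post_mean)
```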
-
Choosing a prior distribution to represent prior knowledge
Suppose that before the experiment is carried out, based on historical background, we think that the actual proportion of supporters is about 0.5, give or take 0.1.
Here by "give or take" I mean that we are willing to assume that the prior standard deviation of $\theta$ is about 0.1. (Note that this is a probabilistic statement about $\theta$!)
Which $\alpha$ and $\beta$ values should we choose so that our prior Beta($\alpha$, $\beta$) will represent this knowledge?
-
We can choose the values for $\alpha$ and $\beta$ so that
$$\bar\theta = \frac{\alpha}{\alpha+\beta} = .5$$
and
$$\sigma^2 = \frac{\bar\theta(1-\bar\theta)}{\alpha+\beta+1} = .1^2.$$
Solving these two equations we get
$$\alpha = \beta = 12.$$
Therefore we choose the Beta(12, 12) distribution as our prior distribution for $\theta$.
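The same moment-matching works for any prior mean $m$ and standard deviation $s$: the variance equation gives $\alpha + \beta = m(1-m)/s^2 - 1$, and the mean equation splits this total into $\alpha = m(\alpha+\beta)$ and $\beta = (1-m)(\alpha+\beta)$. A small sketch (the helper name is mine):

```python
# Sketch: choose Beta(a, b) to match a prior mean m and prior sd s.
def beta_from_mean_sd(m, s):
    total = m * (1 - m) / s**2 - 1    # a + b, from the variance equation
    return m * total, (1 - m) * total

print(beta_from_mean_sd(0.5, 0.1))    # (12.0, 12.0)
```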
-
From prior to posterior
[Figure: two density curves over $\theta \in (0, 1)$: the prior $f(\theta) = $ Beta(12, 12) and the posterior $f(\theta \mid x) = $ Beta(52, 72).]
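A sketch for reproducing this figure (assuming numpy, scipy, and matplotlib are available):

```python
# Sketch: prior Beta(12,12) vs. posterior Beta(52,72) on the same axes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0, 1, 500)
plt.plot(theta, beta(12, 12).pdf(theta), label="f(theta) = Beta(12,12)")
plt.plot(theta, beta(52, 72).pdf(theta), label="f(theta|x) = Beta(52,72)")
plt.xlabel("theta")
plt.legend()
plt.show()
```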
-
Summarizing the effect of data on the distribution of $\theta$
The center of the distribution is shifted.
The spread of the distribution is also changed.
Let us quantify these two changes.
-
Before observing data, the prior distribution is Beta($\alpha$, $\beta$). So
$$E(\theta) = \frac{\alpha}{\alpha+\beta} = \bar\theta, \qquad \mathrm{Var}(\theta) = \frac{\bar\theta(1-\bar\theta)}{\alpha+\beta+1}.$$
After observing the data, $X = x$, the posterior distribution is Beta($\alpha + x$, $\beta + n - x$), and so
$$E(\theta \mid X = x) = \frac{\alpha + x}{\alpha+\beta+n} = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n} = \bar\theta_{|x}$$
and
$$\mathrm{Var}(\theta \mid X = x) = \frac{\bar\theta_{|x}(1-\bar\theta_{|x})}{\alpha+\beta+n+1}.$$
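Plugging in the running example ($\alpha = \beta = 12$, $n = 100$, $x = 40$) gives a quick check of these formulas (a sketch):

```python
# Sketch: prior vs. posterior mean and variance for the Beta(12,12) poll prior.
a, b, n, x = 12, 12, 100, 40

prior_mean = a / (a + b)                                   # 0.5
prior_var  = prior_mean * (1 - prior_mean) / (a + b + 1)   # 0.01, i.e. sd 0.1

post_mean = (a + x) / (a + b + n)                          # 52/124, about 0.419
post_var  = post_mean * (1 - post_mean) / (a + b + n + 1)  # about 0.0019
```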
-
An interpretation of the posterior mean
$$E(\theta \mid X = x) = \frac{\alpha+\beta}{\alpha+\beta+n}\,\bar\theta + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}$$
This is a weighted average of the prior mean $\bar\theta$ and the observed average $x/n$.
This is a compromise between our prior expectation (when there is no data) and the observed average.
When we have a lot of data, i.e. $n$ is very large compared to $\alpha+\beta$, the observed average $x/n$ dominates: $E(\theta \mid X = x) \approx x/n$.
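A sketch of this effect, holding the observed proportion at 0.4 while $n$ grows (with the Beta(12, 12) prior):

```python
# Sketch: the posterior mean moves from the prior mean 0.5 toward x/n = 0.4.
a, b = 12, 12
for n in (10, 100, 1000, 10000):
    x = int(0.4 * n)
    print(n, (a + x) / (a + b + n))   # 0.4706, 0.4194, 0.4023, 0.4002
```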
-
An interpretation of the posterior mean
$$E(\theta \mid X = x) = \frac{\alpha+\beta}{\alpha+\beta+n}\,\bar\theta + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}$$
When $\alpha+\beta$ is much larger than $n$, the prior expectation dominates: $E(\theta \mid X = x) \approx \bar\theta$.
$\alpha+\beta$ is the "prior sample size": it measures the amount of information we have about $\theta$.
$n$ is the observed sample size: it measures the amount of information we have in the data.
-
As for the variance
Before the data are observed,
$$\mathrm{Var}(\theta) = \frac{\bar\theta(1-\bar\theta)}{\alpha+\beta+1}.$$
For the same $\bar\theta$, the larger $\alpha+\beta$ is, the smaller $\mathrm{Var}(\theta)$ is: we are more certain about the value of $\theta$.
After the data are observed,
$$\mathrm{Var}(\theta \mid X = x) = \frac{\bar\theta_{|x}(1-\bar\theta_{|x})}{\alpha+\beta+n+1}.$$
For the same $\bar\theta_{|x}$, the larger the total amount of information $\alpha+\beta+n$ is, the smaller $\mathrm{Var}(\theta \mid X = x)$ is.
-
The general applicability of Bayes Theorem
Recall the two pieces we need to apply Bayes Theorem:
1. The distribution of the data conditional on the state of nature, $f(x \mid \theta)$.
2. The prior distribution $\pi(\theta)$ on the state of nature.
Bayes Theorem provides a probabilistic recipe for inference through learning the posterior distribution of the state of nature given the data, $\pi(\theta \mid x)$.
We have seen a particular example where $\pi(\theta)$ is Beta($\alpha$, $\beta$) and $f(x \mid \theta)$ is Binomial.
This scheme for inference applies generally to all choices of $f(x \mid \theta)$ and $\pi(\theta)$.
Let us now look at another example.
-
Example: Measuring air quality with an imperfect device
Suppose we want to measure the amount (in terms of density) of a certain pollutant in the air.
We have a device that is accurate on average.
However, in any single reading it may be off by an unpredictable amount, due to factors such as temperature, humidity, and the way the device is held.
In formal terms, let
$\theta$ be the actual amount of the pollutant in the air, and
$X$ be a reading of the device, which given $\theta$ is a random quantity that is close to but not exactly $\theta$.
Goal of inference: What is $\theta$?
-
The Bayesian procedure
Let us carry out a Bayesian inference just as we did for the political
poll example. Again, we need to specify the two pieces:
1. The conditional distribution of the data given the parameter (or state of nature).
2. A prior distribution $\pi(\theta)$ for the state of nature that summarizes our knowledge about $\theta$ before observing data.
-
What may be a good model for the reading of the device given $\theta$?
We can model the error made by the device on each reading as normally distributed with standard deviation $\sigma$. For simplicity, let us assume that this $\sigma$ is known. The user's manual will typically give the technical specs of the device, including its precision.
For example, $\sigma = 2$. Under this model, the conditional density for $X = x$ given $\theta$ is
$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}} \quad \text{for } -\infty < x < \infty.$$
-
Choosing the prior distribution for $\theta$
We can model our prior knowledge of the actual amount $\theta$ also as a Normal($\mu$, $\tau^2$) distribution.
The mean of the prior, $\mu$, can be chosen as the historical average amount, e.g. $\mu = 100$.
We can choose $\tau$ so that the historical records of the amount fall in the range $\mu \pm \tau$ about 2/3 of the time.
For example, if historically the amount of the pollutant falls in the range $100 \pm 10$ about 2/3 of the time, then we may choose $\mu = 100$ and $\tau = 10$.
-
Now that we have the two pieces, inference again is an application of
Bayes theorem.
$$\begin{aligned}
\pi(\theta \mid x) &\propto \pi(\theta)\, f(x \mid \theta) \\
&= \frac{1}{\sqrt{2\pi}\,\tau}\, e^{-\frac{(\theta-\mu)^2}{2\tau^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}} \\
&= \frac{1}{2\pi\sigma\tau}\, e^{-\left[\frac{(\theta-\mu)^2}{2\tau^2} + \frac{(x-\theta)^2}{2\sigma^2}\right]} \\
&\propto e^{-\frac{1}{2}\left[\frac{(\theta-\mu)^2}{\tau^2} + \frac{(x-\theta)^2}{\sigma^2}\right]}.
\end{aligned}$$
What is this distribution?
-
Note that
$$\begin{aligned}
\frac{(\theta-\mu)^2}{\tau^2} + \frac{(x-\theta)^2}{\sigma^2}
&= \theta^2\left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right) - 2\theta\left(\frac{\mu}{\tau^2}+\frac{x}{\sigma^2}\right) + \frac{\mu^2}{\tau^2} + \frac{x^2}{\sigma^2} \\
&= \left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\left[\theta^2 - 2\theta\,\frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2} + \left(\frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2}\right)^2\right] + \text{Const} \\
&= \left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\left(\theta - \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2}\right)^2 + \text{Const} \\
&= \frac{(\theta-\tilde\mu)^2}{\tilde\tau^2} + \text{Const},
\end{aligned}$$
where
$$\tilde\mu = \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2} \quad \text{and} \quad \tilde\tau^2 = \frac{1}{1/\tau^2 + 1/\sigma^2}.$$
-
Thus
$$\pi(\theta \mid x) \propto e^{-\frac{(\theta-\tilde\mu)^2}{2\tilde\tau^2}} \quad \text{for } -\infty < \theta < \infty,$$
which we recognize as the kernel of a normal density. That is, the posterior distribution of $\theta$ given $X = x$ is Normal($\tilde\mu$, $\tilde\tau^2$).
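The two posterior parameters translate directly into code (a sketch; the helper name normal_posterior is mine):

```python
# Sketch: Normal-Normal posterior from one reading x; prior N(mu, tau^2), noise sd sigma.
def normal_posterior(mu, tau, x, sigma):
    precision = 1 / tau**2 + 1 / sigma**2               # = 1 / tau_tilde^2
    mu_tilde = (mu / tau**2 + x / sigma**2) / precision
    return mu_tilde, 1 / precision                      # (mu_tilde, tau_tilde^2)
```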
-
Note that the posterior expectation is again a weighted average of the prior expectation $\mu$ and the observed data $x$:
$$E(\theta \mid X = x) = \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}\,\mu + \frac{1/\sigma^2}{1/\tau^2 + 1/\sigma^2}\,x.$$
The weights are determined by the relative sizes of $1/\tau^2$ (which measures the amount of information about $\theta$ in the prior) and $1/\sigma^2$ (which measures the amount of information about $\theta$ in the data).
For fixed $\sigma^2$, as $\tau^2 \to 0$, we have $E(\theta \mid X = x) \to \mu$ and $\mathrm{Var}(\theta \mid X = x) \to 0$. We are a priori certain that $\theta = \mu$.
For fixed $\tau^2$, as $\sigma^2 \to 0$, we have $E(\theta \mid X = x) \to x$ and $\mathrm{Var}(\theta \mid X = x) \to 0$. We have a perfect device.
-
The posterior variance
$$\mathrm{Var}(\theta \mid X = x) = \tilde\tau^2 = \frac{1}{1/\tau^2 + 1/\sigma^2}$$
is smaller than both $\tau^2$ and $\sigma^2$. We can think of $1/\tilde\tau^2$ as a measure of the information we have about $\theta$. The total information is the sum of the prior information and the information from the data:
$$\frac{1}{\tilde\tau^2} = \frac{1}{\tau^2} + \frac{1}{\sigma^2}.$$
-
For our ongoing example
Suppose we observe $X = 105$ and we know that $\sigma = 2$. Then
$$E(\theta \mid X = 105) = \frac{1/10^2}{1/10^2 + 1/2^2}\cdot 100 + \frac{1/2^2}{1/10^2 + 1/2^2}\cdot 105 \approx 104.8$$
$$\mathrm{Var}(\theta \mid X = 105) = \frac{1}{1/10^2 + 1/2^2} \approx 3.85.$$
If instead $\tau = 1$ (so we are a lot more certain a priori),
$$E(\theta \mid X = 105) = \frac{1/1^2}{1/1^2 + 1/2^2}\cdot 100 + \frac{1/2^2}{1/1^2 + 1/2^2}\cdot 105 = 101$$
$$\mathrm{Var}(\theta \mid X = 105) = \frac{1}{1/1^2 + 1/2^2} = 0.8.$$
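Both scenarios are easy to verify with the normal_posterior sketch from above:

```python
# Sketch: checking the two scenarios numerically.
print(normal_posterior(mu=100, tau=10, x=105, sigma=2))  # (104.8..., 3.85...)
print(normal_posterior(mu=100, tau=1,  x=105, sigma=2))  # (101.0, 0.8)
```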
-
Bayes Theorem for multiple independent observations
Now suppose that instead of taking one reading from the device, we take $n$ independent readings $X_1, X_2, \ldots, X_n$. That is, conditional on $\theta$, the $X_i$'s are independent identically distributed (i.i.d.) $N(\theta, \sigma^2)$ random variables. That is,
$$f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{\sum_{i=1}^n (x_i-\theta)^2}{2\sigma^2}}.$$
Then Bayes Theorem applies just as before:
$$\pi(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)}{\int f(x_1, x_2, \ldots, x_n \mid u)\,\pi(u)\,du} \propto f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta) = \pi(\theta)\prod_{i=1}^n f(x_i \mid \theta).$$
-
Exercise
If we still use $N(\mu, \tau^2)$ as the prior for the parameter $\theta$, show that given $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, the posterior distribution for $\theta$ is $N(\tilde\mu, \tilde\tau^2)$ with
$$\tilde\mu = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\,\mu + \frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\,\bar{x},$$
where
$$\bar{x} = \frac{\sum_{i=1}^n x_i}{n}$$
is the sample average, and
$$\tilde\tau^2 = \frac{1}{1/\tau^2 + n/\sigma^2}.$$
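As a numerical companion to the exercise (a sketch, not its derivation): with $n$ i.i.d. readings, the data precision becomes $n/\sigma^2$ and the data enter only through the sample average $\bar x$:

```python
# Sketch: Normal-Normal posterior from n i.i.d. readings; prior N(mu, tau^2).
def normal_posterior_n(mu, tau, xs, sigma):
    n = len(xs)
    xbar = sum(xs) / n                         # sample average
    precision = 1 / tau**2 + n / sigma**2      # = 1 / tau_tilde^2
    mu_tilde = (mu / tau**2 + n * xbar / sigma**2) / precision
    return mu_tilde, 1 / precision

print(normal_posterior_n(100, 10, [105, 103, 107], 2))
```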
-
Next ...
Conjugate models.
Point estimation.