
  • Slide 1/34

    STA 250/MTH 342 Intro to Mathematical Statistics

    Lecture 3

  • Slide 2/34

    HW2 due on Thursday 9/11 at the beginning of class.

  • Slide 3/34

    Last class

    Bayes Theorem and the ideal approach to inference.

    Example: a political poll (a binomial experiment).

  • Slide 4/34

    Political poll example revisited

    An organization randomly selected 100 Democrats and interviewed
    them about whether they support the incumbent governor. Out of the
    100 polled, 40 support the governor and 60 are against. What can
    we say about the actual support rate of the governor?

    Our model (or assumptions/hypotheses) is:

    (1) Given θ, X is Binomial(100, θ).

    (2) θ is Uniform(0, 1) a priori, representing our knowledge before
    the data are observed.

    Based on these two pieces, Bayes Theorem allows us to use the
    observed data to update our knowledge about the parameter:

    (3) θ is Beta(41, 61) a posteriori, representing our updated
    knowledge after the data are observed.
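    To make the posterior concrete, here is a minimal sketch that
    computes summaries of the Beta(41, 61) posterior, assuming Python
    with scipy.stats is available:

    ```python
    # Posterior for the poll: theta | X = 40 is Beta(41, 61) under the
    # Uniform(0, 1) prior, i.e. Beta(x + 1, n - x + 1) with x = 40, n = 100.
    from scipy.stats import beta

    posterior = beta(41, 61)

    print(posterior.mean())          # posterior mean, about 0.402
    print(posterior.std())           # posterior standard deviation, about 0.048
    print(posterior.cdf(0.5))        # P(theta <= 0.5 | data), about 0.98
    print(posterior.interval(0.95))  # central 95% posterior interval
    ```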

  • Slide 5/34

    Our analysis up to this point uses Uniform(0, 1) to represent our
    prior knowledge about θ.

    More flexible choices for π(θ) can allow richer prior knowledge
    about the parameter to be incorporated into the analysis.

  • Slide 6/34

    A richer class of priors for Binomial experiments

    A flexible and convenient choice of the prior distribution for θ
    is the Beta(α, β) family of distributions:

    $$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \quad \text{for } 0 \le \theta \le 1.$$
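    As a sanity check on this density formula, a minimal sketch
    (assuming Python; math.gamma is the Gamma function Γ) that
    evaluates it directly and compares against a library
    implementation:

    ```python
    # Beta(a, b) density evaluated straight from the formula above.
    import math
    from scipy.stats import beta as beta_dist

    def beta_pdf(theta, a, b):
        """Gamma(a+b) / (Gamma(a) Gamma(b)) * theta^(a-1) * (1-theta)^(b-1)."""
        const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
        return const * theta ** (a - 1) * (1 - theta) ** (b - 1)

    # Agrees with scipy's implementation at a few test points.
    for theta in (0.2, 0.5, 0.8):
        assert abs(beta_pdf(theta, 2, 5) - beta_dist.pdf(theta, 2, 5)) < 1e-10
    ```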

  • Slide 7/34

    [Figure: Beta(α, β) density curves for several values of α and β.]

    Source: Wikipedia.
    http://upload.wikimedia.org/wikipedia/commons/f/f3/Beta_distribution_pdf.svg
  • Slide 8/34

    Bayesian inference using a Beta(α, β) prior

    $$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = C\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1} \quad \text{for } 0 \le \theta \le 1.$$

  • Slide 9/34

    A trick to get the posterior density

    The part that depends on θ is

    $$\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1} \quad \text{for } 0 \le \theta \le 1,$$

    which is the kernel of a Beta(x + α, n - x + β) density. Since the
    posterior must integrate to 1, the normalizing constant is forced,
    so the posterior distribution of θ is Beta(x + α, n - x + β).

  • Slide 10/34

    A quick sum-up

    To make inference on a Binomial experiment using a Beta prior, we
    can proceed as follows.

    1. We model the distribution of the number of successes X,
    conditional on the success probability θ, as Binomial(n, θ).

    2. We then model the prior distribution π(θ) as a Beta(α, β)
    density. The values of α and β are chosen to represent our prior
    knowledge about the parameter.

    3. After observing X = x, Bayes Theorem tells us that the
    posterior distribution of θ is Beta(x + α, n - x + β).
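    The three steps translate directly into code. A minimal sketch
    (assuming Python with scipy; the helper name is my own choice):

    ```python
    # Beta-Binomial update: observing x successes in n trials turns a
    # Beta(a, b) prior into a Beta(x + a, n - x + b) posterior.
    from scipy.stats import beta

    def beta_binomial_posterior(x, n, a, b):
        return beta(x + a, n - x + b)

    # Poll example with the flat Beta(1, 1) = Uniform(0, 1) prior:
    post = beta_binomial_posterior(x=40, n=100, a=1, b=1)
    print(post.mean())  # about 0.402, i.e. the Beta(41, 61) posterior mean
    ```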

  • Slide 11/34

    The three steps in all Bayesian statistical analysis

    This example illustrates the general process of all Bayesian
    statistical analysis.

    1. Model the distribution of the data X conditional on a set of
    parameters θ.

    2. Choose a prior distribution π(θ) for the parameters, and

    3. Apply Bayes Theorem to find the posterior distribution π(θ | x).

    This posterior distribution is a probabilistic summary of our
    knowledge about the parameters given the data. It allows us to
    make probabilistic statements about the parameters such as the
    following.

    Given the data, the probability for θ to be in (0.7, 0.8) is . . . .

    Given the data, the expected value of θ is . . . .

  • Slide 12/34

    Choosing a prior distribution to represent prior knowledge

    Suppose that before the experiment is carried out, from historical
    background, we think that the actual proportion of supporters is
    about 0.5, give or take 0.1.

    Here by "give or take" I mean that we are willing to assume that
    the prior standard deviation of θ is about 0.1. (Note that this is
    a probabilistic statement about θ!)

    Which α and β values should we choose so that our Beta(α, β) prior
    will represent this knowledge?

  • Slide 13/34

    We can choose the values for α and β so that

    $$\mu = \frac{\alpha}{\alpha+\beta} = .5$$

    and

    $$\sigma^2 = \frac{\mu(1-\mu)}{\alpha+\beta+1} = .1^2.$$

    Solving these two equations we get

    $$\alpha = \beta = 12.$$

    Therefore we choose the Beta(12, 12) distribution as our prior
    distribution for θ.
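    The two moment equations can be inverted in closed form, which
    makes it easy to pick (α, β) for any prior mean and standard
    deviation. A minimal sketch (assuming Python; the function name is
    my own choice):

    ```python
    # Moment matching for a Beta prior: from mu = a/(a+b) and
    # sd^2 = mu*(1-mu)/(a+b+1), solve for a and b.
    def beta_from_mean_sd(mu, sd):
        total = mu * (1 - mu) / sd**2 - 1  # total = a + b
        return mu * total, (1 - mu) * total

    print(beta_from_mean_sd(0.5, 0.1))  # (12.0, 12.0), as on this slide
    ```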

  • Slide 14/34

    From prior to posterior

    [Figure: the prior density f(theta) = Beta(12, 12) and the
    posterior density f(theta | x) = Beta(52, 72), plotted over
    theta in [0, 1].]
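    A minimal sketch that reproduces this figure (assuming Python with
    matplotlib and scipy; styling choices are my own):

    ```python
    # Prior Beta(12, 12) versus posterior Beta(52, 72) for the poll data.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import beta

    theta = np.linspace(0, 1, 500)
    plt.plot(theta, beta.pdf(theta, 12, 12), label="f(theta) = Beta(12, 12)")
    plt.plot(theta, beta.pdf(theta, 52, 72), label="f(theta | x) = Beta(52, 72)")
    plt.xlabel("theta")
    plt.legend()
    plt.show()
    ```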

  • Slide 15/34

    Summarizing the effect of the data on the distribution of θ

    The center of the distribution is shifted.

    The spread of the distribution is also changed.

    Let us quantify these two changes.


  • Slide 17/34

    Before observing the data, the prior distribution is Beta(α, β), so

    $$E(\theta) = \frac{\alpha}{\alpha+\beta} = \mu \qquad \text{and} \qquad \mathrm{Var}(\theta) = \frac{\mu(1-\mu)}{\alpha+\beta+1}.$$

    After observing the data X = x, the posterior distribution is
    Beta(α + x, β + n - x), and so

    $$E(\theta \mid X = x) = \frac{\alpha+x}{\alpha+\beta+n} = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\mu + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n} = \mu_{|x}$$

    and

    $$\mathrm{Var}(\theta \mid X = x) = \frac{\mu_{|x}\,(1-\mu_{|x})}{\alpha+\beta+n+1}.$$

  • Slide 18/34

    An interpretation of the posterior mean

    $$E(\theta \mid X = x) = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}$$

    This is a weighted average of the prior mean α/(α+β) and the
    observed average x/n.

    It is a compromise between our prior expectation (when there is no
    data) and the observed average.

    When we have a lot of data, i.e. when n is very large compared to
    α + β, the observed average x/n dominates: E(θ | X = x) ≈ x/n.

  • Slide 19/34

    An interpretation of the posterior mean

    $$E(\theta \mid X = x) = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}$$

    When α + β is much larger than n, the prior expectation dominates:
    E(θ | X = x) ≈ α/(α+β).

    α + β is the prior sample size: it measures the amount of
    information we have about θ.

    n is the observed sample size: it measures the amount of
    information we have in the data.
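    A minimal numerical illustration of this weighted average
    (assuming Python; the numbers are the poll data with the
    Beta(12, 12) prior):

    ```python
    # Posterior mean as a weighted average of the prior mean and x/n.
    a, b, x, n = 12, 12, 40, 100

    w_prior = (a + b) / (a + b + n)   # weight on the prior mean, 24/124
    w_data = n / (a + b + n)          # weight on the observed average, 100/124
    weighted = w_prior * a / (a + b) + w_data * x / n

    print(weighted)                # 0.41935...
    print((a + x) / (a + b + n))   # identical: 52/124
    ```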

  • Slide 20/34

    As for the variance

    Before the data are observed,

    $$\mathrm{Var}(\theta) = \frac{\mu(1-\mu)}{\alpha+\beta+1}.$$

    For the same μ, the larger α + β is, the smaller Var(θ) is: we are
    more certain about the value of θ.

    After the data are observed,

    $$\mathrm{Var}(\theta \mid X = x) = \frac{\mu_{|x}\,(1-\mu_{|x})}{\alpha+\beta+n+1}.$$

    For the same μ_{|x}, the larger the total amount of information
    α + β + n is, the smaller Var(θ | X = x) is.

  • Slide 21/34

    The general applicability of Bayes Theorem

    Recall the two pieces we need to apply Bayes Theorem:

    1. The distribution of the data conditional on the state of
    nature, f(x | θ).

    2. The prior distribution π(θ) on the state of nature.

    Bayes Theorem provides a probabilistic recipe for inference
    through learning the posterior distribution of the state of nature
    given the data, π(θ | x). We have seen a particular example where
    π(θ) is Beta(α, β) and f(x | θ) is Binomial.

    This scheme for inference applies generally to all choices of
    f(x | θ) and π(θ).

    Let us now look at another example.

  • Slide 22/34

    Example: Measuring air quality with an imperfect device

    Suppose we want to measure the amount (in terms of density) of a
    certain pollutant in the air.

    We have a device that is accurate on average.

    However, in any single reading it may be off by an unpredictable
    amount, due to factors such as temperature, humidity, the way the
    device is held, etc.

    In formal terms, let

    θ be the actual amount of the pollutant in the air;

    X be a reading of the device, which given θ is a random quantity
    that is close to, but not exactly, θ.

    Goal of inference: what is θ?

  • Slide 23/34

    The Bayesian procedure

    Let us carry out a Bayesian inference just as we did for the
    political poll example. Again, we need to specify the two pieces:

    1. The conditional distribution of the data given the parameter
    (or state of nature).

    2. A prior distribution π(θ) for the state of nature that
    summarizes our knowledge about θ before observing the data.

  • Slide 24/34

    What may be a good model for the reading of the device given θ?

    We can model the error made by the device on each reading as
    normally distributed with standard deviation σ. For simplicity,
    let us assume that σ is known. The user's manual will typically
    give the technical specs of the device, including its precision.

    For example, σ = 2. Under this model, the conditional density for
    X = x given θ is

    $$f(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\theta)^2}{2\sigma^2}} \quad \text{for } -\infty < x < \infty.$$

  • Slide 25/34

    Choosing the prior distribution for θ

    We can model our prior knowledge of the actual amount θ also as a
    Normal(μ, τ²) distribution.

    The mean of the prior, μ, can be chosen as the historical average
    amount, e.g. μ = 100.

    We can choose τ so that the historical records of the amount fall
    in the range μ ± τ about 2/3 of the time.

    For example, if historically the amount of the pollutant falls in
    the range 100 ± 10 about 2/3 of the time, then we may choose
    μ = 100 and τ = 10.

  • Slide 26/34

    Now that we have the two pieces, inference again is an application
    of Bayes theorem:

    $$\begin{aligned}
    \pi(\theta \mid x) &\propto \pi(\theta)\,f(x \mid \theta) \\
    &= \frac{1}{\sqrt{2\pi}\,\tau}\,e^{-\frac{(\theta-\mu)^2}{2\tau^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\theta)^2}{2\sigma^2}} \\
    &= \frac{1}{2\pi\,\tau\sigma}\,e^{-\left[\frac{(\theta-\mu)^2}{2\tau^2} + \frac{(x-\theta)^2}{2\sigma^2}\right]} \\
    &\propto e^{-\frac{1}{2}\left[\frac{(\theta-\mu)^2}{\tau^2} + \frac{(x-\theta)^2}{\sigma^2}\right]}.
    \end{aligned}$$

    What is this distribution?

  • Slide 27/34

    Note that

    $$\begin{aligned}
    \frac{(\theta-\mu)^2}{\tau^2} + \frac{(x-\theta)^2}{\sigma^2}
    &= \theta^2\left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right) - 2\theta\left(\frac{\mu}{\tau^2}+\frac{x}{\sigma^2}\right) + \frac{\mu^2}{\tau^2} + \frac{x^2}{\sigma^2} \\
    &= \left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\left[\theta^2 - 2\theta\,\frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2} + \left(\frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2}\right)^2\right] + \text{Const} \\
    &= \left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\left(\theta - \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2}\right)^2 + \text{Const} \\
    &= \frac{(\theta - \tilde{\mu})^2}{\tilde{\tau}^2} + \text{Const}
    \end{aligned}$$

    where

    $$\tilde{\mu} = \frac{\mu/\tau^2 + x/\sigma^2}{1/\tau^2 + 1/\sigma^2} \qquad \text{and} \qquad \tilde{\tau}^2 = \frac{1}{1/\tau^2 + 1/\sigma^2}.$$

  • Slide 28/34

    Thus

    $$\pi(\theta \mid x) \propto e^{-\frac{(\theta - \tilde{\mu})^2}{2\tilde{\tau}^2}} \quad \text{for } -\infty < \theta < \infty.$$

    This is the kernel of a normal density, so the posterior
    distribution of θ given X = x is Normal(μ̃, τ̃²).
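    The update derived above fits in a few lines of code. A minimal
    sketch (assuming Python; the function name is my own choice):

    ```python
    # Normal-Normal update for one reading x: N(mu, tau^2) prior,
    # N(theta, sigma^2) measurement error; returns (mu-tilde, tau-tilde^2).
    def normal_posterior(x, sigma, mu, tau):
        precision = 1 / tau**2 + 1 / sigma**2
        post_mean = (mu / tau**2 + x / sigma**2) / precision
        return post_mean, 1 / precision
    ```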

  • Slide 29/34

    Note that the posterior expectation is again a weighted average of
    the prior expectation μ and the observed data x:

    $$E(\theta \mid X = x) = \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}\,\mu + \frac{1/\sigma^2}{1/\tau^2 + 1/\sigma^2}\,x.$$

    The weights are determined by the relative sizes of 1/τ² (which
    measures the amount of information about θ in the prior) and 1/σ²
    (which measures the amount of information about θ in the data).

    For fixed σ², as τ² → 0, we have E(θ | X = x) → μ and
    Var(θ | X = x) → 0. We are a priori certain that θ = μ.

    For fixed τ², as σ² → 0, we have E(θ | X = x) → x and
    Var(θ | X = x) → 0. We have a perfect device.

  • Slide 30/34

    The posterior variance

    $$\mathrm{Var}(\theta \mid X = x) = \tilde{\tau}^2 = \frac{1}{1/\tau^2 + 1/\sigma^2}$$

    is smaller than both τ² and σ². We can think of 1/τ² as a measure
    of the information we have about θ. The total information is the
    sum of the prior information and the information from the data:

    $$\frac{1}{\tilde{\tau}^2} = \frac{1}{\tau^2} + \frac{1}{\sigma^2}.$$

  • Slide 31/34

    For our ongoing example

    Suppose we observe X = 105 and we know that σ = 2. Then

    $$E(\theta \mid X = 105) = \frac{1/10^2}{1/10^2 + 1/2^2}\cdot 100 + \frac{1/2^2}{1/10^2 + 1/2^2}\cdot 105 = 104.8,$$

    $$\mathrm{Var}(\theta \mid X = 105) = \frac{1}{1/10^2 + 1/2^2} = 3.85.$$

    If instead τ = 1 (so we are a lot more certain a priori),

    $$E(\theta \mid X = 105) = \frac{1/1^2}{1/1^2 + 1/2^2}\cdot 100 + \frac{1/2^2}{1/1^2 + 1/2^2}\cdot 105 = 101,$$

    $$\mathrm{Var}(\theta \mid X = 105) = \frac{1}{1/1^2 + 1/2^2} = 0.8.$$
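    These numbers are easy to verify with plain arithmetic. A minimal
    sketch (assuming Python):

    ```python
    # Posterior mean and variance for x = 105, sigma = 2, mu = 100,
    # with prior standard deviations tau = 10 and tau = 1.
    x, sigma, mu = 105, 2, 100

    for tau in (10, 1):
        precision = 1 / tau**2 + 1 / sigma**2
        post_mean = (mu / tau**2 + x / sigma**2) / precision
        print(tau, round(post_mean, 2), round(1 / precision, 2))
    # tau = 10: mean 104.81, variance 3.85
    # tau = 1:  mean 101.0,  variance 0.8
    ```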

  • Slide 32/34

    Bayes Theorem for multiple independent observations

    Now suppose that instead of taking one reading from the device, we
    take n independent readings X_1, X_2, . . . , X_n. That is,
    conditional on θ, the X_i's are independent identically
    distributed (i.i.d.) N(θ, σ²) random variables. That is,

    $$f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{\sum_{i=1}^{n}(x_i-\theta)^2}{2\sigma^2}}.$$

    Then Bayes Theorem applies just as before:

    $$\pi(\theta \mid x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n \mid \theta)\,\pi(\theta)}{\int f(x_1, \ldots, x_n \mid u)\,\pi(u)\,du} \propto f(x_1, \ldots, x_n \mid \theta)\,\pi(\theta) = \pi(\theta)\prod_{i=1}^{n} f(x_i \mid \theta).$$

  • Slide 33/34

    Exercise

    If we still use N(μ, τ²) as the prior for the parameter θ, show
    that given X_1 = x_1, X_2 = x_2, . . . , X_n = x_n, the posterior
    distribution for θ is N(μ̃, τ̃²) with

    $$\tilde{\mu} = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\,\mu + \frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\,\bar{x}, \qquad \text{where } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

    is the sample average, and

    $$\tilde{\tau}^2 = \frac{1}{1/\tau^2 + n/\sigma^2}.$$
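    A minimal sketch of the n-observation update in the exercise
    (assuming Python; note that the data enter only through the sample
    average, carrying n/σ² units of information):

    ```python
    # Normal-Normal update for n i.i.d. readings.
    from statistics import mean

    def normal_posterior_n(xs, sigma, mu, tau):
        n = len(xs)
        precision = 1 / tau**2 + n / sigma**2
        post_mean = (mu / tau**2 + n * mean(xs) / sigma**2) / precision
        return post_mean, 1 / precision

    # With one reading, this reduces to the single-observation update:
    print(normal_posterior_n([105], sigma=2, mu=100, tau=10))  # ~ (104.81, 3.85)
    ```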

  • Slide 34/34

    Next ...

    Conjugate models.

    Point estimation.