

    Lectures On Statistics

    Robert B. Ash

    Preface

    These notes are based on a course that I gave at UIUC in 1996 and again in 1997. No

    prior knowledge of statistics is assumed. A standard first course in probability is a pre-

    requisite, but the first 8 lectures review results from basic probability that are important

    in statistics. Some exposure to matrix algebra is needed to cope with the multivariate

    normal distribution in Lecture 21, and there is a linear algebra review in Lecture 19. Here

    are the lecture titles:

    1. Transformation of Random Variables

    2. Jacobians

    3. Moment-generating functions

    4. Sampling from a normal population

    5. The T and F distributions

    6. Order statistics

    7. The weak law of large numbers

    8. The central limit theorem

    9. Estimation

    10. Confidence intervals

    11. More confidence intervals

    12. Hypothesis testing

    13. Chi square tests

    14. Sufficient statistics

    15. Rao-Blackwell theorem

    16. Lehmann-Scheffe theorem

    17. Complete sufficient statistics for the exponential class

    18. Bayes estimates

    19. Linear algebra review

    20. Correlation

    21. The multivariate normal distribution

    22. The bivariate normal distribution

    23. Cramer-Rao inequality

    24. Nonparametric statistics

    25. The Wilcoxon test

    Copyright © 2007 by Robert B. Ash. Paper or electronic copies for personal use may be

    made freely without explicit permission of the author. All other rights are reserved.


    Lecture 1. Transformation of Random Variables

Suppose we are given a random variable X with density f_X(x). We apply a function g to produce a random variable Y = g(X). We can think of X as the input to a black box, and Y the output. We wish to find the density or distribution function of Y. We illustrate the technique for the example in Figure 1.1.

[Figure 1.1: the density f_X(x), equal to 1/2 on [-1, 0] and to (1/2)e^{-x} for x > 0, together with the transformation Y = X^2; the points -\sqrt{y} and \sqrt{y} are marked on the x-axis.]

The distribution function method finds F_Y directly, and then f_Y by differentiation. We have F_Y(y) = 0 for y < 0. If 0 \le y \le 1 (Figure 1.2), then

    F_Y(y) = P\{-\sqrt{y} \le X \le \sqrt{y}\} = \frac{1}{2}\sqrt{y} + \frac{1}{2}(1 - e^{-\sqrt{y}}).

If y > 1 (Figure 1.3), then

    F_Y(y) = \frac{1}{2} + \int_0^{\sqrt{y}} \frac{1}{2}e^{-x}\,dx = \frac{1}{2} + \frac{1}{2}(1 - e^{-\sqrt{y}}).

[Figure 1.3: the density f_X(x), with the interval from -\sqrt{y} to \sqrt{y} marked on the x-axis for y > 1.]

The density of Y is 0 for y < 0, and differentiating F_Y on each range gives

    f_Y(y) = \frac{1}{4\sqrt{y}}\,(1 + e^{-\sqrt{y}}), \quad 0 < y < 1, \qquad f_Y(y) = \frac{1}{4\sqrt{y}}\,e^{-\sqrt{y}}, \quad y > 1.

See Figure 1.4 for a sketch of f_Y and F_Y. (You can take f_Y(y) to be anything you like at y = 1 because {Y = 1} has probability zero.)

[Figure 1.4: sketches of f_Y(y) and F_Y(y); F_Y(y) = \frac{1}{2}\sqrt{y} + \frac{1}{2}(1 - e^{-\sqrt{y}}) for 0 \le y \le 1, and F_Y(y) = \frac{1}{2} + \frac{1}{2}(1 - e^{-\sqrt{y}}) for y > 1.]

The density function method finds f_Y directly, and then F_Y by integration; see Figure 1.5. We have f_Y(y)|dy| = f_X(\sqrt{y})\,dx + f_X(-\sqrt{y})\,dx; we write |dy| because probabilities are never negative. Thus

    f_Y(y) = \frac{f_X(\sqrt{y})}{|dy/dx|_{x=\sqrt{y}}} + \frac{f_X(-\sqrt{y})}{|dy/dx|_{x=-\sqrt{y}}}

with y = x^2, dy/dx = 2x, so

    f_Y(y) = \frac{f_X(\sqrt{y})}{2\sqrt{y}} + \frac{f_X(-\sqrt{y})}{2\sqrt{y}}.

(Note that |-2\sqrt{y}| = 2\sqrt{y}.) We have f_Y(y) = 0 for y < 0.

Case 1. 0 < y < 1 (see Figure 1.1). Both terms contribute, with f_X(\sqrt{y}) = \frac{1}{2}e^{-\sqrt{y}} and f_X(-\sqrt{y}) = \frac{1}{2}, so f_Y(y) = \frac{1}{4\sqrt{y}}(1 + e^{-\sqrt{y}}).


Case 2. y > 1 (see Figure 1.3).

    f_Y(y) = \frac{1}{2}e^{-\sqrt{y}}\,\frac{1}{2\sqrt{y}} + 0 = \frac{1}{4\sqrt{y}}\,e^{-\sqrt{y}}

as before.

[Figure 1.5: the graph of Y = X^2, showing the two x-values -\sqrt{y} and \sqrt{y} that map to a given y.]

The distribution function method generalizes to situations where we have a single output but more than one input. For example, let X and Y be independent, each uniformly distributed on [0, 1]. The distribution function of Z = X + Y is

    F_Z(z) = P\{X + Y \le z\} = \iint_{x+y \le z} f_{XY}(x, y)\,dx\,dy

with f_{XY}(x, y) = f_X(x)f_Y(y) by independence. Now F_Z(z) = 0 for z < 0 and F_Z(z) = 1 for z > 2 (because 0 \le Z \le 2).

Case 1. If 0 \le z \le 1, then F_Z(z) is the shaded area in Figure 1.6, which is z^2/2.
Case 2. If 1 \le z \le 2, then F_Z(z) is the shaded area in Figure 1.7, which is 1 - [(2 - z)^2/2].

Thus (see Figure 1.8)

    f_Z(z) = \begin{cases} z, & 0 \le z \le 1 \\ 2 - z, & 1 \le z \le 2 \\ 0 & \text{elsewhere.} \end{cases}
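As a quick numerical check (a sketch using NumPy; the sample size is arbitrary), simulated values of Z = X + Y can be compared with the triangular density just derived:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)
y = rng.uniform(0, 1, 100_000)
z = x + y

# Compare the empirical histogram of Z with the triangular density above.
edges = np.linspace(0, 2, 21)
hist, _ = np.histogram(z, bins=edges, density=True)
mid = (edges[:-1] + edges[1:]) / 2
f = np.where(mid <= 1, mid, 2 - mid)   # f_Z(z) = z on [0,1], 2 - z on [1,2]
print(np.max(np.abs(hist - f)))        # small (on the order of 0.01)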

    Problems

1. Let X, Y, Z be independent, identically distributed (from now on, abbreviated iid) random variables, each with density f(x) = 6x^5 for 0 \le x \le 1, and 0 elsewhere. Find the distribution and density functions of the maximum of X, Y and Z.

2. Let X and Y be independent, each with density e^{-x}, x \ge 0. Find the distribution (from now on, an abbreviation for "find the distribution or density function") of Z = Y/X.

3. A discrete random variable X takes values x_1, \ldots, x_n, each with probability 1/n. Let Y = g(X) where g is an arbitrary real-valued function. Express the probability function of Y (p_Y(y) = P\{Y = y\}) in terms of g and the x_i.


[Figures 1.6 and 1.7: the unit square in the x-y plane with the region x + y \le z shaded; for 0 \le z \le 1 the shaded region is a triangle with legs of length z, and for 1 \le z \le 2 the unshaded corner above the line x + y = z is a triangle with legs of length 2 - z.]

[Figure 1.8: the triangular density f_Z(z), rising from 0 at z = 0 to 1 at z = 1 and falling back to 0 at z = 2.]

4. A random variable X has density f(x) = ax^2 on the interval [0, b]. Find the density of Y = X^3.

5. The Cauchy density is given by f(y) = 1/[\pi(1 + y^2)] for all real y. Show that one way to produce this density is to take the tangent of a random variable X that is uniformly distributed between -\pi/2 and \pi/2.


    Lecture 2. Jacobians

We need this idea to generalize the density function method to problems where there are k inputs and k outputs, with k \ge 2. However, if there are k inputs and j < k outputs, often extra outputs can be introduced, as we will see later in the lecture.

    2.1 The Setup

Let X = X(U, V), Y = Y(U, V). Assume a one-to-one transformation, so that we can solve for U and V. Thus U = U(X, Y), V = V(X, Y). Look at Figure 2.1. If u changes by du then x changes by (\partial x/\partial u)\,du and y changes by (\partial y/\partial u)\,du. Similarly, if v changes by dv then x changes by (\partial x/\partial v)\,dv and y changes by (\partial y/\partial v)\,dv. The small rectangle in the u-v plane corresponds to a small parallelogram in the x-y plane (Figure 2.2), with A = (\partial x/\partial u, \partial y/\partial u, 0)\,du and B = (\partial x/\partial v, \partial y/\partial v, 0)\,dv. The area of the parallelogram is

|A \times B|, and

    A \times B = \begin{vmatrix} I & J & K \\ \partial x/\partial u & \partial y/\partial u & 0 \\ \partial x/\partial v & \partial y/\partial v & 0 \end{vmatrix} du\,dv = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} du\,dv\,K.

(A determinant is unchanged if we transpose the matrix, i.e., interchange rows and columns.)

[Figure 2.1: a small rectangle R with sides du and dv in the u-v plane; a change du moves (x, y) by ((\partial x/\partial u)\,du, (\partial y/\partial u)\,du) in the x-y plane.]

[Figure 2.2: the image S of R in the x-y plane, a small parallelogram spanned by the vectors A and B.]

    2.2 Definition and Discussion

The Jacobian of the transformation is

    J = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}, \quad \text{written as} \quad \frac{\partial(x, y)}{\partial(u, v)}.


Thus |A \times B| = |J|\,du\,dv. Now P\{(X, Y) \in S\} = P\{(U, V) \in R\}; in other words, f_{XY}(x, y) times the area of S is f_{UV}(u, v) times the area of R. Thus

    f_{XY}(x, y)\,|J|\,du\,dv = f_{UV}(u, v)\,du\,dv

and

    f_{UV}(u, v) = f_{XY}(x, y)\left| \frac{\partial(x, y)}{\partial(u, v)} \right|.

The absolute value of the Jacobian \partial(x, y)/\partial(u, v) gives a magnification factor for area in going from u-v coordinates to x-y coordinates. The magnification factor going the other way is |\partial(u, v)/\partial(x, y)|. But the magnification factor from u-v to u-v is 1, so

    f_{UV}(u, v) = \frac{f_{XY}(x, y)}{|\partial(u, v)/\partial(x, y)|}.

In this formula, we must substitute x = x(u, v), y = y(u, v) to express the final result in terms of u and v.

In three dimensions, a small rectangular box with volume du\,dv\,dw corresponds to a parallelepiped in xyz space, determined by vectors

    A = \left( \frac{\partial x}{\partial u}, \frac{\partial y}{\partial u}, \frac{\partial z}{\partial u} \right) du, \quad B = \left( \frac{\partial x}{\partial v}, \frac{\partial y}{\partial v}, \frac{\partial z}{\partial v} \right) dv, \quad C = \left( \frac{\partial x}{\partial w}, \frac{\partial y}{\partial w}, \frac{\partial z}{\partial w} \right) dw.

The volume of the parallelepiped is the absolute value of the dot product of A with B \times C, and the dot product can be written as a determinant with rows (or columns) A, B, C. This determinant is the Jacobian of x, y, z with respect to u, v, w [written \partial(x, y, z)/\partial(u, v, w)], times du\,dv\,dw. The volume magnification from uvw to xyz space is |\partial(x, y, z)/\partial(u, v, w)| and we have

    f_{UVW}(u, v, w) = \frac{f_{XYZ}(x, y, z)}{|\partial(u, v, w)/\partial(x, y, z)|}

with x = x(u, v, w), y = y(u, v, w), z = z(u, v, w).

The Jacobian technique extends to higher dimensions. The transformation formula is a natural generalization of the two- and three-dimensional cases:

    f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = \frac{f_{X_1 \cdots X_n}(x_1, \ldots, x_n)}{|\partial(y_1, \ldots, y_n)/\partial(x_1, \ldots, x_n)|}

where

    \frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} = \begin{vmatrix} \partial y_1/\partial x_1 & \cdots & \partial y_1/\partial x_n \\ \vdots & & \vdots \\ \partial y_n/\partial x_1 & \cdots & \partial y_n/\partial x_n \end{vmatrix}.

To help you remember the formula, think f(y)\,dy = f(x)\,dx.


    2.3 A Typical Application

Let X and Y be independent, positive random variables with densities f_X and f_Y, and let Z = XY. We find the density of Z by introducing a new random variable W, as follows:

    Z = XY, \quad W = Y

(W = X would be equally good). The transformation is one-to-one because we can solve for X, Y in terms of Z, W by X = Z/W, Y = W. In a problem of this type, we must always pay attention to the range of the variables: x > 0, y > 0 is equivalent to z > 0, w > 0. Now

    f_{ZW}(z, w) = \frac{f_{XY}(x, y)}{|\partial(z, w)/\partial(x, y)|}\bigg|_{x=z/w,\,y=w}

with

    \frac{\partial(z, w)}{\partial(x, y)} = \begin{vmatrix} \partial z/\partial x & \partial z/\partial y \\ \partial w/\partial x & \partial w/\partial y \end{vmatrix} = \begin{vmatrix} y & x \\ 0 & 1 \end{vmatrix} = y.

Thus

    f_{ZW}(z, w) = \frac{f_X(x)f_Y(y)}{w} = \frac{f_X(z/w)f_Y(w)}{w}

and we are left with the problem of finding the marginal density from a joint density:

    f_Z(z) = \int_{-\infty}^{\infty} f_{ZW}(z, w)\,dw = \int_0^{\infty} \frac{1}{w} f_X(z/w) f_Y(w)\,dw.
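This integral formula can be checked numerically; the sketch below (not from the notes) takes X and Y to be standard exponentials, evaluates the integral with SciPy's quad, and compares with a simulation:

import numpy as np
from scipy.integrate import quad

def f_exp(t):
    # standard exponential density
    return np.exp(-t) * (t > 0)

def f_Z(z):
    # f_Z(z) = integral over w of (1/w) f_X(z/w) f_Y(w) dw
    val, _ = quad(lambda w: f_exp(z / w) * f_exp(w) / w, 0, np.inf)
    return val

rng = np.random.default_rng(1)
z_samples = rng.exponential(size=200_000) * rng.exponential(size=200_000)
# Compare P{Z <= 1} from the formula with the simulated relative frequency.
prob, _ = quad(f_Z, 0, 1)
print(prob, np.mean(z_samples <= 1))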

    Problems

1. The joint density of two random variables X_1 and X_2 is f(x_1, x_2) = 2e^{-x_1}e^{-x_2}, where 0 < x_1 < x_2 < \infty (and 0 elsewhere). Consider the transformation Y_1 = 2X_1, Y_2 = X_2 - X_1. Find the joint density of Y_1 and Y_2, and show that Y_1 and Y_2 are independent.

3. The transformation equations are given by Y_1 = X_1/(X_1 + X_2), Y_2 = (X_1 + X_2)/(X_1 + X_2 + X_3), Y_3 = X_1 + X_2 + X_3. As before, find the joint density of the Y_i and show that Y_1, Y_2 and Y_3 are independent.

    Comments on the Problem Set

In Problem 3, notice that Y_1Y_2Y_3 = X_1 and Y_2Y_3 = X_1 + X_2, so X_2 = Y_2Y_3 - Y_1Y_2Y_3 and X_3 = (X_1 + X_2 + X_3) - (X_1 + X_2) = Y_3 - Y_2Y_3.

If f_{XY}(x, y) = g(x)h(y) for all x, y, then X and Y are independent, because

    f(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{g(x)h(y)}{g(x)\int_{-\infty}^{\infty} h(y)\,dy}


which does not depend on x. The set of points where g(x) = 0 (equivalently f_X(x) = 0) can be ignored because it has probability zero. It is important to realize that in this argument, "for all x, y" means that x and y must be allowed to vary independently of each other, so the set of possible x and y must be of the rectangular form a < x < b, c < y < d. (The constants a, b, c, d can be infinite.) For example, if f_{XY}(x, y) = 2e^{-x}e^{-y}, 0 < y < x, and 0 elsewhere, then X and Y are not independent. Knowing x forces 0 < y < x, so the conditional distribution of Y given X = x certainly depends on x. Note that f_{XY}(x, y) is not a function of x alone times a function of y alone. We have

    f_{XY}(x, y) = 2e^{-x}e^{-y}\,I[0 < y < x]

where the indicator I is 1 for 0 < y < x and 0 elsewhere.

In Jacobian problems, pay close attention to the range of the variables. For example, in Problem 1 we have y_1 = 2x_1, y_2 = x_2 - x_1, so x_1 = y_1/2, x_2 = (y_1/2) + y_2. From these equations it follows that 0 < x_1 < x_2 < \infty is equivalent to y_1 > 0, y_2 > 0.


    Lecture 3. Moment-Generating Functions

    3.1 Definition

The moment-generating function of a random variable X is defined by

    M(t) = M_X(t) = E[e^{tX}]

where t is a real number. To see the reason for the terminology, note that M(t) is the expectation of 1 + tX + t^2X^2/2! + t^3X^3/3! + \cdots. If \mu_n = E(X^n), the n-th moment of X, and we can take the expectation term by term, then

    M(t) = 1 + \mu_1 t + \frac{\mu_2 t^2}{2!} + \cdots + \frac{\mu_n t^n}{n!} + \cdots.

Since the coefficient of t^n in the Taylor expansion is M^{(n)}(0)/n!, where M^{(n)} is the n-th derivative of M, we have

    \mu_n = M^{(n)}(0).

    3.2 The Key Theorem

If Y = \sum_{i=1}^n X_i where X_1, \ldots, X_n are independent, then M_Y(t) = \prod_{i=1}^n M_{X_i}(t).

Proof. First note that if X and Y are independent, then

    E[g(X)h(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y)f_{XY}(x, y)\,dx\,dy.

Since f_{XY}(x, y) = f_X(x)f_Y(y), the double integral becomes

    \int_{-\infty}^{\infty} g(x)f_X(x)\,dx \int_{-\infty}^{\infty} h(y)f_Y(y)\,dy = E[g(X)]E[h(Y)]

and similarly for more than two random variables. Now if Y = X_1 + \cdots + X_n with the X_i independent, we have

    M_Y(t) = E[e^{tY}] = E[e^{tX_1} \cdots e^{tX_n}] = E[e^{tX_1}] \cdots E[e^{tX_n}] = M_{X_1}(t) \cdots M_{X_n}(t).

    3.3 The Main Application

Given independent random variables X_1, \ldots, X_n with densities f_1, \ldots, f_n respectively, find the density of Y = \sum_{i=1}^n X_i.

Step 1. Compute M_i(t), the moment-generating function of X_i, for each i.

Step 2. Compute M_Y(t) = \prod_{i=1}^n M_i(t).

Step 3. From M_Y(t) find f_Y(y).

This technique is known as a transform method. Notice that the moment-generating function and the density of a random variable are related by M(t) = \int_{-\infty}^{\infty} e^{tx}f(x)\,dx. With t replaced by -s we have a Laplace transform, and with t replaced by it we have a Fourier transform. The strategy works because at Step 3, the moment-generating function determines the density uniquely. (This is a theorem from Laplace or Fourier transform theory.)
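A small simulation sketch of the key theorem in (3.2), with exponential summands chosen arbitrarily: the empirical moment-generating function of the sum is close to the product of the individual ones, and to the exact gamma MGF.

import numpy as np

rng = np.random.default_rng(2)
n, t = 3, 0.4                      # t must be less than 1 here (exponential with beta = 1)
x = rng.exponential(size=(n, 500_000))
y = x.sum(axis=0)

emp_sum = np.mean(np.exp(t * y))                              # estimate of M_Y(t)
emp_prod = np.prod([np.mean(np.exp(t * xi)) for xi in x])     # product of estimated M_{X_i}(t)
exact = (1 - t) ** (-n)                                       # gamma(alpha = n, beta = 1) MGF
print(emp_sum, emp_prod, exact)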

  • 5/19/2018 Ash - 2007 - Lectures On Statistics.pdf

    11/113

    10

    3.4 Examples

1. Bernoulli Trials. Let X be the number of successes in n trials with probability of success p on a given trial. Then X = X_1 + \cdots + X_n, where X_i = 1 if there is a success on trial i and X_i = 0 if there is a failure on trial i. Thus

    M_i(t) = E[e^{tX_i}] = P\{X_i = 1\}e^{t\cdot 1} + P\{X_i = 0\}e^{t\cdot 0} = pe^t + q

with p + q = 1. The moment-generating function of X is

    M_X(t) = (pe^t + q)^n = \sum_{k=0}^n \binom{n}{k} p^k q^{n-k} e^{tk}.

This could have been derived directly:

    M_X(t) = E[e^{tX}] = \sum_{k=0}^n P\{X = k\}e^{tk} = \sum_{k=0}^n \binom{n}{k} p^k q^{n-k} e^{tk} = (pe^t + q)^n

    by the binomial theorem.

2. Poisson. We have P\{X = k\} = e^{-\lambda}\lambda^k/k!, k = 0, 1, 2, \ldots. Thus

    M(t) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!} e^{tk} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = \exp(-\lambda)\exp(\lambda e^t) = \exp[\lambda(e^t - 1)].

We can compute the mean and variance from the moment-generating function:

    E(X) = M'(0) = [\exp(\lambda(e^t - 1))\lambda e^t]_{t=0} = \lambda.

Let h(\lambda, t) = \exp[\lambda(e^t - 1)]. Then

    E(X^2) = M''(0) = [h(\lambda, t)\lambda e^t + \lambda e^t h(\lambda, t)\lambda e^t]_{t=0} = \lambda + \lambda^2

hence

    Var\,X = E(X^2) - [E(X)]^2 = \lambda + \lambda^2 - \lambda^2 = \lambda.

3. Normal (0,1). The moment-generating function is

    M(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.

Now -(x^2/2) + tx = -(1/2)(x^2 - 2tx + t^2 - t^2) = -(1/2)(x - t)^2 + (1/2)t^2, so

    M(t) = e^{t^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp[-(x - t)^2/2]\,dx.

The integral is the area under a normal density (mean t, variance 1), which is 1. Consequently,

    M(t) = e^{t^2/2}.


4. Normal (\mu, \sigma^2). If X is normal (\mu, \sigma^2), then Y = (X - \mu)/\sigma is normal (0,1). This is a good application of the density function method from Lecture 1:

    f_Y(y) = \frac{f_X(x)}{|dy/dx|}\bigg|_{x=\mu+\sigma y} = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}.

We have X = \mu + \sigma Y, so

    M_X(t) = E[e^{tX}] = e^{\mu t}E[e^{t\sigma Y}] = e^{\mu t}M_Y(\sigma t).

Thus

    M_X(t) = e^{\mu t}e^{t^2\sigma^2/2}.

Remember this technique, which is especially useful when Y = aX + b and the moment-generating function of X is known.

    3.5 Theorem

If X is normal (\mu, \sigma^2) and Y = aX + b, then Y is normal (a\mu + b, a^2\sigma^2).

Proof. We compute

    M_Y(t) = E[e^{tY}] = E[e^{t(aX+b)}] = e^{bt}M_X(at) = e^{bt}e^{\mu at}e^{a^2t^2\sigma^2/2}.

Thus

    M_Y(t) = \exp[t(a\mu + b)]\exp(t^2a^2\sigma^2/2).

    Here is another basic result.

    3.6 Theorem

Let X_1, \ldots, X_n be independent, with X_i normal (\mu_i, \sigma_i^2). Then Y = \sum_{i=1}^n X_i is normal with mean \mu = \sum_{i=1}^n \mu_i and variance \sigma^2 = \sum_{i=1}^n \sigma_i^2.

Proof. The moment-generating function of Y is

    M_Y(t) = \prod_{i=1}^n \exp(t\mu_i + t^2\sigma_i^2/2) = \exp(t\mu + t^2\sigma^2/2).

    A similar argument works for the Poisson distribution; see Problem 4.

3.7 The Gamma Distribution

First, we define the gamma function \Gamma(\alpha) = \int_0^{\infty} y^{\alpha-1}e^{-y}\,dy, \alpha > 0. We need three properties:

(a) \Gamma(\alpha + 1) = \alpha\Gamma(\alpha), the recursion formula;

(b) \Gamma(n + 1) = n!, n = 0, 1, 2, \ldots;


(c) \Gamma(1/2) = \sqrt{\pi}.

To prove (a), integrate by parts: \Gamma(\alpha) = \int_0^{\infty} e^{-y}\,d(y^{\alpha}/\alpha). Part (b) is a special case of (a). For (c) we make the change of variable y = z^2/2 and compute

    \Gamma(1/2) = \int_0^{\infty} y^{-1/2}e^{-y}\,dy = \int_0^{\infty} \sqrt{2}\,z^{-1}e^{-z^2/2}\,z\,dz.

The second integral is \sqrt{2}\,\sqrt{2\pi} times half the area under the normal (0,1) density, that is, \Gamma(1/2) = \sqrt{2}\,\sqrt{2\pi}/2 = \sqrt{\pi}.

The gamma density is

    f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}, \quad x > 0,

where \alpha and \beta are positive constants. The moment-generating function is

    M(t) = \int_0^{\infty} [\Gamma(\alpha)\beta^{\alpha}]^{-1}x^{\alpha-1}e^{tx}e^{-x/\beta}\,dx.

Change variables via y = (-t + (1/\beta))x to get

    \int_0^{\infty} [\Gamma(\alpha)\beta^{\alpha}]^{-1}\left( \frac{y}{-t + (1/\beta)} \right)^{\alpha-1}e^{-y}\,\frac{dy}{-t + (1/\beta)}

which reduces to

    \frac{1}{\beta^{\alpha}}\,\frac{1}{(-t + (1/\beta))^{\alpha}} = (1 - \beta t)^{-\alpha}.

In this argument, t must be less than 1/\beta so that the integrals will be finite.

Since M(0) = \int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} f(x)\,dx in this case, with f \ge 0, M(0) = 1 implies that we have a legal probability density. As before, moments can be calculated efficiently from the moment-generating function:

    E(X) = M'(0) = \alpha(1 - \beta t)^{-\alpha-1}(\beta)\big|_{t=0} = \alpha\beta;

    E(X^2) = M''(0) = \alpha(\alpha + 1)(1 - \beta t)^{-\alpha-2}\beta^2\big|_{t=0} = \alpha(\alpha + 1)\beta^2.

Thus

    Var\,X = E(X^2) - [E(X)]^2 = \alpha\beta^2.

    3.8 Special Cases

The exponential density is a gamma density with \alpha = 1: f(x) = (1/\beta)e^{-x/\beta}, x \ge 0, with E(X) = \beta, E(X^2) = 2\beta^2, Var\,X = \beta^2.


A random variable X has the chi-square density with r degrees of freedom (X = \chi^2(r) for short, where r is a positive integer) if its density is gamma with \alpha = r/2 and \beta = 2. Thus

    f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}\,x^{(r/2)-1}e^{-x/2}, \quad x \ge 0

and

    M(t) = \frac{1}{(1 - 2t)^{r/2}}, \quad t < 1/2.

Therefore E[\chi^2(r)] = \alpha\beta = r and Var[\chi^2(r)] = \alpha\beta^2 = 2r.

    3.9 Lemma

If X is normal (0,1) then X^2 is \chi^2(1).

Proof. We compute the moment-generating function of X^2 directly:

    M_{X^2}(t) = E[e^{tX^2}] = \int_{-\infty}^{\infty} e^{tx^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.

Let y = \sqrt{1 - 2t}\,x; the integral becomes

    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,\frac{dy}{\sqrt{1 - 2t}} = (1 - 2t)^{-1/2}

which is \chi^2(1).

    3.10 Theorem

If X_1, \ldots, X_n are independent, each normal (0,1), then Y = \sum_{i=1}^n X_i^2 is \chi^2(n).

Proof. By (3.9), each X_i^2 is \chi^2(1) with moment-generating function (1 - 2t)^{-1/2}. Thus M_Y(t) = (1 - 2t)^{-n/2} for t < 1/2, which is the moment-generating function of \chi^2(n).


    3.12 The Poisson Process

This process occurs in many physical situations, and provides an application of the gamma distribution. For example, particles can arrive at a counting device, customers at a serving counter, airplanes at an airport, or phone calls at a telephone exchange. Divide the time interval [0, t] into a large number n of small subintervals of length dt, so that n\,dt = t. If I_i, i = 1, \ldots, n, is one of the small subintervals, we make the following assumptions:

(1) The probability of exactly one arrival in I_i is \lambda\,dt, where \lambda is a constant.
(2) The probability of no arrivals in I_i is 1 - \lambda\,dt.
(3) The probability of more than one arrival in I_i is zero.
(4) If A_i is the event of an arrival in I_i, then the A_i, i = 1, \ldots, n are independent.

As a consequence of these assumptions, we have n = t/dt Bernoulli trials with probability of success p = \lambda\,dt on a given trial. As dt \to 0 we have n \to \infty and p \to 0, with np = \lambda t. We conclude that the number N[0, t] of arrivals in [0, t] is Poisson (\lambda t):

    P\{N[0, t] = k\} = e^{-\lambda t}(\lambda t)^k/k!, \quad k = 0, 1, 2, \ldots.

Since E(N[0, t]) = \lambda t, we may interpret \lambda as the average number of arrivals per unit time.

Now let W_1 be the waiting time for the first arrival. Then

    P\{W_1 > t\} = P\{\text{no arrival in } [0, t]\} = P\{N[0, t] = 0\} = e^{-\lambda t}, \quad t \ge 0.

Thus F_{W_1}(t) = 1 - e^{-\lambda t} and f_{W_1}(t) = \lambda e^{-\lambda t}, t \ge 0. From the formulas for the mean and variance of an exponential random variable we have E(W_1) = 1/\lambda and Var\,W_1 = 1/\lambda^2.

Let W_k be the (total) waiting time for the k-th arrival. Then W_k is the waiting time for the first arrival, plus the time after the first up to the second arrival, plus \cdots plus the time after arrival k-1 up to the k-th arrival. Thus W_k is the sum of k independent exponential random variables, and

    M_{W_k}(t) = \frac{1}{(1 - (t/\lambda))^k}

so W_k is gamma with \alpha = k, \beta = 1/\lambda. Therefore

    f_{W_k}(t) = \frac{1}{(k-1)!}\,\lambda^k t^{k-1}e^{-\lambda t}, \quad t \ge 0.
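A short simulation sketch of the waiting-time result (\lambda and k chosen arbitrarily): W_k, built as a sum of k independent exponentials, has an empirical distribution matching SciPy's gamma distribution function.

import numpy as np
from scipy.stats import gamma

lam, k = 2.0, 5
rng = np.random.default_rng(3)
# W_k = sum of k independent exponential(1/lam) interarrival times
w = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

for t in (1.0, 2.5, 4.0):
    print(np.mean(w <= t), gamma.cdf(t, a=k, scale=1 / lam))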

    Problems

1. Let X_1 and X_2 be independent, and assume that X_1 is \chi^2(r_1) and Y = X_1 + X_2 is \chi^2(r), where r > r_1. Show that X_2 is \chi^2(r_2), where r_2 = r - r_1.

2. Let X_1 and X_2 be independent, with X_i gamma with parameters \alpha_i and \beta_i, i = 1, 2. If c_1 and c_2 are positive constants, find convenient sufficient conditions under which c_1X_1 + c_2X_2 will also have the gamma distribution.

3. If X_1, \ldots, X_n are independent random variables with moment-generating functions M_1, \ldots, M_n, and c_1, \ldots, c_n are constants, express the moment-generating function M of c_1X_1 + \cdots + c_nX_n in terms of the M_i.


4. If X_1, \ldots, X_n are independent, with X_i Poisson (\lambda_i), i = 1, \ldots, n, show that the sum Y = \sum_{i=1}^n X_i has the Poisson distribution with parameter \lambda = \sum_{i=1}^n \lambda_i.

5. An unbiased coin is tossed independently n_1 times and then again tossed independently n_2 times. Let X_1 be the number of heads in the first experiment, and X_2 the number of tails in the second experiment. Without using moment-generating functions, in fact without any calculation at all, find the distribution of X_1 + X_2.


    Lecture 4. Sampling From a Normal Population

    4.1 Definitions and Comments

Let X_1, \ldots, X_n be iid. The sample mean of the X_i is

    \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i

and the sample variance is

    S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.

If the X_i have mean \mu and variance \sigma^2, then

    E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}\,n\mu = \mu

and

    Var\,\bar{X} = \frac{1}{n^2}\sum_{i=1}^n Var\,X_i = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \to 0 \text{ as } n \to \infty.

Thus \bar{X} is a good estimate of \mu. (For large n, the variance of \bar{X} is small, so \bar{X} is concentrated near its mean.) The sample variance is an average squared deviation from the sample mean, but it is a biased estimate of the true variance \sigma^2:

    E[(X_i - \bar{X})^2] = E[(X_i - \mu) - (\bar{X} - \mu)]^2 = Var\,X_i + Var\,\bar{X} - 2E[(X_i - \mu)(\bar{X} - \mu)].

Notice the centralizing technique. We subtract and add back the mean of X_i, which will make the cross terms easier to handle when squaring. The above expression simplifies to

    \sigma^2 + \frac{\sigma^2}{n} - 2E\Big[(X_i - \mu)\,\frac{1}{n}\sum_{j=1}^n (X_j - \mu)\Big] = \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n}E[(X_i - \mu)^2].

Thus

    E[(X_i - \bar{X})^2] = \sigma^2\Big(1 + \frac{1}{n} - \frac{2}{n}\Big) = \frac{n-1}{n}\,\sigma^2.

Consequently, E(S^2) = (n-1)\sigma^2/n, not \sigma^2. Some books define the sample variance as

    \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n}{n-1}\,S^2

where S^2 is our sample variance. This adjusted estimate of the true variance is unbiased (its expectation is \sigma^2), but biased does not mean bad. If we measure performance by asking for a small mean square error, the biased estimate is better in the normal case, as we will see at the end of the lecture.
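A quick simulation sketch of the bias computation (normal data with arbitrary parameters): averaging S^2 over many samples gives roughly (n-1)\sigma^2/n, while the n/(n-1) version averages roughly \sigma^2.

import numpy as np

rng = np.random.default_rng(4)
n, mu, sigma = 5, 1.0, 2.0
x = rng.normal(mu, sigma, size=(200_000, n))

s2 = x.var(axis=1)                               # divides by n, matching the S^2 above
print(s2.mean(), (n - 1) / n * sigma**2)         # both near 3.2
print((n / (n - 1)) * s2.mean(), sigma**2)       # both near 4.0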


    4.2 The Normal Case

We now assume that the X_i are normally distributed, and find the distribution of S^2. Let y_1 = \bar{x} = (x_1 + \cdots + x_n)/n, y_2 = x_2 - \bar{x}, \ldots, y_n = x_n - \bar{x}. Then y_1 + y_2 = x_2, y_1 + y_3 = x_3, \ldots, y_1 + y_n = x_n. Add these equations to get (n-1)y_1 + y_2 + \cdots + y_n = x_2 + \cdots + x_n, or

    ny_1 + (y_2 + \cdots + y_n) = (x_2 + \cdots + x_n) + y_1.    (1)

But ny_1 = n\bar{x} = x_1 + \cdots + x_n, so by cancelling x_2, \ldots, x_n in (1), x_1 + (y_2 + \cdots + y_n) = y_1. Thus we can solve for the x's in terms of the y's:

    x_1 = y_1 - y_2 - \cdots - y_n
    x_2 = y_1 + y_2
    x_3 = y_1 + y_3                                              (2)
    \vdots
    x_n = y_1 + y_n

The Jacobian of the transformation is

    d_n = \frac{\partial(x_1, \ldots, x_n)}{\partial(y_1, \ldots, y_n)} = \begin{vmatrix} 1 & -1 & -1 & \cdots & -1 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{vmatrix}

To see the pattern, look at the 4 by 4 case and expand via the last row:

    \begin{vmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{vmatrix} = (-1)\begin{vmatrix} -1 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix} + \begin{vmatrix} 1 & -1 & -1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{vmatrix}

so d_4 = 1 + d_3. In general, d_n = 1 + d_{n-1}, and since d_2 = 2 by inspection, we have d_n = n for all n \ge 2. Now

    \sum_{i=1}^n (x_i - \mu)^2 = \sum (x_i - \bar{x} + \bar{x} - \mu)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2    (3)

because \sum (x_i - \bar{x}) = 0. By (2), x_1 - \bar{x} = x_1 - y_1 = -y_2 - \cdots - y_n and x_i - \bar{x} = x_i - y_1 = y_i for i = 2, \ldots, n. (Remember that y_1 = \bar{x}.) Thus

    \sum_{i=1}^n (x_i - \bar{x})^2 = (-y_2 - \cdots - y_n)^2 + \sum_{i=2}^n y_i^2    (4)

Now

    f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = n\,f_{X_1 \cdots X_n}(x_1, \ldots, x_n).


By (3) and (4), the right side becomes, in terms of the y_i's,

    n\left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n \exp\left[ -\frac{1}{2\sigma^2}\left( \Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2 + n(y_1 - \mu)^2 \right) \right].

The joint density of Y_1, \ldots, Y_n is a function of y_1 times a function of (y_2, \ldots, y_n), so Y_1 and (Y_2, \ldots, Y_n) are independent. Since \bar{X} = Y_1 and [by (4)] S^2 is a function of (Y_2, \ldots, Y_n),

    \bar{X} and S^2 are independent.

Dividing Equation (3) by \sigma^2 we have

    \sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma} \right)^2 = \frac{nS^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2.

But (X_i - \mu)/\sigma is normal (0,1) and

    \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \text{ is normal (0,1)}

so \chi^2(n) = (nS^2/\sigma^2) + \chi^2(1) with the two random variables on the right independent. If M(t) is the moment-generating function of nS^2/\sigma^2, then (1 - 2t)^{-n/2} = M(t)(1 - 2t)^{-1/2}. Therefore M(t) = (1 - 2t)^{-(n-1)/2}, i.e.,

    \frac{nS^2}{\sigma^2} \text{ is } \chi^2(n-1).

The random variable

    T = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}

is useful in situations where \mu is to be estimated but the true variance \sigma^2 is unknown. It turns out that T has a T distribution, which we study in the next lecture.

    4.3 Performance of Various Estimates

Let S^2 be the sample variance of iid normal (\mu, \sigma^2) random variables X_1, \ldots, X_n. We will look at estimates of \sigma^2 of the form cS^2, where c is a constant. Once again employing the centralizing technique, we write

    E[(cS^2 - \sigma^2)^2] = E[(cS^2 - cE(S^2) + cE(S^2) - \sigma^2)^2]

which simplifies to

    c^2\,Var\,S^2 + (cE(S^2) - \sigma^2)^2.


Since nS^2/\sigma^2 is \chi^2(n-1), which has variance 2(n-1), we have n^2(Var\,S^2)/\sigma^4 = 2(n-1). Also nE(S^2)/\sigma^2 is the mean of \chi^2(n-1), which is n-1. (Or we can recall from (4.1) that E(S^2) = (n-1)\sigma^2/n.) Thus the mean square error is

    c^2\,\frac{2\sigma^4(n-1)}{n^2} + \left( c\,\frac{(n-1)}{n}\,\sigma^2 - \sigma^2 \right)^2.

We can drop the \sigma^4 and use n^2 as a common denominator, which can also be dropped. We are then trying to minimize

    2c^2(n-1) + c^2(n-1)^2 - 2c(n-1)n + n^2.

Differentiate with respect to c and set the result equal to zero:

    4c(n-1) + 2c(n-1)^2 - 2(n-1)n = 0.

Dividing by 2(n-1), we have 2c + c(n-1) - n = 0, so c = n/(n+1). Thus the best estimate of the form cS^2 is

    \frac{1}{n+1}\sum_{i=1}^n (X_i - \bar{X})^2.

If we use S^2 then c = 1. If we use the unbiased version then c = n/(n-1). Since [n/(n+1)] < 1 < [n/(n-1)] and a quadratic function decreases as we move toward its minimum, we see that the biased estimate S^2 is better than the unbiased estimate nS^2/(n-1), but neither is optimal under the minimum mean square error criterion. Explicitly, when c = n/(n-1) we get a mean square error of 2\sigma^4/(n-1), and when c = 1 we get

    \frac{\sigma^4}{n^2}\left[ 2(n-1) + (n - 1 - n)^2 \right] = \frac{(2n-1)\sigma^4}{n^2}

which is always smaller, because [(2n-1)/n^2] < 2/(n-1) iff 2n^2 > 2n^2 - 3n + 1 iff 3n > 1, which is true for every positive integer n.

For large n all these estimates are good and the difference between their performance is small.
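For a concrete comparison, the three choices of c can be simulated (a sketch with arbitrary \sigma^2); the mean square errors come out in the order derived above, with c = n/(n+1) the smallest.

import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 5, 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(300_000, n))
s2 = x.var(axis=1)                          # S^2 with divisor n

for c in (n / (n + 1), 1.0, n / (n - 1)):
    mse = np.mean((c * s2 - sigma2) ** 2)
    print(round(c, 3), mse)
# Theoretical values: 2*sigma^4/(n+1), (2n-1)*sigma^4/n^2, 2*sigma^4/(n-1)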

    Problems

1. Let X_1, \ldots, X_n be iid, each normal (\mu, \sigma^2), and let \bar{X} be the sample mean. If c is a constant, we wish to make n large enough so that P\{\mu - c < \bar{X} < \mu + c\} \ge .954. Find the minimum value of n in terms of \sigma^2 and c. (It is independent of \mu.)

2. Let X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2} be independent random variables, with the X_i normal (\mu_1, \sigma_1^2) and the Y_i normal (\mu_2, \sigma_2^2). If \bar{X} is the sample mean of the X_i and \bar{Y} is the sample mean of the Y_i, explain how to compute the probability that \bar{X} > \bar{Y}.

3. Let X_1, \ldots, X_n be iid, each normal (\mu, \sigma^2), and let S^2 be the sample variance. Explain how to compute P\{a < S^2 < b\}.

4. Let S^2 be the sample variance of iid normal (\mu, \sigma^2) random variables X_i, i = 1, \ldots, n. Calculate the moment-generating function of S^2 and from this, deduce that S^2 has a gamma distribution.


    Lecture 5. The T and F Distributions

    5.1 Definition and Discussion

The T distribution is defined as follows. Let X_1 and X_2 be independent, with X_1 normal (0,1) and X_2 chi-square with r degrees of freedom. The random variable Y_1 = \sqrt{r}\,X_1/\sqrt{X_2} has the T distribution with r degrees of freedom.

To find the density of Y_1, let Y_2 = X_2. Then X_1 = Y_1\sqrt{Y_2}/\sqrt{r} and X_2 = Y_2. The transformation is one-to-one with -\infty < X_1 < \infty, X_2 > 0 corresponding to -\infty < Y_1 < \infty, Y_2 > 0. The Jacobian is given by

    \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \sqrt{y_2}/\sqrt{r} & y_1/(2\sqrt{r y_2}) \\ 0 & 1 \end{vmatrix} = \sqrt{y_2}/\sqrt{r}.

Thus f_{Y_1Y_2}(y_1, y_2) = f_{X_1X_2}(x_1, x_2)\sqrt{y_2}/\sqrt{r}, which upon substitution for x_1 and x_2 becomes

    \frac{1}{\sqrt{2\pi}}\exp[-y_1^2y_2/2r]\,\frac{1}{\Gamma(r/2)2^{r/2}}\,y_2^{(r/2)-1}e^{-y_2/2}\,\sqrt{y_2}/\sqrt{r}.

The density of Y_1 is

    \frac{1}{\sqrt{2\pi}\,\Gamma(r/2)2^{r/2}}\int_0^{\infty} y_2^{[(r+1)/2]-1}\exp[-(1 + (y_1^2/r))y_2/2]\,dy_2\,/\sqrt{r}.

With z = (1 + (y_1^2/r))y_2/2 and the observation that all factors of 2 cancel, this becomes (with y_1 replaced by t)

    \frac{\Gamma((r+1)/2)}{\sqrt{\pi r}\,\Gamma(r/2)}\,\frac{1}{(1 + (t^2/r))^{(r+1)/2}}, \quad -\infty < t < \infty,

the T density with r degrees of freedom.

In sampling from a normal population, (\bar{X} - \mu)/(\sigma/\sqrt{n}) is normal (0,1), and nS^2/\sigma^2 is \chi^2(n-1). Thus

    \sqrt{n-1}\,(\bar{X} - \mu)\sqrt{n}/\sigma \text{ divided by } \sqrt{n}\,S/\sigma \text{ is } T(n-1).

Since \sigma and \sqrt{n} disappear after cancellation, we have

    \frac{\bar{X} - \mu}{S/\sqrt{n-1}} \text{ is } T(n-1).

Advocates of defining the sample variance with n-1 in the denominator point out that one can simply replace \sigma by S in (\bar{X} - \mu)/(\sigma/\sqrt{n}) to get the T statistic.

Intuitively, we expect that for large n, (\bar{X} - \mu)/(S/\sqrt{n-1}) has approximately the same distribution as (\bar{X} - \mu)/(\sigma/\sqrt{n}), i.e., normal (0,1). This is in fact true, as suggested by the following computation:

    \left( 1 + \frac{t^2}{r} \right)^{-(r+1)/2} = \left[ \left( 1 + \frac{t^2}{r} \right)^r \left( 1 + \frac{t^2}{r} \right) \right]^{-1/2} \to \left[ e^{t^2}\cdot 1 \right]^{-1/2} = e^{-t^2/2}

as r \to \infty.
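A simulation sketch of the statistic (normal samples with arbitrary parameters), comparing (\bar{X} - \mu)/(S/\sqrt{n-1}) with SciPy's t distribution on n-1 degrees of freedom:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
n, mu, sigma = 8, 3.0, 2.0
x = rng.normal(mu, sigma, size=(200_000, n))
xbar = x.mean(axis=1)
s = x.std(axis=1)                          # divisor n, as in these notes
T = (xbar - mu) / (s / np.sqrt(n - 1))

for c in (1.0, 2.0, 3.0):
    print(np.mean(T <= c), t.cdf(c, df=n - 1))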


    5.2 A Preliminary Calculation

Before turning to the F distribution, we calculate the density of U = X_1/X_2 where X_1 and X_2 are independent, positive random variables. Let Y = X_2, so that X_1 = UY, X_2 = Y (X_1, X_2, U, Y are all greater than zero). The Jacobian is

    \frac{\partial(x_1, x_2)}{\partial(u, y)} = \begin{vmatrix} y & u \\ 0 & 1 \end{vmatrix} = y.

Thus f_{UY}(u, y) = f_{X_1X_2}(x_1, x_2)\,y = y\,f_{X_1}(uy)f_{X_2}(y), and the density of U is

    h(u) = \int_0^{\infty} y\,f_{X_1}(uy)f_{X_2}(y)\,dy.

Now we take X_1 to be \chi^2(m), and X_2 to be \chi^2(n). The density of X_1/X_2 is

    h(u) = \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)}\,u^{(m/2)-1}\int_0^{\infty} y^{[(m+n)/2]-1}e^{-y(1+u)/2}\,dy.

The substitution z = y(1 + u)/2 gives

    h(u) = \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)}\,u^{(m/2)-1}\int_0^{\infty} \frac{z^{[(m+n)/2]-1}}{[(1+u)/2]^{[(m+n)/2]-1}}\,e^{-z}\,\frac{2}{1+u}\,dz.

We abbreviate \Gamma(a)\Gamma(b)/\Gamma(a+b) by \beta(a, b). (We will have much more to say about this when we discuss the beta distribution later in the lecture.) The above formula simplifies to

    h(u) = \frac{1}{\beta(m/2, n/2)}\,\frac{u^{(m/2)-1}}{(1+u)^{(m+n)/2}}, \quad u \ge 0.

    5.3 Definition and Discussion

The F density is defined as follows. Let X_1 and X_2 be independent, with X_1 = \chi^2(m) and X_2 = \chi^2(n). With U as in (5.2), let

    W = \frac{X_1/m}{X_2/n} = \frac{n}{m}\,U

so that

    f_W(w) = f_U(u)\left| \frac{du}{dw} \right| = \frac{m}{n}\,f_U\left( \frac{m}{n}\,w \right).

Thus W has density

    \frac{(m/n)^{m/2}}{\beta(m/2, n/2)}\,\frac{w^{(m/2)-1}}{[1 + (m/n)w]^{(m+n)/2}}, \quad w \ge 0,

the F density with m and n degrees of freedom.


    5.4 Definitions and Calculations

The beta function is given by

    \beta(a, b) = \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx, \quad a, b > 0.

We will show that

    \beta(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}

which is consistent with our use of \beta(a, b) as an abbreviation in (5.2). We make the change of variable t = x^2 to get

    \Gamma(a) = \int_0^{\infty} t^{a-1}e^{-t}\,dt = 2\int_0^{\infty} x^{2a-1}e^{-x^2}\,dx.

We now use the familiar trick of writing \Gamma(a)\Gamma(b) as a double integral and switching to polar coordinates. Thus

    \Gamma(a)\Gamma(b) = 4\int_0^{\infty}\int_0^{\infty} x^{2a-1}y^{2b-1}e^{-(x^2+y^2)}\,dx\,dy
    = 4\int_0^{\pi/2} d\theta \int_0^{\infty} (\cos\theta)^{2a-1}(\sin\theta)^{2b-1}e^{-r^2}r^{2a+2b-1}\,dr.

The change of variable u = r^2 yields

    \int_0^{\infty} r^{2a+2b-1}e^{-r^2}\,dr = (1/2)\int_0^{\infty} u^{a+b-1}e^{-u}\,du = \Gamma(a + b)/2.

Thus

    \frac{\Gamma(a)\Gamma(b)}{2\Gamma(a + b)} = \int_0^{\pi/2} (\cos\theta)^{2a-1}(\sin\theta)^{2b-1}\,d\theta.

Let z = \cos^2\theta, 1 - z = \sin^2\theta, dz = -2\cos\theta\sin\theta\,d\theta = -2z^{1/2}(1 - z)^{1/2}\,d\theta. The above integral becomes

    -\frac{1}{2}\int_1^0 z^{a-1}(1 - z)^{b-1}\,dz = \frac{1}{2}\int_0^1 z^{a-1}(1 - z)^{b-1}\,dz = \frac{1}{2}\,\beta(a, b)

as claimed. The beta density is

    f(x) = \frac{1}{\beta(a, b)}\,x^{a-1}(1 - x)^{b-1}, \quad 0 \le x \le 1 \quad (a, b > 0).


    Problems

1. Let X have the beta distribution with parameters a and b. Find the mean and variance of X.

2. Let T have the T distribution with 15 degrees of freedom. Find the value of c which makes P\{-c \le T \le c\} = .95.

3. Let W have the F distribution with m and n degrees of freedom (abbreviated W = F(m, n)). Find the distribution of 1/W.

4. A typical table of the F distribution gives the values of c for which P\{W \le c\} = .9, .95, .975 and .99. Explain how to find the values of c for which P\{W \le c\} = .1, .05, .025 and .01. (Use the result of Problem 3.)

5. Let X have the T distribution with n degrees of freedom (abbreviated X = T(n)). Show that T^2(n) = F(1, n), in other words, T^2 has an F distribution with 1 and n degrees of freedom.

6. If X has the exponential density e^{-x}, x \ge 0, show that 2X is \chi^2(2). Deduce that the quotient of two exponential random variables is F(2, 2).


    Lecture 6. Order Statistics

    6.1 The Multinomial Formula

Suppose we pick a letter from {A, B, C}, with P(A) = p_1 = .3, P(B) = p_2 = .5, P(C) = p_3 = .2. If we do this independently 10 times, we will find the probability that the resulting sequence contains exactly 4 A's, 3 B's and 3 C's.

The probability of AAAABBBCCC, in that order, is p_1^4 p_2^3 p_3^3. To generate all favorable cases, select 4 positions out of 10 for the A's, then 3 positions out of the remaining 6 for the B's. The positions for the C's are then determined. One possibility is BCAABACCAB. The number of favorable cases is

    \binom{10}{4}\binom{6}{3} = \frac{10!}{4!\,6!}\,\frac{6!}{3!\,3!} = \frac{10!}{4!\,3!\,3!}.

Therefore the probability of exactly 4 A's, 3 B's and 3 C's is

    \frac{10!}{4!\,3!\,3!}\,(.3)^4(.5)^3(.2)^3.

In general, consider n independent trials such that on each trial, the result is exactly one of the events A_1, \ldots, A_r, with probabilities p_1, \ldots, p_r respectively. Then the probability that A_1 occurs exactly n_1 times, \ldots, A_r occurs exactly n_r times, is

    p_1^{n_1} \cdots p_r^{n_r}\,\binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3} \cdots \binom{n - n_1 - \cdots - n_{r-2}}{n_{r-1}}

which reduces to the multinomial formula

    \frac{n!}{n_1! \cdots n_r!}\,p_1^{n_1} \cdots p_r^{n_r}

where the p_i are nonnegative real numbers that sum to 1, and the n_i are nonnegative integers that sum to n.

Now let X_1, \ldots, X_n be iid, each with density f(x) and distribution function F(x). Let Y_1 < Y_2 < \cdots < Y_n be the X_i arranged in increasing order, the order statistics of the X_i. Then

    P\{Y_1 > x\} = P\{\text{all } X_i > x\} = \prod_{i=1}^n P\{X_i > x\} = [1 - F(x)]^n.


Therefore

    F_{Y_1}(x) = 1 - [1 - F(x)]^n \quad \text{and} \quad f_{Y_1}(x) = n[1 - F(x)]^{n-1}f(x).

We compute f_{Y_k}(x) by asking how it can happen that x \le Y_k \le x + dx (see Figure 6.1). There must be k-1 random variables less than x, one random variable between x and x + dx, and n-k random variables greater than x. (We are taking dx so small that the probability that more than one random variable falls in [x, x + dx] is negligible, and P\{X_i > x\} is essentially the same as P\{X_i > x + dx\}. Not everyone is comfortable with this reasoning, but the intuition is very strong and can be made precise.) By the multinomial formula,

    f_{Y_k}(x)\,dx = \frac{n!}{(k-1)!\,1!\,(n-k)!}\,[F(x)]^{k-1}f(x)\,dx\,[1 - F(x)]^{n-k}

so

    f_{Y_k}(x) = \frac{n!}{(k-1)!\,1!\,(n-k)!}\,[F(x)]^{k-1}[1 - F(x)]^{n-k}f(x).

Similar reasoning (see Figure 6.2) allows us to write down the joint density f_{Y_jY_k}(x, y) of Y_j and Y_k for j < k, namely

    \frac{n!}{(j-1)!\,(k-j-1)!\,(n-k)!}\,[F(x)]^{j-1}[F(y) - F(x)]^{k-j-1}[1 - F(y)]^{n-k}f(x)f(y)

for x < y, and 0 elsewhere. [We drop the term 1! (= 1), which we retained for emphasis in the formula for f_{Y_k}(x).]

[Figure 6.1: the real line divided at x and x + dx, with k-1 observations to the left of x, one in [x, x + dx], and n-k to the right.]

[Figure 6.2: the real line divided at x, x + dx, y, y + dy, with j-1 observations below x, one in [x, x + dx], k-j-1 between x + dx and y, one in [y, y + dy], and n-k above y + dy.]
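A simulation sketch of the f_{Y_k} formula for uniform observations; the empirical distribution of Y_k matches the beta distribution with parameters k and n-k+1 (compare Problem 3 below):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(7)
n, k = 7, 3
u = rng.uniform(size=(100_000, n))
yk = np.sort(u, axis=1)[:, k - 1]       # k-th smallest of each sample

for x in (0.2, 0.4, 0.6):
    print(np.mean(yk <= x), beta.cdf(x, k, n - k + 1))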

    Problems

1. Let Y_1 < Y_2 < Y_3 be the order statistics of X_1, X_2 and X_3, where the X_i are uniformly distributed between 0 and 1. Find the density of Z = Y_3 - Y_1.

2. The formulas derived in this lecture assume that we are in the continuous case (the distribution function F is continuous). The formulas do not apply if the X_i are discrete. Why not?


3. Consider order statistics where the X_i, i = 1, \ldots, n, are uniformly distributed between 0 and 1. Show that Y_k has a beta distribution, and express the parameters a and b in terms of k and n.

4. In Problem 3, let 0 < p < 1, and express P\{Y_k > p\} as the probability of an event associated with a sequence of n Bernoulli trials with probability of success p on a given trial. Write P\{Y_k > p\} as a finite sum involving n, p and k.


    Lecture 7. The Weak Law of Large Numbers

7.1 Chebyshev's Inequality

(a) If X \ge 0 and a > 0, then P\{X \ge a\} \le E(X)/a.
(b) If X is an arbitrary random variable, c any real number, and \epsilon > 0, m > 0, then P\{|X - c| \ge \epsilon\} \le E(|X - c|^m)/\epsilon^m.
(c) If X has finite mean \mu and finite variance \sigma^2, then P\{|X - \mu| \ge k\sigma\} \le 1/k^2.

This is a universal bound, but it may be quite weak in specific cases. For example, if X is normal (\mu, \sigma^2), abbreviated N(\mu, \sigma^2), then

    P\{|X - \mu| \ge 1.96\sigma\} = P\{|N(0, 1)| \ge 1.96\} = 2(1 - \Phi(1.96)) = .05

where \Phi is the distribution function of a normal (0,1) random variable. But the Chebyshev bound is 1/(1.96)^2 = .26.

Proof. (a) If X has density f, then

    E(X) = \int_0^{\infty} xf(x)\,dx = \int_0^a xf(x)\,dx + \int_a^{\infty} xf(x)\,dx

so

    E(X) \ge 0 + a\int_a^{\infty} f(x)\,dx = aP\{X \ge a\}.

(b) P\{|X - c| \ge \epsilon\} = P\{|X - c|^m \ge \epsilon^m\} \le E(|X - c|^m)/\epsilon^m by (a).
(c) By (b) with c = \mu, \epsilon = k\sigma, m = 2, we have

    P\{|X - \mu| \ge k\sigma\} \le \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{1}{k^2}.

    7.2 Weak Law of Large Numbers

Let X_1, \ldots, X_n be iid with finite mean \mu and finite variance \sigma^2. For large n, the arithmetic average of the observations is very likely to be very close to the true mean \mu. Formally, if S_n = X_1 + \cdots + X_n, then for any \epsilon > 0,

    P\left\{ \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right\} \to 0 \text{ as } n \to \infty.

Proof.

    P\left\{ \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right\} = P\{|S_n - n\mu| \ge n\epsilon\} \le \frac{E[(S_n - n\mu)^2]}{n^2\epsilon^2}

by Chebyshev (b). The term on the right is

    \frac{Var\,S_n}{n^2\epsilon^2} = \frac{n\sigma^2}{n^2\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0.
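A small numerical illustration of the weak law, and of how conservative the Chebyshev bound can be (a sketch with standard exponential data, so \mu = \sigma^2 = 1):

import numpy as np

rng = np.random.default_rng(8)
mu, eps = 1.0, 0.1
for n in (10, 100, 1000, 10_000):
    x = rng.exponential(size=(20_000, n))
    prob = np.mean(np.abs(x.mean(axis=1) - mu) >= eps)
    print(n, prob, 1.0 / (n * eps**2))   # empirical probability vs. Chebyshev bound sigma^2/(n eps^2)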


    7.3 Bernoulli Trials

Let X_i = 1 if there is a success on trial i, and X_i = 0 if there is a failure. Thus X_i is the indicator of a success on trial i, often written as I[Success on trial i]. Then S_n/n is the relative frequency of success, and for large n, this is very likely to be very close to the true probability p of success.

    7.4 Definitions and Comments

The convergence illustrated by the weak law of large numbers is called convergence in probability. Explicitly, S_n/n converges in probability to \mu. In general, X_n \xrightarrow{P} X means that for every \epsilon > 0, P\{|X_n - X| \ge \epsilon\} \to 0 as n \to \infty. Thus for large n, X_n is very likely to be very close to X. If X_n converges in probability to X, then X_n converges to X in distribution: if F_n is the distribution function of X_n and F is the distribution function of X, then F_n(x) \to F(x) at every x where F is continuous. To see that the continuity requirement is needed, look at Figure 7.1. In this example, X_n is uniformly distributed between 0 and 1/n, and X is identically 0. We have X_n \xrightarrow{P} 0 because P\{|X_n| \ge \epsilon\} is actually 0 for large n. However, F_n(x) \to F(x) for x \ne 0, but not at x = 0.

To prove that convergence in probability implies convergence in distribution:

    F_n(x) = P\{X_n \le x\} = P\{X_n \le x, X > x + \epsilon\} + P\{X_n \le x, X \le x + \epsilon\}
    \le P\{|X_n - X| \ge \epsilon\} + P\{X \le x + \epsilon\} = P\{|X_n - X| \ge \epsilon\} + F(x + \epsilon);

    F(x - \epsilon) = P\{X \le x - \epsilon\} = P\{X \le x - \epsilon, X_n > x\} + P\{X \le x - \epsilon, X_n \le x\}
    \le P\{|X_n - X| \ge \epsilon\} + P\{X_n \le x\} = P\{|X_n - X| \ge \epsilon\} + F_n(x).

Therefore

    F(x - \epsilon) - P\{|X_n - X| \ge \epsilon\} \le F_n(x) \le P\{|X_n - X| \ge \epsilon\} + F(x + \epsilon).

Since X_n converges in probability to X, we have P\{|X_n - X| \ge \epsilon\} \to 0 as n \to \infty. If F is continuous at x, then F(x - \epsilon) and F(x + \epsilon) approach F(x) as \epsilon \to 0. Thus F_n(x) is boxed between two quantities that can be made arbitrarily close to F(x), so F_n(x) \to F(x).

    7.5 Some Sufficient Conditions

In practice, P\{|X_n - X| \ge \epsilon\} may be difficult to compute, and it is useful to have sufficient conditions for convergence in probability that can often be easily checked.

(1) If E[(X_n - X)^2] \to 0 as n \to \infty, then X_n \xrightarrow{P} X.
(2) If E(X_n) \to E(X) and Var(X_n - X) \to 0, then X_n \xrightarrow{P} X.

Proof. The first statement follows from Chebyshev (b):

    P\{|X_n - X| \ge \epsilon\} \le \frac{E[(X_n - X)^2]}{\epsilon^2} \to 0.


To prove (2), note that

    E[(X_n - X)^2] = Var(X_n - X) + [E(X_n) - E(X)]^2 \to 0.

In this result, if X is identically equal to a constant c, then Var(X_n - X) is simply Var\,X_n. Condition (2) then becomes E(X_n) \to c and Var\,X_n \to 0, which implies that X_n converges in probability to c.

    7.6 An Application

In normal sampling, let S_n^2 be the sample variance based on n observations. Let's show that S_n^2 is a consistent estimate of the true variance \sigma^2, that is, S_n^2 \xrightarrow{P} \sigma^2. Since nS_n^2/\sigma^2 is \chi^2(n-1), we have E(nS_n^2/\sigma^2) = n-1 and Var(nS_n^2/\sigma^2) = 2(n-1). Thus E(S_n^2) = (n-1)\sigma^2/n \to \sigma^2 and Var(S_n^2) = 2(n-1)\sigma^4/n^2 \to 0, and the result follows.

[Figure 7.1: F_n(x), the distribution function of the uniform distribution on [0, 1/n], rising from 0 to 1 on [0, 1/n], and its limit F(x), the distribution function of the constant 0, which jumps from 0 to 1 at x = 0.]

    Problems

1. Let X_1, \ldots, X_n be independent, not necessarily identically distributed random variables. Assume that the X_i have finite means \mu_i and finite variances \sigma_i^2, and the variances are uniformly bounded, i.e., for some positive number M we have \sigma_i^2 \le M for all i. Show that (S_n - E(S_n))/n converges in probability to 0. This is a generalization of the weak law of large numbers. For if \mu_i = \mu and \sigma_i^2 = \sigma^2 for all i, then E(S_n) = n\mu, so (S_n/n) - \mu \xrightarrow{P} 0, i.e., S_n/n \xrightarrow{P} \mu.

2. Toss an unbiased coin once. If heads, write down the sequence 10101010\ldots, and if tails, write down the sequence 01010101\ldots. If X_n is the n-th term of the sequence and X = X_1, show that X_n converges to X in distribution but not in probability.


3. Let X_1, \ldots, X_n be iid with finite mean \mu and finite variance \sigma^2. Let \bar{X}_n be the sample mean (X_1 + \cdots + X_n)/n. Find the limiting distribution of \bar{X}_n, i.e., find a random variable X such that \bar{X}_n \xrightarrow{d} X.

4. Let X_n be uniformly distributed between n and n + 1. Show that X_n does not have a limiting distribution. Intuitively, the probability has run away to infinity.


    Lecture 8. The Central Limit Theorem

Intuitively, any random variable that can be regarded as the sum of a large number of small independent components is approximately normal. To formalize, we need the following result, stated without proof.

    8.1 Theorem

If Y_n has moment-generating function M_n, Y has moment-generating function M, and M_n(t) \to M(t) as n \to \infty for all t in some open interval containing the origin, then Y_n \xrightarrow{d} Y.

    8.2 Central Limit Theorem

Let X_1, X_2, \ldots be iid, each with finite mean \mu, finite variance \sigma^2, and moment-generating function M. Then

    Y_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}}

converges in distribution to a random variable that is normal (0,1). Thus for large n, \sum_{i=1}^n X_i is approximately normal.

We will give an informal sketch of the proof. The numerator of Y_n is \sum_{i=1}^n (X_i - \mu), and the random variables X_i - \mu are iid with mean 0 and variance \sigma^2. Thus we may assume without loss of generality that \mu = 0. We have

    M_{Y_n}(t) = E[e^{tY_n}] = E\left[ \exp\left( \frac{t}{\sigma\sqrt{n}}\sum_{i=1}^n X_i \right) \right].

The moment-generating function of \sum_{i=1}^n X_i is [M(t)]^n, so

    M_{Y_n}(t) = \left[ M\left( \frac{t}{\sigma\sqrt{n}} \right) \right]^n.

Now if the density of the X_i is f(x), then

    M\left( \frac{t}{\sigma\sqrt{n}} \right) = \int_{-\infty}^{\infty} \exp\left( \frac{tx}{\sigma\sqrt{n}} \right)f(x)\,dx
    = \int_{-\infty}^{\infty} \left( 1 + \frac{tx}{\sigma\sqrt{n}} + \frac{t^2x^2}{2!\,n\sigma^2} + \frac{t^3x^3}{3!\,n^{3/2}\sigma^3} + \cdots \right)f(x)\,dx
    = 1 + 0 + \frac{t^2}{2n} + \frac{t^3\mu_3}{3!\,n^{3/2}\sigma^3} + \frac{t^4\mu_4}{4!\,n^2\sigma^4} + \cdots

where \mu_k = E[(X_i - \mu)^k]. If we neglect the terms after t^2/2n we have, approximately,

    M_{Y_n}(t) = \left( 1 + \frac{t^2}{2n} \right)^n


which approaches the normal (0,1) moment-generating function e^{t^2/2} as n \to \infty. This argument is very loose but it can be made precise by some estimates based on Taylor's formula with remainder.
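A sketch of the theorem in action: standardized sums of iid uniform random variables (an arbitrary choice) are compared with the normal (0,1) distribution function.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 30
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and standard deviation of uniform(0,1)
x = rng.uniform(size=(200_000, n))
y = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

for c in (-1.0, 0.0, 1.0, 2.0):
    print(c, np.mean(y <= c), norm.cdf(c))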

We proved that if X_n converges in probability to X, then X_n converges in distribution to X. There is a partial converse.

    8.3 Theorem

If X_n converges in distribution to a constant c, then X_n converges in probability to c.

Proof. We estimate the probability that |X_n - c| \ge \epsilon, as follows.

    P\{|X_n - c| \ge \epsilon\} = P\{X_n \ge c + \epsilon\} + P\{X_n \le c - \epsilon\}
    = 1 - P\{X_n < c + \epsilon\} + P\{X_n \le c - \epsilon\}.

Now P\{X_n \le c + (\epsilon/2)\} \le P\{X_n < c + \epsilon\}, so

    P\{|X_n - c| \ge \epsilon\} \le 1 - P\{X_n \le c + (\epsilon/2)\} + P\{X_n \le c - \epsilon\}
    = 1 - F_n(c + (\epsilon/2)) + F_n(c - \epsilon)

where F_n is the distribution function of X_n. But as long as x \ne c, F_n(x) converges to the distribution function of the constant c, so F_n(x) \to 1 if x > c, and F_n(x) \to 0 if x < c. Therefore P\{|X_n - c| \ge \epsilon\} \to 1 - 1 + 0 = 0 as n \to \infty.

    8.4 Remarks

If Y is binomial (n, p), the normal approximation to the binomial allows us to regard Y as approximately normal with mean np and variance npq (with q = 1 - p). According to Box, Hunter and Hunter, Statistics for Experimenters, page 130, the approximation works well in practice if n > 5 and

    \frac{1}{\sqrt{n}}\left| \sqrt{\frac{q}{p}} - \sqrt{\frac{p}{q}} \right| < .3.

If, for example, we wish to estimate the probability that Y = 50 or 51 or 52, we may write this probability as P\{49.5 < Y < 52.5\}, and then evaluate as if Y were normal with mean np and variance np(1 - p). This turns out to be slightly more accurate in practice than using P\{50 \le Y \le 52\}.
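The example can be computed directly (a sketch; n = 100 and p = 1/2 are hypothetical values satisfying the rule of thumb above):

import numpy as np
from scipy.stats import norm, binom

n, p = 100, 0.5                      # hypothetical values, so np = 50
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.cdf(52, n, p) - binom.cdf(49, n, p)              # P{Y = 50, 51 or 52}
with_correction = norm.cdf((52.5 - mu) / sd) - norm.cdf((49.5 - mu) / sd)
without = norm.cdf((52 - mu) / sd) - norm.cdf((50 - mu) / sd)
print(exact, with_correction, without)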

    8.5 Simulation

Most computers can simulate a random variable that is uniformly distributed between 0 and 1. But what if we need a random variable with an arbitrary distribution function F? For example, how would we simulate the random variable with the distribution function of Figure 8.1? The basic idea is illustrated in Figure 8.2. If Y = F(X) where X has the


continuous distribution function F, then Y is uniformly distributed on [0,1]. (In Figure 8.2 we have, for 0 \le y \le 1, P\{Y \le y\} = P\{X \le x\} = F(x) = y.) Thus if X is uniformly distributed on [0,1] and we want Y to have distribution function F, we set X = F(Y), Y = F^{-1}(X).

In Figure 8.1 we must be more precise:

Case 1. 0 \le X \le .3. Let X = (3/70)Y + (15/70), Y = (70X - 15)/3.
Case 2. .3 \le X \le .8. Let Y = 4, so P\{Y = 4\} = .5 as required.
Case 3. .8 \le X \le 1. Let X = (1/10)Y + (4/10), Y = 10X - 4.

In Figure 8.1, replace the F(y)-axis by an x-axis to visualize X versus Y. If y = y_0 corresponds to x = x_0 [i.e., x_0 = F(y_0)], then

    P\{Y \le y_0\} = P\{X \le x_0\} = x_0 = F(y_0)

as desired.
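The same idea in code, for a distribution whose inverse has a closed form (the exponential, used here purely as an illustration): feeding uniform [0,1] samples through F^{-1} produces output with distribution function F.

import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(size=200_000)

# For the exponential, F(y) = 1 - e^{-y} (y >= 0), so F^{-1}(x) = -ln(1 - x).
y = -np.log(1 - x)

for y0 in (0.5, 1.0, 2.0):
    print(np.mean(y <= y0), 1 - np.exp(-y0))   # empirical vs. F(y0)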

[Figure 8.1: a distribution function F(y) that rises linearly as (3/70)y + (15/70) from 0 at y = -5 to .3 at y = 2, stays at .3 up to y = 4, jumps to .8 at y = 4, and then rises linearly as (1/10)y + (4/10) to 1 at y = 6.]

[Figure 8.2: the graph of Y = F(X); a point x on the horizontal axis corresponds to y = F(x) in [0, 1] on the vertical axis.]

    Problems

1. Let X_n be gamma (n, \beta), i.e., X_n has the gamma distribution with parameters n and \beta. Show that X_n is a sum of n independent exponential random variables, and from this derive the limiting distribution of X_n/n.

2. Show that \chi^2(n) is approximately normal for large n (with mean n and variance 2n).


3. Let X_1, \ldots, X_n be iid with density f. Let Y_n be the number of observations that fall into the interval (a, b). Indicate how to use a normal approximation to calculate probabilities involving Y_n.

4. If we have 3 observations 6.45, 3.14, 4.93, and we round off to the nearest integer, we get 6, 3, 5. The sum of the integers is 14, but the actual sum is 14.52. Let X_i, i = 1, \ldots, n be the round-off error of the i-th observation, and assume that the X_i are iid and uniformly distributed on (-1/2, 1/2). Indicate how to use a normal approximation to calculate probabilities involving the total round-off error Y_n = \sum_{i=1}^n X_i.

5. Let X_1, \ldots, X_n be iid with continuous distribution function F, and let Y_1 < \cdots < Y_n be the order statistics of the X_i. Then F(X_1), \ldots, F(X_n) are iid and uniformly distributed on [0,1] (see the discussion of simulation), with order statistics F(Y_1), \ldots, F(Y_n). Show that n(1 - F(Y_n)) converges in distribution to an exponential random variable.


    Lecture 9. Estimation

    9.1 Introduction

In effect the statistician plays a game against nature, who first chooses the state of nature \theta (a number or k-tuple of numbers in the usual case) and performs a random experiment. We do not know \theta but we are allowed to observe the value of a random variable (or random vector) X, called the observable, with density f_\theta(x).

After observing X = x we estimate \theta by \hat\theta(x), which is called a point estimate because it produces a single number which we hope is close to \theta. The main alternative is an interval estimate or confidence interval, which will be discussed in Lectures 10 and 11.

For a point estimate \hat\theta(x) to make sense physically, it must depend only on x, not on the unknown parameter \theta. There are many possible estimates, and there are no general rules for choosing a best estimate. Some practical considerations are:

(a) How much does it cost to collect the data?
(b) Is the performance of the estimate easy to measure, for example, can we compute P\{|\hat\theta(x) - \theta| < \epsilon\}?
(c) Are the advantages of the estimate appropriate for the problem at hand?

    We will study several estimation methods:

    1. Maximum likelihood estimates.

These estimates usually have highly desirable theoretical properties (consistency), and are frequently not difficult to compute.

2. Confidence intervals.

These estimates have a very useful practical feature. We construct an interval from the data, and we will know the probability that our (random) interval actually contains the unknown (but fixed) parameter.

3. Uniformly minimum variance unbiased estimates (UMVUEs).

Mathematical theory generates a large number of examples of these, but as we know, a biased estimate can sometimes be superior.

4. Bayes estimates.

These estimates are appropriate if it is reasonable to assume that the state of nature \theta is a random variable with a known density.

In general, statistical theory produces many reasonable candidates, and practical experience will dictate the choice in a given physical situation.

    9.2 Maximum Likelihood Estimates

We choose \hat\theta(x) = \hat\theta, a value of \theta that makes what we have observed as likely as possible. In other words, let \hat\theta maximize the likelihood function L(\theta) = f_\theta(x), with x fixed. This corresponds to basic statistical philosophy; if what we have observed is more likely under \theta_2 than under \theta_1, we prefer \theta_2 to \theta_1.


    9.3 Example

Let X be binomial (n, \theta). Then the probability that X = x when the true parameter is \theta is

    f_\theta(x) = \binom{n}{x}\theta^x(1 - \theta)^{n-x}, \quad x = 0, 1, \ldots, n.

Maximizing f_\theta(x) is equivalent to maximizing \ln f_\theta(x):

    \frac{\partial}{\partial\theta}\ln f_\theta(x) = \frac{\partial}{\partial\theta}[x\ln\theta + (n - x)\ln(1 - \theta)] = \frac{x}{\theta} - \frac{n - x}{1 - \theta} = 0.

Thus x - \theta x - \theta n + \theta x = 0, so \hat\theta = X/n, the relative frequency of success.

Notation: \hat\theta will be written in terms of random variables, in this case X/n rather than x/n. Thus \hat\theta is itself a random variable.

We have E(\hat\theta) = n\theta/n = \theta, so \hat\theta is unbiased. By the weak law of large numbers, \hat\theta \xrightarrow{P} \theta, i.e., \hat\theta is consistent.
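The calculus can be double-checked numerically (a sketch; the observed x and the value of n are hypothetical): the log-likelihood x\ln\theta + (n-x)\ln(1-\theta) is maximized over a grid, and the maximizer agrees with x/n.

import numpy as np

n, x = 20, 7                                     # hypothetical observed data
theta = np.linspace(0.001, 0.999, 9999)
loglik = x * np.log(theta) + (n - x) * np.log(1 - theta)
print(theta[np.argmax(loglik)], x / n)           # both close to 0.35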

    9.4 Example

Let X_1, \ldots, X_n be iid, normal (\mu, \sigma^2), \theta = (\mu, \sigma^2). Then, with x = (x_1, \ldots, x_n),

    f_\theta(x) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n \exp\left[ -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} \right]

and

    \ln f_\theta(x) = -\frac{n}{2}\ln 2\pi - n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2;

    \frac{\partial}{\partial\mu}\ln f_\theta(x) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0, \quad \sum_{i=1}^n x_i - n\mu = 0, \quad \hat\mu = \bar{x};

    \frac{\partial}{\partial\sigma}\ln f_\theta(x) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{\sigma^3}\left[ -\sigma^2 + \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 \right] = 0

with \mu = \bar{x}. Thus

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = s^2.

Case 1. \mu and \sigma^2 are both unknown. Then \hat\theta = (\bar{X}, S^2).

Case 2. \sigma^2 is known. Then \theta = \mu and \hat\theta = \bar{X} as above. (Differentiation with respect to \sigma is omitted.)


Case 3. \mu is known. Then \theta = \sigma^2 and the equation (\partial/\partial\sigma)\ln f_\theta(x) = 0 becomes

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2

so

    \hat\theta = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2.

The sample mean \bar{X} is an unbiased and (by the weak law of large numbers) consistent estimate of \mu. The sample variance S^2 is a biased but consistent estimate of \sigma^2 (see Lectures 4 and 7).

    Notation: We will abbreviate maximum likelihood estimate by MLE.

    9.5 The MLE of a Function of Theta

Suppose that for a fixed x, f_\theta(x) is a maximum when \theta = \theta_0. Then the value of \theta^2 when f_\theta(x) is a maximum is \theta_0^2. Thus to get the MLE of \theta^2, we simply square the MLE of \theta. In general, if h is any function, then \widehat{h(\theta)} = h(\hat\theta). If h is continuous, then consistency is preserved, in other words:

    If h is continuous and \hat\theta \xrightarrow{P} \theta, then h(\hat\theta) \xrightarrow{P} h(\theta).

Proof. Given \epsilon > 0, there exists \delta > 0 such that if |\hat\theta - \theta| < \delta, then |h(\hat\theta) - h(\theta)| < \epsilon. Consequently,

    P\{|h(\hat\theta) - h(\theta)| \ge \epsilon\} \le P\{|\hat\theta - \theta| \ge \delta\} \to 0 \text{ as } n \to \infty.

(To justify the above inequality, note that if the occurrence of an event A implies the occurrence of an event B, then P(A) \le P(B).)

    9.6 The Method of Moments

This is sometimes a quick way to obtain reasonable estimates. We set the observed k-th moment n^{-1}\sum_{i=1}^n x_i^k equal to the theoretical k-th moment E(X_i^k) (which will depend on the unknown parameter). Or we set the observed k-th central moment n^{-1}\sum_{i=1}^n (x_i - \bar{x})^k equal to the theoretical k-th central moment E[(X_i - \mu)^k]. For example, let X_1, \ldots, X_n be iid, gamma with \alpha = \theta_1, \beta = \theta_2, with \theta_1, \theta_2 > 0. Then E(X_i) = \mu = \theta_1\theta_2 and Var\,X_i = \sigma^2 = \theta_1\theta_2^2 (see Lecture 3). We set

    \bar{X} = \theta_1\theta_2, \quad S^2 = \theta_1\theta_2^2

and solve to get estimates \hat\theta_i of \theta_i, i = 1, 2, namely

    \hat\theta_2 = \frac{S^2}{\bar{X}}, \quad \hat\theta_1 = \frac{\bar{X}}{\hat\theta_2} = \frac{\bar{X}^2}{S^2}.
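A sketch of these method-of-moments estimates applied to simulated gamma data (true \theta_1, \theta_2 chosen arbitrarily); S^2 uses the divisor n, as in Lecture 4.

import numpy as np

rng = np.random.default_rng(11)
theta1, theta2 = 3.0, 2.0                    # true alpha and beta
x = rng.gamma(shape=theta1, scale=theta2, size=50_000)

xbar, s2 = x.mean(), x.var()                 # sample mean and sample variance (divisor n)
theta2_hat = s2 / xbar
theta1_hat = xbar**2 / s2
print(theta1_hat, theta2_hat)                # close to 3 and 2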


    Problems

1. In this problem, X_1, \ldots, X_n are iid with density f_\theta(x) or probability function p_\theta(x), and you are asked to find the MLE of \theta.

(a) Poisson (\theta), \theta > 0.

(b) f_\theta(x) = \theta x^{\theta - 1}, 0 < x < 1, where \theta > 0. The probability is concentrated near the origin when \theta < 1.

(c) Exponential with parameter \theta, i.e., f_\theta(x) = (1/\theta)e^{-x/\theta}, x > 0, where \theta > 0.

(d) f_\theta(x) = (1/2)e^{-|x - \theta|}, where \theta and x are arbitrary real numbers.

(e) Translated exponential, i.e., f_\theta(x) = e^{-(x - \theta)}, where \theta is an arbitrary real number and x \ge \theta.

2. Let X_1, \ldots, X_n be iid, each uniformly distributed between \theta - (1/2) and \theta + (1/2). Find more than one MLE of \theta (so MLEs are not necessarily unique).

3. In each part of Problem 1, calculate E(X_i) and derive an estimate based on the method of moments by setting the sample mean equal to the true mean. In each case, show that the estimate is consistent.

4. Let X be exponential with parameter \theta, as in Problem 1(c). If r > 0, find the MLE of P\{X > r\}.

5. If X is binomial (n, \theta) and a and b are integers with 0 \le a \le b \le n, find the MLE of P\{a \le X \le b\}.


    Lecture 10. Confidence Intervals

    10.1 Predicting an Election

There are two candidates A and B. If a voter is selected at random, the probability that the voter favors A is p, where p is fixed but unknown. We select n voters independently and ask their preference.

The number Y_n of A voters is binomial (n, p), which (for sufficiently large n) is approximately normal with \mu = np and \sigma^2 = np(1-p). The relative frequency of A voters is Y_n/n. We wish to estimate the minimum value of n such that we can predict A's percentage of the vote within 1 percent, with 95 percent confidence. Thus we want

    P\left\{ \left| \frac{Y_n}{n} - p \right| < .01 \right\} > .95.

Note that |(Y_n/n) - p| < .01 means that p is within .01 of Y_n/n. So this inequality can be written as

    \frac{Y_n}{n} - .01 < p < \frac{Y_n}{n} + .01.

Thus the probability that the random interval I_n = ((Y_n/n) - .01, (Y_n/n) + .01) contains the true probability p is greater than .95. We say that I_n is a 95 percent confidence interval for p.

In general, we find confidence intervals by calculating or estimating the probability of the event that is to occur with the desired level of confidence. In this case,

    P\left\{ \left| \frac{Y_n}{n} - p \right| < .01 \right\} = P\{|Y_n - np| < .01n\} = P\left\{ \frac{|Y_n - np|}{\sqrt{np(1-p)}} < \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} \right\}

and this is approximately

    \Phi\left( \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} \right) - \Phi\left( \frac{-.01\sqrt{n}}{\sqrt{p(1-p)}} \right) = 2\Phi\left( \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} \right) - 1 > .95

where \Phi is the normal (0,1) distribution function. Since 1.95/2 = .975 and \Phi(1.96) = .975, we have

    \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} > 1.96, \quad n > (196)^2\,p(1-p).

But (by calculus) p(1-p) is maximized when 1 - 2p = 0, p = 1/2, p(1-p) = 1/4. Thus n > (196)^2/4 = (98)^2 = (100 - 2)^2 = 10000 - 400 + 4 = 9604.

If we want to get within one tenth of one percent (.001) of p with 99 percent confidence, we repeat the above analysis with .01 replaced by .001, 1.99/2 = .995 and \Phi(2.6) = .995. Thus

    \frac{.001\sqrt{n}}{\sqrt{p(1-p)}} > 2.6, \quad n > (2600)^2/4 = (1300)^2 = 1{,}690{,}000.


To get within 3 percent with 95 percent confidence, we have

.03√n/√(p(1 − p)) > 1.96,   n > (196/3)²(1/4) ≈ 1067.
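The three sample-size calculations above are easy to mechanize. The following Python sketch is my own (not from the notes); it uses scipy for the normal quantile and the worst case p(1 − p) = 1/4, so small differences from the text come from the text rounding z to 1.96 or 2.6.

import math
from scipy.stats import norm

def min_sample_size(margin, confidence):
    """Smallest n with 2*Phi(margin*sqrt(n)/sqrt(p(1-p))) - 1 > confidence, worst case p(1-p) = 1/4."""
    z = norm.ppf((1 + confidence) / 2)     # e.g. about 1.96 for 95 percent confidence
    return math.ceil((z / (2 * margin)) ** 2)

print(min_sample_size(0.01, 0.95))    # 9604
print(min_sample_size(0.001, 0.99))   # about 1.66 million (the text rounds z up to 2.6, giving 1,690,000)
print(min_sample_size(0.03, 0.95))    # 1068 (the text reports 1067 before rounding up)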

If the experiment is repeated independently a large number of times, it is very likely that our result will be within .03 of the true probability p at least 95 percent of the time. The usual statement "The margin of error of this poll is 3%" does not capture this idea.

Note that the accuracy of the prediction depends only on the number of voters polled and not on the total number of voters in the population. But the model assumes sampling with replacement. (Theoretically, the same voter can be polled more than once since the voters are selected independently.) In practice, sampling is done without replacement, but if the number n of voters polled is small relative to the population size N, the error is very small.

The normal approximation to the binomial (based on the central limit theorem) is quite reliable, and is used in practice even for modest values of n; see (8.4).

    10.2 Estimating the Mean of a Normal Population

Let X1, . . . , Xn be iid, each normal (θ, σ²). We will find a confidence interval for θ.

Case 1. The variance σ² is known. Then X̄ is normal (θ, σ²/n), so

(X̄ − θ)/(σ/√n) is normal (0,1),

hence

P{ −b < √n(X̄ − θ)/σ < b } = Φ(b) − Φ(−b) = 2Φ(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bσ/√n < θ < X̄ + bσ/√n.

We choose a symmetrical interval to minimize the length, because the normal density with zero mean is symmetric about 0. The desired confidence level determines b, which then determines the confidence interval.
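For concreteness, here is a minimal Python sketch of Case 1 (my own illustration; the function name, the data and the 95 percent level are arbitrary choices):

import math
from scipy.stats import norm

def mean_ci_known_sigma(x, sigma, confidence=0.95):
    """Confidence interval for the mean theta when the variance sigma^2 is known."""
    n = len(x)
    xbar = sum(x) / n
    b = norm.ppf((1 + confidence) / 2)        # 2*Phi(b) - 1 = confidence
    half = b * sigma / math.sqrt(n)
    return xbar - half, xbar + half

print(mean_ci_known_sigma([4.1, 5.3, 4.7, 5.0, 4.4], sigma=0.5))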

Case 2. The variance σ² is unknown. Recall from (5.1) that

(X̄ − θ)/(S/√(n − 1)) is T(n − 1),

hence

P{ −b < (X̄ − θ)/(S/√(n − 1)) < b } = 2F_T(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bS/√(n − 1) < θ < X̄ + bS/√(n − 1).
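A companion sketch for Case 2 (again my own illustration). The text's S² has n in the denominator, so the half-width is bS/√(n − 1); scipy's t.ppf supplies b.

import math
from scipy.stats import t

def mean_ci_unknown_sigma(x, confidence=0.95):
    """Confidence interval for the mean theta when the variance is unknown."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / n)   # S with n in the denominator
    b = t.ppf((1 + confidence) / 2, df=n - 1)            # 2*F_T(b) - 1 = confidence
    half = b * s / math.sqrt(n - 1)
    return xbar - half, xbar + half

print(mean_ci_unknown_sigma([4.1, 5.3, 4.7, 5.0, 4.4]))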


    10.3 A Correction Factor When Sampling Without Replacement

The following results will not be used and may be omitted, but it is interesting to measure quantitatively the effect of sampling without replacement. In the election prediction problem, let Xi be the indicator of success (i.e., selecting an A voter) on trial i. Then P{Xi = 1} = p and P{Xi = 0} = 1 − p. If sampling is done with replacement, then the Xi are independent and the total number X = X1 + · · · + Xn of A voters in the sample is binomial (n, p). Thus the variance of X is np(1 − p). However, if sampling is done without replacement, then in effect we are drawing n balls from an urn containing N balls (where N is the size of the population), with Np balls labeled A and N(1 − p) labeled B. Recall from basic probability theory that

Var X = Σ_{i=1}^n Var Xi + 2 Σ_{i<j} Cov(Xi, Xj).

Carrying out this computation for sampling without replacement gives Var X = np(1 − p)(N − n)/(N − 1), so the with-replacement variance np(1 − p) is multiplied by a correction factor (N − n)/(N − 1), which is close to 1 when n is small relative to N.


    Problems

1. In the normal case [see (10.2)], assume that σ² is known. Explain how to compute the length of the confidence interval for θ.

2. Continuing Problem 1, assume that σ² is unknown. Explain how to compute the length of the confidence interval for θ, in terms of the sample standard deviation S.

3. Continuing Problem 2, explain how to compute the expected length of the confidence interval for θ, in terms of the unknown standard deviation σ. (Note that when σ is unknown, we expect a larger interval since we have less information.)

4. Let X1, . . . , Xn be iid, each gamma with parameters α and β. If α is known, explain how to compute a confidence interval for the mean μ = αβ.

5. In the binomial case [see (10.1)], suppose we specify the level of confidence and the length of the confidence interval. Explain how to compute the minimum value of n.


    Lecture 11. More Confidence Intervals

    11.1 Differences of Means

Let X1, . . . , Xn be iid, each normal (μ1, σ²), and let Y1, . . . , Ym be iid, each normal (μ2, σ²). Assume that (X1, . . . , Xn) and (Y1, . . . , Ym) are independent. We will construct a confidence interval for μ1 − μ2. In practice, the interval is often used in the following way. If the interval lies entirely to the left of 0, we have reason to believe that μ1 < μ2.

Since Var(X̄ − Ȳ) = Var X̄ + Var Ȳ = (σ²/n) + (σ²/m),

(X̄ − Ȳ − (μ1 − μ2)) / (σ√(1/n + 1/m)) is normal (0,1).

Also, nS1²/σ² is χ²(n − 1) and mS2²/σ² is χ²(m − 1). But χ²(r) is the sum of squares of r independent normal (0,1) random variables, so

nS1²/σ² + mS2²/σ² is χ²(n + m − 2).

Thus if

R = √[ ((nS1² + mS2²)/(n + m − 2)) (1/n + 1/m) ]

then

T = (X̄ − Ȳ − (μ1 − μ2)) / R is T(n + m − 2).

Our assumption that both populations have the same variance σ² is crucial, because the unknown variance can be cancelled.

If P{−b < T < b} = .95 we get a 95 percent confidence interval for μ1 − μ2:

−b < (X̄ − Ȳ − (μ1 − μ2)) / R < b

or

(X̄ − Ȳ) − bR < μ1 − μ2 < (X̄ − Ȳ) + bR.

If the variances σ1² and σ2² are known but possibly unequal, then

(X̄ − Ȳ − (μ1 − μ2)) / √(σ1²/n + σ2²/m)

is normal (0,1). If R0 is the denominator of the above fraction, we can get a 95 percent confidence interval as before: Φ(b) − Φ(−b) = 2Φ(b) − 1 > .95,

(X̄ − Ȳ) − bR0 < μ1 − μ2 < (X̄ − Ȳ) + bR0.
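A minimal Python sketch of the pooled interval for μ1 − μ2 (my own illustration, with made-up data); it follows the text's convention that S1² and S2² have n and m in their denominators.

import math
from scipy.stats import t

def diff_of_means_ci(x, y, confidence=0.95):
    """Confidence interval for mu1 - mu2 under equal (unknown) variances."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    nS1sq = sum((v - xbar) ** 2 for v in x)      # n * S1^2
    mS2sq = sum((v - ybar) ** 2 for v in y)      # m * S2^2
    R = math.sqrt((nS1sq + mS2sq) / (n + m - 2) * (1 / n + 1 / m))
    b = t.ppf((1 + confidence) / 2, df=n + m - 2)
    d = xbar - ybar
    return d - b * R, d + b * R

print(diff_of_means_ci([5.1, 4.9, 5.4, 5.0], [4.2, 4.6, 4.4, 4.8, 4.5]))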


    11.2 Example

Let Y1 and Y2 be binomial (n1, p1) and (n2, p2) respectively. Then

Y1 = X1 + · · · + Xn1 and Y2 = Z1 + · · · + Zn2

where the Xi and Zj are indicators of success on trials i and j respectively. Assume that X1, . . . , Xn1, Z1, . . . , Zn2 are independent. Now E(Y1/n1) = p1 and Var(Y1/n1) = n1p1(1 − p1)/n1² = p1(1 − p1)/n1, with similar formulas for Y2/n2. Thus for large n,

Y1/n1 − Y2/n2 − (p1 − p2)

divided by

√( p1(1 − p1)/n1 + p2(1 − p2)/n2 )

is approximately normal (0,1). But this expression cannot be used to construct confidence intervals for p1 − p2 because the denominator involves the unknown quantities p1 and p2. However, Y1/n1 converges in probability to p1 and Y2/n2 converges in probability to p2, and this justifies replacing p1 by Y1/n1 and p2 by Y2/n2 in the denominator.
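A hedged sketch of the resulting approximate interval for p1 − p2 (my own code; the observed counts are made up):

import math
from scipy.stats import norm

def diff_of_proportions_ci(y1, n1, y2, n2, confidence=0.95):
    """Approximate CI for p1 - p2, with p1 and p2 replaced by Y1/n1 and Y2/n2 in the denominator."""
    p1hat, p2hat = y1 / n1, y2 / n2
    se = math.sqrt(p1hat * (1 - p1hat) / n1 + p2hat * (1 - p2hat) / n2)
    b = norm.ppf((1 + confidence) / 2)
    d = p1hat - p2hat
    return d - b * se, d + b * se

print(diff_of_proportions_ci(y1=520, n1=1000, y2=480, n2=1000))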

    11.3 The Variance

We will construct confidence intervals for the variance of a normal population. Let X1, . . . , Xn be iid, each normal (μ, σ²), so that nS²/σ² is χ²(n − 1). If h_{n−1} is the χ²(n − 1) density and a and b are chosen so that ∫_a^b h_{n−1}(x) dx = 1 − α, then

P{ a < nS²/σ² < b } = 1 − α.

But a < nS²/σ² < b if and only if nS²/b < σ² < nS²/a, so (nS²/b, nS²/a) is a confidence interval for σ² with confidence level 1 − α. If the mean μ is known, then Σ_{i=1}^n (Xi − μ)²/σ² is χ²(n).


So if

W = Σ_{i=1}^n (Xi − μ)²

and we choose a and b so that ∫_a^b h_n(x) dx = 1 − α, where h_n is the χ²(n) density, then P{a < W/σ² < b} = 1 − α. The inequality defining the confidence interval can be written as

W/b < σ² < W/a.
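Going back to the unknown-mean interval (nS²/b, nS²/a) derived at the start of this section, here is a minimal Python sketch (my own; a and b are chosen with equal tail areas, which is one possible choice, not the only one):

from scipy.stats import chi2

def variance_ci(x, confidence=0.95):
    """CI for sigma^2 from n*S^2/sigma^2 ~ chi-square(n - 1), with equal-tailed a and b."""
    n = len(x)
    xbar = sum(x) / n
    nS2 = sum((v - xbar) ** 2 for v in x)        # n * S^2
    alpha = 1 - confidence
    a = chi2.ppf(alpha / 2, df=n - 1)
    b = chi2.ppf(1 - alpha / 2, df=n - 1)
    return nS2 / b, nS2 / a

print(variance_ci([4.1, 5.3, 4.7, 5.0, 4.4, 5.2, 4.8]))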

Lecture 12. Hypothesis Testing

We test a null hypothesis H0 : θ ∈ A0 against an alternative H1 : θ ∈ A1, where θ is an unknown parameter and A0, A1 are disjoint sets of possible parameter values. For example, if θ measures the improvement produced by a new drug, then θ > 0 might mean that the drug is a significant improvement.

We observe x and make a decision via φ(x) = 0 or 1. There are two types of errors. A type 1 error occurs if H0 is true but φ(x) = 1, in other words, we declare that H1 is true. Thus in a type 1 error, we reject H0 when it is true.

A type 2 error occurs if H0 is false but φ(x) = 0, i.e., we declare that H0 is true. Thus in a type 2 error, we accept H0 when it is false.

If H0 [resp. H1] means that a patient does not have [resp. does have] a particular disease, then a type 1 error is also called a false positive, and a type 2 error is also called a false negative.

If φ(x) is always 0, then a type 1 error can never occur, but a type 2 error will always occur. Symmetrically, if φ(x) is always 1, then there will always be a type 1 error, but never an error of type 2. Thus by ignoring the data altogether we can reduce one of the error probabilities to zero. To get both error probabilities to be small, in practice we must increase the sample size.

We say that H0 [resp. H1] is simple if A0 [resp. A1] contains only one element, composite if A0 [resp. A1] contains more than one element. So in the case of simple hypothesis vs. simple alternative, we are testing θ = θ0 vs. θ = θ1. The standard example is to test the hypothesis that X has density f0 vs. the alternative that X has density f1.

    12.2 Likelihood Ratio Tests

In the case of simple hypothesis vs. simple alternative, if we require that the probability of a type 1 error be at most α and try to minimize the probability of a type 2 error, the optimal test turns out to be a likelihood ratio test (LRT), defined as follows. Let L(x), the likelihood ratio, be f1(x)/f0(x), and let λ be a constant. If L(x) > λ, reject H0; if L(x) < λ, accept H0; if L(x) = λ, do anything.

Intuitively, if what we have observed seems significantly more likely under H1, we will tend to reject H0. If H0 or H1 is composite, there is no general optimality result as there is in the simple vs. simple case. In this situation, we resort to basic statistical philosophy:

If, assuming that H0 is true, we witness a rare event, we tend to reject H0.

The statement that LRTs are optimal is the Neyman-Pearson lemma, to be proved at the end of the lecture. In many common examples (normal, Poisson, binomial, exponential), L(x1, . . . , xn) can be expressed as a function of the sum of the observations, or equivalently as a function of the sample mean. This motivates consideration of tests based on Σ_{i=1}^n Xi or on X̄.


    12.3 Example

Let X1, . . . , Xn be iid, each normal (θ, σ²). We will test H0 : θ ≤ θ0 vs. H1 : θ > θ0. Under H1, X̄ will tend to be larger, so let's reject H0 when X̄ > c. The power function of the test is defined by

K(θ) = Pθ{reject H0},

the probability of rejecting the null hypothesis when the true parameter is θ. In this case,

Pθ{X̄ > c} = Pθ{ (X̄ − θ)/(σ/√n) > (c − θ)/(σ/√n) } = 1 − Φ( (c − θ)/(σ/√n) )

(see Figure 12.1). Suppose that we specify the probability α of a type 1 error when θ = θ1, and the probability β of a type 2 error when θ = θ2. Then

K(θ1) = 1 − Φ( (c − θ1)/(σ/√n) ) = α

and

K(θ2) = 1 − Φ( (c − θ2)/(σ/√n) ) = 1 − β.

If α, β, σ, θ1 and θ2 are known, we have two equations that can be solved for c and n.

Figure 12.1 (the power function K(θ), increasing toward 1, with θ1, θ0 and θ2 marked on the θ-axis)

The critical region is the set of observations that lead to rejection. In this case, it is {(x1, . . . , xn) : (1/n) Σ_{i=1}^n xi > c}.

The significance level is the largest type 1 error probability. Here it is K(θ0), since K(θ) increases with θ.
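The two equations for c and n can be solved in closed form: subtracting them gives √n = σ(z_{1−α} + z_{1−β})/(θ2 − θ1), where Φ(z_q) = q. A minimal Python sketch (my own, with made-up values of α, β, σ, θ1, θ2):

import math
from scipy.stats import norm

def solve_c_and_n(alpha, beta, sigma, theta1, theta2):
    """Solve K(theta1) = alpha and K(theta2) = 1 - beta for the test that rejects when X-bar > c."""
    za = norm.ppf(1 - alpha)                     # Phi(za) = 1 - alpha
    zb = norm.ppf(1 - beta)                      # Phi(zb) = 1 - beta
    n = math.ceil((sigma * (za + zb) / (theta2 - theta1)) ** 2)
    c = theta1 + za * sigma / math.sqrt(n)       # from K(theta1) = alpha
    return c, n

# made-up values: alpha = .05 at theta1 = 0, beta = .10 at theta2 = 1, sigma = 2
print(solve_c_and_n(0.05, 0.10, 2.0, 0.0, 1.0))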

    12.4 Example

Let H0 : X is uniformly distributed on (0,1), so f0(x) = 1, 0 < x < 1, and 0 elsewhere. Let H1 : f1(x) = 3x², 0 < x < 1, and 0 elsewhere. We take only one observation, and reject H0 if x > c, where 0 < c < 1. Then

K(0) = P0{X > c} = 1 − c,   K(1) = P1{X > c} = ∫_c^1 3x² dx = 1 − c³.


If we specify the probability α of a type 1 error, then α = 1 − c, which determines c. If β is the probability of a type 2 error, then 1 − β = 1 − c³, so β = c³. Thus (see Figure 12.2)

β = (1 − α)³.

If α = .05 then β = (.95)³ ≈ .86, which indicates that you usually can't do too well with only one observation.

Figure 12.2 (β = (1 − α)³ as a function of α on (0,1))

    12.5 Tests Derived From Confidence Intervals

Let X1, . . . , Xn be iid, each normal (θ0, σ²). In Lecture 10, we found a confidence interval for θ0, assuming σ² unknown, via

P{ −b < (X̄ − θ0)/(S/√(n − 1)) < b } = 2F_T(b) − 1, where T = (X̄ − θ0)/(S/√(n − 1))

has the T distribution with n − 1 degrees of freedom.

Say 2F_T(b) − 1 = .95, so that

P{ |X̄ − θ0|/(S/√(n − 1)) ≥ b } = .05.

If θ actually equals θ0, we are witnessing an event of low probability. So it is natural to test θ = θ0 vs. θ ≠ θ0 by rejecting if

|X̄ − θ0|/(S/√(n − 1)) ≥ b,

in other words, if θ0 does not belong to the confidence interval. As the true mean θ moves away from θ0 in either direction, the probability of this event will increase, since X̄ − θ0 = (X̄ − θ) + (θ − θ0).

Tests of θ = θ0 vs. θ ≠ θ0 are called two-sided, as opposed to θ = θ0 vs. θ > θ0 (or θ = θ0 vs. θ < θ0), which are one-sided. In the present case, if we test θ = θ0 vs. θ > θ0, we reject if

(X̄ − θ0)/(S/√(n − 1)) ≥ b.

The power function K(θ) is difficult to compute for θ ≠ θ0, because (X̄ − θ0)/(σ/√n) no longer has mean zero. The noncentral T distribution becomes involved.
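A minimal sketch of the resulting two-sided test (my own code); it is equivalent to checking whether θ0 lies outside the confidence interval.

import math
from scipy.stats import t

def two_sided_t_test(x, theta0, level=0.05):
    """Reject theta = theta0 if |X-bar - theta0| / (S / sqrt(n - 1)) >= b, with 2*F_T(b) - 1 = 1 - level."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / n)   # S with n in the denominator
    stat = abs(xbar - theta0) / (s / math.sqrt(n - 1))
    b = t.ppf(1 - level / 2, df=n - 1)
    return stat >= b                                      # True means reject H0

print(two_sided_t_test([5.1, 4.9, 5.6, 5.3, 5.4, 5.2], theta0=5.0))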


    12.6 The Neyman-Pearson Lemma

Assume that we are testing the simple hypothesis that X has density f0 vs. the simple alternative that X has density f1. Let φ be an LRT with parameter λ (a nonnegative constant); in other words, φ(x) is the probability of rejecting H0 when x is observed, and

φ(x) = 1 if L(x) > λ,
φ(x) = 0 if L(x) < λ,
φ(x) = anything if L(x) = λ.

Suppose that the probability of a type 1 error using φ is α, and the probability of a type 2 error is β. Let φ* be an arbitrary test with error probabilities α* and β*. If α* ≤ α then β* ≥ β. In other words, the LRT has maximum power among all tests at significance level α.

Proof. We are going to assume that f0 and f1 are one-dimensional, but the argument works equally well when X = (X1, . . . , Xn) and the fi are n-dimensional joint densities. We recall from basic probability theory the theorem of total probability, which says that if X has density f, then for any event A,

P(A) = ∫ P(A | X = x) f(x) dx.

A companion theorem which we will also use later is the theorem of total expectation, which says that if X has density f, then for any random variable Y,

E(Y) = ∫ E(Y | X = x) f(x) dx.

By the theorem of total probability,

α = ∫ φ(x)f0(x) dx,   1 − β = ∫ φ(x)f1(x) dx

and similarly

α* = ∫ φ*(x)f0(x) dx,   1 − β* = ∫ φ*(x)f1(x) dx.

We claim that for all x,

[φ(x) − φ*(x)][f1(x) − λf0(x)] ≥ 0.

For if f1(x) > λf0(x) then L(x) > λ, so φ(x) = 1 ≥ φ*(x), and if f1(x) < λf0(x) then L(x) < λ, so φ(x) = 0 ≤ φ*(x), proving the assertion. Now if a function is always nonnegative, its integral must be nonnegative, so

∫ [φ(x) − φ*(x)][f1(x) − λf0(x)] dx ≥ 0.


The terms involving f0 translate to statements about type 1 errors, and the terms involving f1 translate to statements about type 2 errors. Thus

(1 − β) − (1 − β*) − λ(α − α*) ≥ 0,

which says that β* − β ≥ λ(α − α*) ≥ 0, completing the proof.
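To make the lemma concrete, here is a small Monte Carlo check (entirely my own) for the simple-vs-simple setup of Example 12.4, where f0 is uniform on (0,1) and f1(x) = 3x²; rejecting when L(x) = 3x² > λ is the same as rejecting when x > √(λ/3).

import numpy as np

rng = np.random.default_rng(1)
lam = 3 * 0.95 ** 2                        # cutoff c = sqrt(lam / 3) = .95, so alpha = .05
x0 = rng.uniform(0, 1, 100_000)            # samples under H0 (uniform on (0,1))
x1 = rng.uniform(0, 1, 100_000) ** (1/3)   # samples under H1: F1(x) = x^3, inverse transform

alpha_hat = np.mean(3 * x0 ** 2 > lam)     # estimated P(reject H0 | H0 true)
beta_hat = np.mean(3 * x1 ** 2 <= lam)     # estimated P(accept H0 | H1 true)
print(alpha_hat, beta_hat)                 # roughly .05 and (.95)^3 = .857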

    12.7 Randomization

If L(x) = λ, then "do anything" means that randomization is possible, e.g., we can flip a possibly biased coin to decide whether or not to accept H0. (This may be significant in the discrete case, where L(x) = λ may have positive probability.) Statisticians tend to frown on this practice because two statisticians can look at exactly the same data and come to different conclusions. It is possible to adjust the significance level (by replacing "do anything" by a definite choice of either H0 or H1) to avoid randomization.

    Problems

1. Consider the problem of testing θ = θ0 vs. θ > θ0, where θ is the mean of a normal population with known variance. Assume that the sample size n is fixed. Show that the test given in Example 12.3 (reject H0 if X̄ > c) is uniformly most powerful. In other words, if we test θ = θ0 vs. θ = θ1 for any given θ1 > θ0, and we specify the probability of a type 1 error, then the probability of a type 2 error is minimized.

2. It is desired to test the null hypothesis that a die is unbiased vs. the alternative that the die is loaded, with faces 1 and 2 having probability 1/4 and faces 3, 4, 5 and 6 having probability 1/8. The die is to be tossed once. Find a most powerful test at level α = .1, and find the type 2 error probability β.

3. We wish to test a binomial random variable X with n = 400 and H0 : p = 1/2 vs. H1 : p > 1/2. The random variable Y = (X − np)/√(np(1 − p)) = (X − 200)/10 is approximately normal (0,1), and we will reject H0 if Y > c. If we specify α = .05, then c = 1.645. Thus the critical region is X > 216.45. Suppose the actual result is X = 220, so that H0 is rejected. Find the minimum value of α (sometimes called the p-value) at which the given data still lead to rejection; for smaller values of α we would reach the opposite conclusion (acceptance of H0).


    Lecture 13. Chi-Square Tests

    13.1 Introduction

Let X1, . . . , Xk be multinomial, i.e., Xi is the number of occurrences of the event Ai in n generalized Bernoulli trials (Lecture 6). Then

P{X1 = n1, . . . , Xk = nk} = [n!/(n1! · · · nk!)] p1^{n1} · · · pk^{nk}

where the ni are nonnegative integers whose sum is n. Consider k = 2. Then X1 is binomial (n, p1) and (X1 − np1)/√(np1(1 − p1)) is approximately normal (0,1). Consequently, the random variable (X1 − np1)²/np1(1 − p1) is approximately χ²(1). But

(X1 − np1)²/[np1(1 − p1)] = [(X1 − np1)²/n][1/p1 + 1/(1 − p1)] = (X1 − np1)²/np1 + (X2 − np2)²/np2.

(Note that since k = 2 we have p2 = 1 − p1 and X1 − np1 = n − X2 − np1 = np2 − X2 = −(X2 − np2), and the outer minus sign disappears when squaring.) Therefore [(X1 − np1)²/np1] + [(X2 − np2)²/np2] is approximately χ²(1). More generally, it can be shown that

Q = Σ_{i=1}^k (Xi − npi)²/npi is approximately χ²(k − 1),

where

(Xi − npi)²/npi = (observed frequency − expected frequency)²/(expected frequency).

    We will consider three types of chi-square tests.

    13.2 Goodness of Fit

We ask whether X has a specified distribution (normal, Poisson, etc.). The null hypothesis is that the multinomial probabilities are p = (p1, . . . , pk), and the alternative is that p ≠ (p1, . . . , pk).

Suppose that P{χ²(k − 1) > c} is at the desired level of significance (for example, .05). If Q > c we will reject H0. The idea is that if H0 is in fact true, we have witnessed a rare event, so rejection is reasonable. If H0 is false, it is reasonable to expect that some of the Xi will be far from npi, so Q will be large.

Some practical considerations: Take n large enough so that each npi ≥ 5. Each time a parameter is estimated from the sample, reduce the number of degrees of freedom by 1. (A typical case: The null hypothesis is that X is Poisson (λ), but the mean λ is unknown, and is estimated by the sample mean.)
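For reference, a minimal sketch of the goodness-of-fit computation (my own code; the counts are made up, for a die hypothesized to be fair):

from scipy.stats import chi2

def goodness_of_fit(observed, probs, alpha=0.05):
    """Compute Q = sum (observed - expected)^2 / expected and compare with chi-square(k - 1)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    Q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    c = chi2.ppf(1 - alpha, df=len(observed) - 1)
    return Q, c, Q > c                      # Q > c means reject H0

# made-up counts for 120 tosses of a die hypothesized to be fair
print(goodness_of_fit([18, 25, 16, 21, 26, 14], [1/6] * 6))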


    13.3 Equality of Distributions

We ask whether two or more samples come from the same underlying distribution. The observed results are displayed in a contingency table. This is an h × k matrix whose rows are the samples and whose columns are the attributes to be observed. For example, row i might be (7, 11, 15, 13, 4), with the interpretation that in a class of 50 students taught by method of instruction i, there were 7 grades of A, 11 of B, 15 of C, 13 of D and 4 of F. The null hypothesis H0 is that there is no difference between the various methods of instruction, i.e., P(A) is the same for each group, and similarly for the probabilities of the other grades. We estimate P(A) from the sample by adding all entries in column A and dividing by the total number of observations in the entire experiment. We estimate P(B), P(C), P(D) and P(F) in a similar fashion. The expected frequencies in row i are found by multiplying the grade probabilities by the number of entries in row i.

If there are h groups (samples), each with k attributes, then each group generates a chi-square (k − 1), and k − 1 probabilities are estimated from the sample (the last probability is determined). The number of degrees of freedom is h(k − 1) − (k − 1) = (h − 1)(k − 1), call it r. If P{χ²(r) > c} is the desired significance level, we reject H0 if the chi-square statistic is greater than c.

    13.4 Testing For Independence

Again we have a contingency table with h rows corresponding to the possible values xi of a random variable X, and k columns corresponding to the possible values yj of a random variable Y. We are testing the null hypothesis that X and Y are independent.

Let Ri be the sum of the entries in row i, and let Cj be the sum of the entries in column j. Then the sum of all observations is T = Σ_i Ri = Σ_j Cj. We estimate P{X = xi} by Ri/T, and P{Y = yj} by Cj/T. Under the independence hypothesis H0,

P{X = xi, Y = yj} = P{X = xi}P{Y = yj} = RiCj/T².

Thus the expected frequency of (xi, yj) is RiCj/T. (This gives another way to calculate the expected frequencies in (13.3). In that case, we estimated the j-th column probability by Cj/T, and multiplied by the sum of the entries in row i, namely Ri.)

In an h × k contingency table, the number of degrees of freedom is hk − 1 minus the number of estimated parameters:

hk − 1 − (h − 1 + k − 1) = hk − h − k + 1 = (h − 1)(k − 1).

The chi-square statistic is calculated as in (13.3). Similarly, if there are 3 attributes to be tested for independence and we form an h × k × m contingency table, the number of degrees of freedom is

hkm − 1 − [(h − 1) + (k − 1) + (m − 1)] = hkm − h − k − m + 2.
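A minimal sketch of the two-way independence test (my own code); the expected frequencies are RiCj/T and Q is compared with the χ²((h − 1)(k − 1)) critical value.

from scipy.stats import chi2

def independence_test(table, alpha=0.05):
    """Chi-square test of independence for an h x k contingency table of counts."""
    h, k = len(table), len(table[0])
    R = [sum(row) for row in table]                             # row sums R_i
    C = [sum(table[i][j] for i in range(h)) for j in range(k)]  # column sums C_j
    T = sum(R)
    Q = sum((table[i][j] - R[i] * C[j] / T) ** 2 / (R[i] * C[j] / T)
            for i in range(h) for j in range(k))
    c = chi2.ppf(1 - alpha, df=(h - 1) * (k - 1))
    return Q, c, Q > c                                          # Q > c means reject independence

print(independence_test([[20, 30, 25], [30, 20, 25]]))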

    Problems

1. Use a chi-square procedure to test the null hypothesis that a random variable X has the following distribution:

P{X = 1} = .5,   P{X = 2} = .3,   P{X = 3} = .2.


We take 100 independent observations of X, and it is observed that 1 occurs 40 times, 2 occurs 33 times, and 3 occurs 27 times. Determine whether or not we will reject the null hypothesis at significance level .05.

2. Use a chi-square test to decide (at significance level .05) whether the two samples corresponding to the rows of the contingency table below came from the same underlying distribution.

              A     B     C
  Sample 1   33   147   114
  Sample 2   67   153    86

3. Suppose we are testing for independence in a 2 × 2 contingency table

  a  b
  c  d

Show that the chi-square statistic is

(ad − bc)²(a + b + c + d) / [(a + b)(c + d)(a + c)(b + d)].

(The number of degrees of freedom is 1 × 1 = 1.)


    Lecture 14. Sufficient Statistics

    14.1 Definitions and Comments

Let X1, . . . , Xn be iid with P{Xi = 1} = θ and P{Xi = 0} = 1 − θ, so P{Xi = x} = θ^x(1 − θ)^{1−x}, x = 0, 1. Let Y be a statistic for θ, i.e., a function of the observables X1, . . . , Xn. In this case we take Y = X1 + · · · + Xn, the total number of successes in n Bernoulli trials with probability of success θ on a given trial.

We claim that the conditional distribution of X1, . . . , Xn given Y is free of θ, in other words, does not depend on θ. We say that Y is sufficient for θ.

To prove this, note that

P{X1 = x1, . . . , Xn = xn | Y = y} = P{X1 = x1, . . . , Xn = xn, Y = y} / P{Y = y}.

This is 0 unless y = x1 + · · · + xn, in which case we get

θ^y(1 − θ)^{n−y} / [(n choose y) θ^y(1 − θ)^{n−y}] = 1/(n choose y).

For example, if we know that there were 3 heads in 5 tosses, the probability that the actual tosses were HTHHT is 1/(5 choose 3) = 1/10.

    14.2 The Key Idea

For the purpose of making a statistical decision, we can ignore the individual random variables Xi and base the decision entirely on X1 + · · · + Xn.

Suppose that statistician A observes X1, . . . , Xn and makes a decision. Statistician B observes Y = X1 + · · · + Xn only, and constructs X'1, . . . , X'n according to the conditional distribution of X1, . . . , Xn given Y, i.e.,

P{X'1 = x1, . . . , X'n = xn | Y = y} = 1/(n choose y).

This construction is possible because the conditional distribution does not depend on the unknown parameter θ. We will show that under θ, (X'1, . . . , X'n) and (X1, . . . , Xn) have exactly the same distribution, so anything A can do, B can do at least as well, even though B has less information.

Given x1, . . . , xn, let y = x1 + · · · + xn. The only way we can have X'1 = x1, . . . , X'n = xn is if Y = y and then B's experiment produces X'1 = x1, . . . , X'n = xn given y. Thus

P{X'1 = x1, . . . , X'n = xn} = P{Y = y} P{X'1 = x1, . . . , X'n = xn | Y = y}

= (n choose y) θ^y(1 − θ)^{n−y} · [1/(n choose y)] = θ^y(1 − θ)^{n−y} = P{X1 = x1, . . . , Xn = xn}.
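A small simulation (my own) of statistician B's construction in the Bernoulli case: given Y = y, B scatters the y successes uniformly at random among the n positions, so every arrangement has probability 1/(n choose y), and the reconstructed vector has the same distribution as the original data.

import numpy as np

rng = np.random.default_rng(2)

def statistician_B(y, n):
    """Reconstruct a Bernoulli sample given only Y = y: all C(n, y) arrangements are equally likely."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=y, replace=False)] = 1
    return x

theta, n = 0.3, 5
original = rng.binomial(1, theta, size=n)        # what statistician A sees
reconstructed = statistician_B(original.sum(), n)
print(original, reconstructed)                   # same joint distribution under theta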


    14.3 The Factorization Theorem

Let Y = u(X) be a statistic for θ (X can be (X1, . . . , Xn), and usually is). Then Y is sufficient for θ if and only if the density fθ(x) of X under θ can be factored as fθ(x) = g(θ, u(x))h(x).

[In the Bernoulli case, fθ(x1, . . . , xn) = θ^y(1 − θ)^{n−y} where y = u(x) = Σ_{i=1}^n xi and h(x) = 1.]

Proof. (Discrete case). If Y is sufficient, then

Pθ{X = x} = Pθ{X = x, Y = u(x)} = Pθ{Y = u(x)} P{X = x | Y = u(x)} = g(θ, u(x))h(x).

Conversely, assume fθ(x) = g(θ, u(x))h(x). Then

Pθ{X = x | Y = y} = Pθ{X = x, Y = y} / Pθ{Y = y}.

This is 0 unless y = u(x), in which case it becomes

Pθ{X = x} / Pθ{Y = y} = g(θ, u(x))h(x) / Σ_{z: u(z)=y} g(θ, u(z))h(z).

The g terms in both numerator and denominator are g(θ, y), which can be cancelled to obtain

P{X = x | Y = y} = h(x) / Σ_{z: u(z)=y} h(z),

which is free of θ.

    14.4 Example

Let X1, . . . , Xn be iid, each normal (μ, σ²), so that

f(x1, . . . , xn) = (1/(σ√(2π)))^n exp[ −(1/(2σ²)) Σ_{i=1}^n (xi − μ)² ].

Take θ = (μ, σ²) and let x̄ = (1/n) Σ_{i=1}^n xi, s² = (1/n) Σ_{i=1}^n (xi − x̄)². Then

xi − x̄ = (xi − μ) − (x̄ − μ)

and

s² = (1/n)[ Σ_{i=1}^n (xi − μ)² − 2(x̄ − μ) Σ_{i=1}^n (xi − μ) + n(x̄ − μ)² ].


Thus

s² = (1/n) Σ_{i=1}^n (xi − μ)² − (x̄ − μ)².

The joint density is given by

f(x1, . . . , xn) = (2πσ²)^{−n/2} e^{−ns²/2σ²} e^{−n(x̄−μ)²/2σ²}.

If μ and σ² are both unknown then (X̄, S²) is sufficient (take h(x) = 1). If σ² is known then we can take h(x) = (2πσ²)^{−n/2} e^{−ns²/2σ²}, θ = μ, and X̄ is sufficient. If μ is known then (with h(x) = 1) θ = σ² and Σ_{i=1}^n (Xi − μ)² is sufficient.

    Problems

In Problems 1-6, show that the given statistic u(X) = u(X1, . . . , Xn) is sufficient for θ and find appropriate functions g and h for the factorization theorem to apply.

1. The Xi are Poisson (θ) and u(X) = X1 + · · · + Xn.

2. The Xi have density A(θ)B(xi), 0 < xi < θ (and 0 elsewhere), where θ is a positive real number; u(X) = max Xi. As a special case, the Xi are uniformly distributed between 0 and θ, and A(θ) = 1/θ, B(xi) = 1 on (0, θ).

3. The Xi are geometric with parameter θ, i.e., if θ is the probability of success on a given Bernoulli trial, then P{Xi = x} = (1 − θ)^x θ is the probability that there will be x failures followed by the first success; u(X) = Σ_{i=1}^n Xi.

4. The Xi have the exponential density (1/θ)e^{−x/θ}, x > 0, and u(X) = Σ_{i=1}^n Xi.

5. The Xi have the beta density with parameters a = θ and b = 2, and u(X) = Π_{i=1}^n Xi.

6. The Xi have the gamma density with parameters α = θ and β, an arbitrary positive number, and u(X) = Π_{i=1}^n Xi.

7. Show that the result in (14.2), that statistician B can do at least as well as statistician A, holds in the general case of arbitrary iid random variables Xi.


    Lecture 15. Rao-Blackwell Theorem

    15.1 Background From Basic Probability

To better understand the steps leading to the Rao-Blackwell theorem, consider a typical two stage experiment:

Step 1. Observe a random variable X with density (1/2)x²e^{−x}, x > 0.

Step 2. If X = x, let Y be uniformly distributed on (0, x).

Find E(Y).

Method 1, via the joint density:

f(x, y) = fX(x)fY(y|x) = (1/2)x²e^{−x} (1/x) = (1/2)xe^{−x},   0 < y < x.

In general, E[g(X, Y)] = ∫∫ g(x, y)f(x, y) dx dy. In this case, g(x, y) = y and

E(Y) = ∫_{x=0}^∞ ∫_{y=0}^x y (1/2)xe^{−x} dy dx = ∫_0^∞ (x³/4)e^{−x} dx = 3!/4 = 3/2.

Method 2, via the theorem of total expectation:

E(Y) = ∫ fX(x) E(Y | X = x) dx.

Method 2 works well when the conditional expectation is easy to compute. In this case it is x/2 by inspection. Thus

E(Y) = ∫_0^∞ (1/2)x²e^{−x} (x/2) dx = 3/2, as before.
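A quick Monte Carlo sanity check of E(Y) = 3/2 (my own sketch); the density (1/2)x²e^{−x} is the gamma density with α = 3, β = 1.

import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(shape=3.0, scale=1.0, size=200_000)  # density (1/2) x^2 e^{-x}, x > 0
y = rng.uniform(0.0, x)                            # given X = x, Y is uniform on (0, x)
print(y.mean())                                    # close to 3/2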

    15.2 Comment On Notation

If, for example, it turns out that E(Y | X = x) = x² + 3x + 4, we can write E(Y | X) = X² + 3X + 4. Thus E(Y | X) is a function g(X) of the random variable X. When X = x we have g(x) = E(Y | X = x).

    We now proceed to the Rao-Blackwell theorem via several preliminary lemmas.

    15.3 Lemma

E[E(X2 | X1)] = E(X2).

    Proof. Let g(X1) = E(X2|X1). Then

E[g(X1)] = ∫ g(x)f1(x) dx = ∫ E(X2 | X1 = x)f1(x) dx = E(X2)

    by the theorem of total expectation.


    15.4 Lemma

If μi = E(Xi), i = 1, 2, then E[{X2 − E(X2|X1)}{E(X2|X1) − μ2}] = 0.

Proof. The expectation is

∫∫ [x2 − E(X2|X1 = x1)][E(X2|X1 = x1) − μ2] f1(x1)f2(x2|x1) dx1 dx2

= ∫ f1(x1)[E(X2|X1 = x1) − μ2] { ∫ [x2 − E(X2|X1 = x1)] f2(x2|x1) dx2 } dx1.

The inner integral (with respect to x2) is E(X2|X1 = x1) − E(X2|X1 = x1) = 0, and the result follows.

    15.5 Lemma

Var X2 ≥ Var[E(X2|X1)].

Proof. We have

Var X2 = E[(X2 − μ2)²] = E[ ({X2 − E(X2|X1)} + {E(X2|X1) − μ2})² ]