

    Lectures On Statistics

    Robert B. Ash

    Preface

    These notes are based on a course that I gave at UIUC in 1996 and again in 1997. No

    prior knowledge of statistics is assumed. A standard first course in probability is a pre-

    requisite, but the first 8 lectures review results from basic probability that are important

    in statistics. Some exposure to matrix algebra is needed to cope with the multivariate

    normal distribution in Lecture 21, and there is a linear algebra review in Lecture 19. Here

    are the lecture titles:

    1. Transformation of Random Variables

    2. Jacobians

    3. Moment-generating functions

    4. Sampling from a normal population

    5. The T and F distributions

    6. Order statistics

    7. The weak law of large numbers

    8. The central limit theorem

    9. Estimation

    10. Confidence intervals

    11. More confidence intervals

    12. Hypothesis testing

    13. Chi square tests

    14. Sufficient statistics

    15. Rao-Blackwell theorem

    16. Lehmann-Scheffe theorem

    17. Complete sufficient statistics for the exponential class

    18. Bayes estimates

    19. Linear algebra review

    20. Correlation

    21. The multivariate normal distribution

    22. The bivariate normal distribution

    23. Cramer-Rao inequality

    24. Nonparametric statistics

    25. The Wilcoxon test

    Copyright © 2007 by Robert B. Ash. Paper or electronic copies for personal use may be

    made freely without explicit permission of the author. All other rights are reserved.


    Lecture 1. Transformation of Random Variables

Suppose we are given a random variable X with density f_X(x). We apply a function g to produce a random variable Y = g(X). We can think of X as the input to a black box, and Y the output. We wish to find the density or distribution function of Y. We illustrate the technique for the example in Figure 1.1.

[Figure 1.1: the density f_X(x), equal to 1/2 on [-1, 0] and to (1/2)e^{-x} for x > 0, together with the transformation Y = X^2; the points -\sqrt{y} and \sqrt{y} are marked on the x-axis.]

The distribution function method finds F_Y directly, and then f_Y by differentiation. We have F_Y(y) = 0 for y < 0. If 0 \le y \le 1 (Figure 1.2), then

    F_Y(y) = P\{-\sqrt{y} \le X \le \sqrt{y}\} = \frac{1}{2}\sqrt{y} + \frac{1}{2}(1 - e^{-\sqrt{y}}).

If y > 1 (Figure 1.3), then

    F_Y(y) = \frac{1}{2} + \int_0^{\sqrt{y}} \frac{1}{2}e^{-x}\,dx = \frac{1}{2} + \frac{1}{2}(1 - e^{-\sqrt{y}}).

[Figure 1.3: the density f_X(x), with the interval from -\sqrt{y} to \sqrt{y} marked on the x-axis for y > 1.]

The density of Y is 0 for y < 0, and differentiating F_Y on each range gives

    f_Y(y) = \frac{1}{4\sqrt{y}}\,(1 + e^{-\sqrt{y}}), \quad 0 < y < 1, \qquad f_Y(y) = \frac{1}{4\sqrt{y}}\,e^{-\sqrt{y}}, \quad y > 1.

See Figure 1.4 for a sketch of f_Y and F_Y. (You can take f_Y(y) to be anything you like at y = 1 because {Y = 1} has probability zero.)

[Figure 1.4: sketches of f_Y(y) and F_Y(y); F_Y(y) = \frac{1}{2}\sqrt{y} + \frac{1}{2}(1 - e^{-\sqrt{y}}) for 0 \le y \le 1, and F_Y(y) = \frac{1}{2} + \frac{1}{2}(1 - e^{-\sqrt{y}}) for y > 1.]

The density function method finds f_Y directly, and then F_Y by integration; see Figure 1.5. We have f_Y(y)|dy| = f_X(\sqrt{y})\,dx + f_X(-\sqrt{y})\,dx; we write |dy| because probabilities are never negative. Thus

    f_Y(y) = \frac{f_X(\sqrt{y})}{|dy/dx|_{x=\sqrt{y}}} + \frac{f_X(-\sqrt{y})}{|dy/dx|_{x=-\sqrt{y}}}

with y = x^2, dy/dx = 2x, so

    f_Y(y) = \frac{f_X(\sqrt{y})}{2\sqrt{y}} + \frac{f_X(-\sqrt{y})}{2\sqrt{y}}.

(Note that |-2\sqrt{y}| = 2\sqrt{y}.) We have f_Y(y) = 0 for y < 0.

Case 1. 0 < y < 1 (see Figure 1.1). Both terms contribute, with f_X(\sqrt{y}) = \frac{1}{2}e^{-\sqrt{y}} and f_X(-\sqrt{y}) = \frac{1}{2}, so f_Y(y) = \frac{1}{4\sqrt{y}}(1 + e^{-\sqrt{y}}).


Case 2. y > 1 (see Figure 1.3).

    f_Y(y) = \frac{1}{2}e^{-\sqrt{y}}\,\frac{1}{2\sqrt{y}} + 0 = \frac{1}{4\sqrt{y}}\,e^{-\sqrt{y}}

as before.

[Figure 1.5: the graph of Y = X^2, showing the two x-values -\sqrt{y} and \sqrt{y} that map to a given y.]

The distribution function method generalizes to situations where we have a single output but more than one input. For example, let X and Y be independent, each uniformly distributed on [0, 1]. The distribution function of Z = X + Y is

    F_Z(z) = P\{X + Y \le z\} = \iint_{x+y \le z} f_{XY}(x, y)\,dx\,dy

with f_{XY}(x, y) = f_X(x)f_Y(y) by independence. Now F_Z(z) = 0 for z < 0 and F_Z(z) = 1 for z > 2 (because 0 \le Z \le 2).

Case 1. If 0 \le z \le 1, then F_Z(z) is the shaded area in Figure 1.6, which is z^2/2.
Case 2. If 1 \le z \le 2, then F_Z(z) is the shaded area in Figure 1.7, which is 1 - [(2 - z)^2/2].

Thus (see Figure 1.8)

    f_Z(z) = \begin{cases} z, & 0 \le z \le 1 \\ 2 - z, & 1 \le z \le 2 \\ 0 & \text{elsewhere.} \end{cases}
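As a quick numerical check (a sketch using NumPy; the sample size is arbitrary), simulated values of Z = X + Y can be compared with the triangular density just derived:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)
y = rng.uniform(0, 1, 100_000)
z = x + y

# Compare the empirical histogram of Z with the triangular density above.
edges = np.linspace(0, 2, 21)
hist, _ = np.histogram(z, bins=edges, density=True)
mid = (edges[:-1] + edges[1:]) / 2
f = np.where(mid <= 1, mid, 2 - mid)   # f_Z(z) = z on [0,1], 2 - z on [1,2]
print(np.max(np.abs(hist - f)))        # small (on the order of 0.01)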

    Problems

1. Let X, Y, Z be independent, identically distributed (from now on, abbreviated iid) random variables, each with density f(x) = 6x^5 for 0 \le x \le 1, and 0 elsewhere. Find the distribution and density functions of the maximum of X, Y and Z.

2. Let X and Y be independent, each with density e^{-x}, x \ge 0. Find the distribution (from now on, an abbreviation for "find the distribution or density function") of Z = Y/X.

3. A discrete random variable X takes values x_1, \ldots, x_n, each with probability 1/n. Let Y = g(X) where g is an arbitrary real-valued function. Express the probability function of Y (p_Y(y) = P\{Y = y\}) in terms of g and the x_i.


[Figures 1.6 and 1.7: the unit square in the x-y plane with the region x + y \le z shaded; for 0 \le z \le 1 the shaded region is a triangle with legs of length z, and for 1 \le z \le 2 the unshaded corner above the line x + y = z is a triangle with legs of length 2 - z.]

[Figure 1.8: the triangular density f_Z(z), rising from 0 at z = 0 to 1 at z = 1 and falling back to 0 at z = 2.]

4. A random variable X has density f(x) = ax^2 on the interval [0, b]. Find the density of Y = X^3.

5. The Cauchy density is given by f(y) = 1/[\pi(1 + y^2)] for all real y. Show that one way to produce this density is to take the tangent of a random variable X that is uniformly distributed between -\pi/2 and \pi/2.


    Lecture 2. Jacobians

We need this idea to generalize the density function method to problems where there are k inputs and k outputs, with k \ge 2. However, if there are k inputs and j < k outputs, often extra outputs can be introduced, as we will see later in the lecture.

    2.1 The Setup

Let X = X(U, V), Y = Y(U, V). Assume a one-to-one transformation, so that we can solve for U and V. Thus U = U(X, Y), V = V(X, Y). Look at Figure 2.1. If u changes by du then x changes by (\partial x/\partial u)\,du and y changes by (\partial y/\partial u)\,du. Similarly, if v changes by dv then x changes by (\partial x/\partial v)\,dv and y changes by (\partial y/\partial v)\,dv. The small rectangle in the u-v plane corresponds to a small parallelogram in the x-y plane (Figure 2.2), with A = (\partial x/\partial u, \partial y/\partial u, 0)\,du and B = (\partial x/\partial v, \partial y/\partial v, 0)\,dv. The area of the parallelogram is

|A \times B|, and

    A \times B = \begin{vmatrix} I & J & K \\ \partial x/\partial u & \partial y/\partial u & 0 \\ \partial x/\partial v & \partial y/\partial v & 0 \end{vmatrix} du\,dv = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} du\,dv\,K.

(A determinant is unchanged if we transpose the matrix, i.e., interchange rows and columns.)

[Figure 2.1: a small rectangle R with sides du and dv in the u-v plane; a change du moves (x, y) by ((\partial x/\partial u)\,du, (\partial y/\partial u)\,du) in the x-y plane.]

[Figure 2.2: the image S of R in the x-y plane, a small parallelogram spanned by the vectors A and B.]

    2.2 Definition and Discussion

The Jacobian of the transformation is

    J = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}, \quad \text{written as} \quad \frac{\partial(x, y)}{\partial(u, v)}.


Thus |A \times B| = |J|\,du\,dv. Now P\{(X, Y) \in S\} = P\{(U, V) \in R\}; in other words, f_{XY}(x, y) times the area of S is f_{UV}(u, v) times the area of R. Thus

    f_{XY}(x, y)\,|J|\,du\,dv = f_{UV}(u, v)\,du\,dv

and

    f_{UV}(u, v) = f_{XY}(x, y)\left| \frac{\partial(x, y)}{\partial(u, v)} \right|.

The absolute value of the Jacobian \partial(x, y)/\partial(u, v) gives a magnification factor for area in going from u-v coordinates to x-y coordinates. The magnification factor going the other way is |\partial(u, v)/\partial(x, y)|. But the magnification factor from u-v to u-v is 1, so

    f_{UV}(u, v) = \frac{f_{XY}(x, y)}{|\partial(u, v)/\partial(x, y)|}.

In this formula, we must substitute x = x(u, v), y = y(u, v) to express the final result in terms of u and v.

In three dimensions, a small rectangular box with volume du\,dv\,dw corresponds to a parallelepiped in xyz space, determined by vectors

    A = \left( \frac{\partial x}{\partial u}, \frac{\partial y}{\partial u}, \frac{\partial z}{\partial u} \right) du, \quad B = \left( \frac{\partial x}{\partial v}, \frac{\partial y}{\partial v}, \frac{\partial z}{\partial v} \right) dv, \quad C = \left( \frac{\partial x}{\partial w}, \frac{\partial y}{\partial w}, \frac{\partial z}{\partial w} \right) dw.

The volume of the parallelepiped is the absolute value of the dot product of A with B \times C, and the dot product can be written as a determinant with rows (or columns) A, B, C. This determinant is the Jacobian of x, y, z with respect to u, v, w [written \partial(x, y, z)/\partial(u, v, w)], times du\,dv\,dw. The volume magnification from uvw to xyz space is |\partial(x, y, z)/\partial(u, v, w)| and we have

    f_{UVW}(u, v, w) = \frac{f_{XYZ}(x, y, z)}{|\partial(u, v, w)/\partial(x, y, z)|}

with x = x(u, v, w), y = y(u, v, w), z = z(u, v, w).

The Jacobian technique extends to higher dimensions. The transformation formula is a natural generalization of the two- and three-dimensional cases:

    f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = \frac{f_{X_1 \cdots X_n}(x_1, \ldots, x_n)}{|\partial(y_1, \ldots, y_n)/\partial(x_1, \ldots, x_n)|}

where

    \frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} = \begin{vmatrix} \partial y_1/\partial x_1 & \cdots & \partial y_1/\partial x_n \\ \vdots & & \vdots \\ \partial y_n/\partial x_1 & \cdots & \partial y_n/\partial x_n \end{vmatrix}.

To help you remember the formula, think f(y)\,dy = f(x)\,dx.


    2.3 A Typical Application

Let X and Y be independent, positive random variables with densities f_X and f_Y, and let Z = XY. We find the density of Z by introducing a new random variable W, as follows:

    Z = XY, \quad W = Y

(W = X would be equally good). The transformation is one-to-one because we can solve for X, Y in terms of Z, W by X = Z/W, Y = W. In a problem of this type, we must always pay attention to the range of the variables: x > 0, y > 0 is equivalent to z > 0, w > 0. Now

    f_{ZW}(z, w) = \frac{f_{XY}(x, y)}{|\partial(z, w)/\partial(x, y)|}\bigg|_{x=z/w,\,y=w}

with

    \frac{\partial(z, w)}{\partial(x, y)} = \begin{vmatrix} \partial z/\partial x & \partial z/\partial y \\ \partial w/\partial x & \partial w/\partial y \end{vmatrix} = \begin{vmatrix} y & x \\ 0 & 1 \end{vmatrix} = y.

Thus

    f_{ZW}(z, w) = \frac{f_X(x)f_Y(y)}{w} = \frac{f_X(z/w)f_Y(w)}{w}

and we are left with the problem of finding the marginal density from a joint density:

    f_Z(z) = \int_{-\infty}^{\infty} f_{ZW}(z, w)\,dw = \int_0^{\infty} \frac{1}{w} f_X(z/w) f_Y(w)\,dw.
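This integral formula can be checked numerically; the sketch below (not from the notes) takes X and Y to be standard exponentials, evaluates the integral with SciPy's quad, and compares with a simulation:

import numpy as np
from scipy.integrate import quad

def f_exp(t):
    # standard exponential density
    return np.exp(-t) * (t > 0)

def f_Z(z):
    # f_Z(z) = integral over w of (1/w) f_X(z/w) f_Y(w) dw
    val, _ = quad(lambda w: f_exp(z / w) * f_exp(w) / w, 0, np.inf)
    return val

rng = np.random.default_rng(1)
z_samples = rng.exponential(size=200_000) * rng.exponential(size=200_000)
# Compare P{Z <= 1} from the formula with the simulated relative frequency.
prob, _ = quad(f_Z, 0, 1)
print(prob, np.mean(z_samples <= 1))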

    Problems

1. The joint density of two random variables X_1 and X_2 is f(x_1, x_2) = 2e^{-x_1}e^{-x_2}, where 0 < x_1 < x_2 < \infty (and 0 elsewhere). Consider the transformation Y_1 = 2X_1, Y_2 = X_2 - X_1. Find the joint density of Y_1 and Y_2, and show that Y_1 and Y_2 are independent.

3. The transformation equations are given by Y_1 = X_1/(X_1 + X_2), Y_2 = (X_1 + X_2)/(X_1 + X_2 + X_3), Y_3 = X_1 + X_2 + X_3. As before, find the joint density of the Y_i and show that Y_1, Y_2 and Y_3 are independent.

    Comments on the Problem Set

In Problem 3, notice that Y_1Y_2Y_3 = X_1 and Y_2Y_3 = X_1 + X_2, so X_2 = Y_2Y_3 - Y_1Y_2Y_3 and X_3 = (X_1 + X_2 + X_3) - (X_1 + X_2) = Y_3 - Y_2Y_3.

If f_{XY}(x, y) = g(x)h(y) for all x, y, then X and Y are independent, because

    f(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{g(x)h(y)}{g(x)\int_{-\infty}^{\infty} h(y)\,dy}


which does not depend on x. The set of points where g(x) = 0 (equivalently f_X(x) = 0) can be ignored because it has probability zero. It is important to realize that in this argument, "for all x, y" means that x and y must be allowed to vary independently of each other, so the set of possible x and y must be of the rectangular form a < x < b, c < y < d. (The constants a, b, c, d can be infinite.) For example, if f_{XY}(x, y) = 2e^{-x}e^{-y}, 0 < y < x, and 0 elsewhere, then X and Y are not independent. Knowing x forces 0 < y < x, so the conditional distribution of Y given X = x certainly depends on x. Note that f_{XY}(x, y) is not a function of x alone times a function of y alone. We have

    f_{XY}(x, y) = 2e^{-x}e^{-y}\,I[0 < y < x]

where the indicator I is 1 for 0 < y < x and 0 elsewhere.

In Jacobian problems, pay close attention to the range of the variables. For example, in Problem 1 we have y_1 = 2x_1, y_2 = x_2 - x_1, so x_1 = y_1/2, x_2 = (y_1/2) + y_2. From these equations it follows that 0 < x_1 < x_2 < \infty is equivalent to y_1 > 0, y_2 > 0.


    Lecture 3. Moment-Generating Functions

    3.1 Definition

The moment-generating function of a random variable X is defined by

    M(t) = M_X(t) = E[e^{tX}]

where t is a real number. To see the reason for the terminology, note that M(t) is the expectation of 1 + tX + t^2X^2/2! + t^3X^3/3! + \cdots. If \mu_n = E(X^n), the n-th moment of X, and we can take the expectation term by term, then

    M(t) = 1 + \mu_1 t + \frac{\mu_2 t^2}{2!} + \cdots + \frac{\mu_n t^n}{n!} + \cdots.

Since the coefficient of t^n in the Taylor expansion is M^{(n)}(0)/n!, where M^{(n)} is the n-th derivative of M, we have

    \mu_n = M^{(n)}(0).

    3.2 The Key Theorem

If Y = \sum_{i=1}^n X_i where X_1, \ldots, X_n are independent, then M_Y(t) = \prod_{i=1}^n M_{X_i}(t).

Proof. First note that if X and Y are independent, then

    E[g(X)h(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y)f_{XY}(x, y)\,dx\,dy.

Since f_{XY}(x, y) = f_X(x)f_Y(y), the double integral becomes

    \int_{-\infty}^{\infty} g(x)f_X(x)\,dx \int_{-\infty}^{\infty} h(y)f_Y(y)\,dy = E[g(X)]E[h(Y)]

and similarly for more than two random variables. Now if Y = X_1 + \cdots + X_n with the X_i independent, we have

    M_Y(t) = E[e^{tY}] = E[e^{tX_1} \cdots e^{tX_n}] = E[e^{tX_1}] \cdots E[e^{tX_n}] = M_{X_1}(t) \cdots M_{X_n}(t).

    3.3 The Main Application

Given independent random variables X_1, \ldots, X_n with densities f_1, \ldots, f_n respectively, find the density of Y = \sum_{i=1}^n X_i.

Step 1. Compute M_i(t), the moment-generating function of X_i, for each i.

Step 2. Compute M_Y(t) = \prod_{i=1}^n M_i(t).

Step 3. From M_Y(t) find f_Y(y).

This technique is known as a transform method. Notice that the moment-generating function and the density of a random variable are related by M(t) = \int_{-\infty}^{\infty} e^{tx}f(x)\,dx. With t replaced by -s we have a Laplace transform, and with t replaced by it we have a Fourier transform. The strategy works because at Step 3, the moment-generating function determines the density uniquely. (This is a theorem from Laplace or Fourier transform theory.)
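A small simulation sketch of the key theorem in (3.2), with exponential summands chosen arbitrarily: the empirical moment-generating function of the sum is close to the product of the individual ones, and to the exact gamma MGF.

import numpy as np

rng = np.random.default_rng(2)
n, t = 3, 0.4                      # t must be less than 1 here (exponential with beta = 1)
x = rng.exponential(size=(n, 500_000))
y = x.sum(axis=0)

emp_sum = np.mean(np.exp(t * y))                              # estimate of M_Y(t)
emp_prod = np.prod([np.mean(np.exp(t * xi)) for xi in x])     # product of estimated M_{X_i}(t)
exact = (1 - t) ** (-n)                                       # gamma(alpha = n, beta = 1) MGF
print(emp_sum, emp_prod, exact)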

  • 5/19/2018 Ash - 2007 - Lectures On Statistics.pdf

    11/113

    10

    3.4 Examples

1. Bernoulli Trials. Let X be the number of successes in n trials with probability of success p on a given trial. Then X = X_1 + \cdots + X_n, where X_i = 1 if there is a success on trial i and X_i = 0 if there is a failure on trial i. Thus

    M_i(t) = E[e^{tX_i}] = P\{X_i = 1\}e^{t\cdot 1} + P\{X_i = 0\}e^{t\cdot 0} = pe^t + q

with p + q = 1. The moment-generating function of X is

    M_X(t) = (pe^t + q)^n = \sum_{k=0}^n \binom{n}{k} p^k q^{n-k} e^{tk}.

This could have been derived directly:

    M_X(t) = E[e^{tX}] = \sum_{k=0}^n P\{X = k\}e^{tk} = \sum_{k=0}^n \binom{n}{k} p^k q^{n-k} e^{tk} = (pe^t + q)^n

    by the binomial theorem.

2. Poisson. We have P\{X = k\} = e^{-\lambda}\lambda^k/k!, k = 0, 1, 2, \ldots. Thus

    M(t) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!} e^{tk} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = \exp(-\lambda)\exp(\lambda e^t) = \exp[\lambda(e^t - 1)].

We can compute the mean and variance from the moment-generating function:

    E(X) = M'(0) = [\exp(\lambda(e^t - 1))\lambda e^t]_{t=0} = \lambda.

Let h(\lambda, t) = \exp[\lambda(e^t - 1)]. Then

    E(X^2) = M''(0) = [h(\lambda, t)\lambda e^t + \lambda e^t h(\lambda, t)\lambda e^t]_{t=0} = \lambda + \lambda^2

hence

    Var\,X = E(X^2) - [E(X)]^2 = \lambda + \lambda^2 - \lambda^2 = \lambda.

3. Normal (0,1). The moment-generating function is

    M(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.

Now -(x^2/2) + tx = -(1/2)(x^2 - 2tx + t^2 - t^2) = -(1/2)(x - t)^2 + (1/2)t^2, so

    M(t) = e^{t^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp[-(x - t)^2/2]\,dx.

The integral is the area under a normal density (mean t, variance 1), which is 1. Consequently,

    M(t) = e^{t^2/2}.


4. Normal (\mu, \sigma^2). If X is normal (\mu, \sigma^2), then Y = (X - \mu)/\sigma is normal (0,1). This is a good application of the density function method from Lecture 1:

    f_Y(y) = \frac{f_X(x)}{|dy/dx|}\bigg|_{x=\mu+\sigma y} = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}.

We have X = \mu + \sigma Y, so

    M_X(t) = E[e^{tX}] = e^{\mu t}E[e^{t\sigma Y}] = e^{\mu t}M_Y(\sigma t).

Thus

    M_X(t) = e^{\mu t}e^{t^2\sigma^2/2}.

Remember this technique, which is especially useful when Y = aX + b and the moment-generating function of X is known.

    3.5 Theorem

If X is normal (\mu, \sigma^2) and Y = aX + b, then Y is normal (a\mu + b, a^2\sigma^2).

Proof. We compute

    M_Y(t) = E[e^{tY}] = E[e^{t(aX+b)}] = e^{bt}M_X(at) = e^{bt}e^{\mu at}e^{a^2t^2\sigma^2/2}.

Thus

    M_Y(t) = \exp[t(a\mu + b)]\exp(t^2a^2\sigma^2/2).

    Here is another basic result.

    3.6 Theorem

Let X_1, \ldots, X_n be independent, with X_i normal (\mu_i, \sigma_i^2). Then Y = \sum_{i=1}^n X_i is normal with mean \mu = \sum_{i=1}^n \mu_i and variance \sigma^2 = \sum_{i=1}^n \sigma_i^2.

Proof. The moment-generating function of Y is

    M_Y(t) = \prod_{i=1}^n \exp(t\mu_i + t^2\sigma_i^2/2) = \exp(t\mu + t^2\sigma^2/2).

    A similar argument works for the Poisson distribution; see Problem 4.

3.7 The Gamma Distribution

First, we define the gamma function \Gamma(\alpha) = \int_0^{\infty} y^{\alpha-1}e^{-y}\,dy, \alpha > 0. We need three properties:

(a) \Gamma(\alpha + 1) = \alpha\Gamma(\alpha), the recursion formula;

(b) \Gamma(n + 1) = n!, n = 0, 1, 2, \ldots;


(c) \Gamma(1/2) = \sqrt{\pi}.

To prove (a), integrate by parts: \Gamma(\alpha) = \int_0^{\infty} e^{-y}\,d(y^{\alpha}/\alpha). Part (b) is a special case of (a). For (c) we make the change of variable y = z^2/2 and compute

    \Gamma(1/2) = \int_0^{\infty} y^{-1/2}e^{-y}\,dy = \int_0^{\infty} \sqrt{2}\,z^{-1}e^{-z^2/2}\,z\,dz.

The second integral is \sqrt{2}\,\sqrt{2\pi} times half the area under the normal (0,1) density, that is, \Gamma(1/2) = \sqrt{2}\,\sqrt{2\pi}/2 = \sqrt{\pi}.

The gamma density is

    f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}, \quad x > 0,

where \alpha and \beta are positive constants. The moment-generating function is

    M(t) = \int_0^{\infty} [\Gamma(\alpha)\beta^{\alpha}]^{-1}x^{\alpha-1}e^{tx}e^{-x/\beta}\,dx.

Change variables via y = (-t + (1/\beta))x to get

    \int_0^{\infty} [\Gamma(\alpha)\beta^{\alpha}]^{-1}\left( \frac{y}{-t + (1/\beta)} \right)^{\alpha-1}e^{-y}\,\frac{dy}{-t + (1/\beta)}

which reduces to

    \frac{1}{\beta^{\alpha}}\,\frac{1}{(-t + (1/\beta))^{\alpha}} = (1 - \beta t)^{-\alpha}.

In this argument, t must be less than 1/\beta so that the integrals will be finite.

Since M(0) = \int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} f(x)\,dx in this case, with f \ge 0, M(0) = 1 implies that we have a legal probability density. As before, moments can be calculated efficiently from the moment-generating function:

    E(X) = M'(0) = \alpha(1 - \beta t)^{-\alpha-1}(\beta)\big|_{t=0} = \alpha\beta;

    E(X^2) = M''(0) = \alpha(\alpha + 1)(1 - \beta t)^{-\alpha-2}\beta^2\big|_{t=0} = \alpha(\alpha + 1)\beta^2.

Thus

    Var\,X = E(X^2) - [E(X)]^2 = \alpha\beta^2.

    3.8 Special Cases

The exponential density is a gamma density with \alpha = 1: f(x) = (1/\beta)e^{-x/\beta}, x \ge 0, with E(X) = \beta, E(X^2) = 2\beta^2, Var\,X = \beta^2.


A random variable X has the chi-square density with r degrees of freedom (X = \chi^2(r) for short, where r is a positive integer) if its density is gamma with \alpha = r/2 and \beta = 2. Thus

    f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}\,x^{(r/2)-1}e^{-x/2}, \quad x \ge 0

and

    M(t) = \frac{1}{(1 - 2t)^{r/2}}, \quad t < 1/2.

Therefore E[\chi^2(r)] = \alpha\beta = r and Var[\chi^2(r)] = \alpha\beta^2 = 2r.

    3.9 Lemma

If X is normal (0,1) then X^2 is \chi^2(1).

Proof. We compute the moment-generating function of X^2 directly:

    M_{X^2}(t) = E[e^{tX^2}] = \int_{-\infty}^{\infty} e^{tx^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.

Let y = \sqrt{1 - 2t}\,x; the integral becomes

    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,\frac{dy}{\sqrt{1 - 2t}} = (1 - 2t)^{-1/2}

which is \chi^2(1).

    3.10 Theorem

If X_1, \ldots, X_n are independent, each normal (0,1), then Y = \sum_{i=1}^n X_i^2 is \chi^2(n).

Proof. By (3.9), each X_i^2 is \chi^2(1) with moment-generating function (1 - 2t)^{-1/2}. Thus M_Y(t) = (1 - 2t)^{-n/2} for t < 1/2, which is the moment-generating function of \chi^2(n).


    3.12 The Poisson Process

This process occurs in many physical situations, and provides an application of the gamma distribution. For example, particles can arrive at a counting device, customers at a serving counter, airplanes at an airport, or phone calls at a telephone exchange. Divide the time interval [0, t] into a large number n of small subintervals of length dt, so that n\,dt = t. If I_i, i = 1, \ldots, n, is one of the small subintervals, we make the following assumptions:

(1) The probability of exactly one arrival in I_i is \lambda\,dt, where \lambda is a constant.
(2) The probability of no arrivals in I_i is 1 - \lambda\,dt.
(3) The probability of more than one arrival in I_i is zero.
(4) If A_i is the event of an arrival in I_i, then the A_i, i = 1, \ldots, n are independent.

As a consequence of these assumptions, we have n = t/dt Bernoulli trials with probability of success p = \lambda\,dt on a given trial. As dt \to 0 we have n \to \infty and p \to 0, with np = \lambda t. We conclude that the number N[0, t] of arrivals in [0, t] is Poisson (\lambda t):

    P\{N[0, t] = k\} = e^{-\lambda t}(\lambda t)^k/k!, \quad k = 0, 1, 2, \ldots.

Since E(N[0, t]) = \lambda t, we may interpret \lambda as the average number of arrivals per unit time.

Now let W_1 be the waiting time for the first arrival. Then

    P\{W_1 > t\} = P\{\text{no arrival in } [0, t]\} = P\{N[0, t] = 0\} = e^{-\lambda t}, \quad t \ge 0.

Thus F_{W_1}(t) = 1 - e^{-\lambda t} and f_{W_1}(t) = \lambda e^{-\lambda t}, t \ge 0. From the formulas for the mean and variance of an exponential random variable we have E(W_1) = 1/\lambda and Var\,W_1 = 1/\lambda^2.

Let W_k be the (total) waiting time for the k-th arrival. Then W_k is the waiting time for the first arrival, plus the time after the first up to the second arrival, plus \cdots plus the time after arrival k-1 up to the k-th arrival. Thus W_k is the sum of k independent exponential random variables, and

    M_{W_k}(t) = \frac{1}{(1 - (t/\lambda))^k}

so W_k is gamma with \alpha = k, \beta = 1/\lambda. Therefore

    f_{W_k}(t) = \frac{1}{(k-1)!}\,\lambda^k t^{k-1}e^{-\lambda t}, \quad t \ge 0.
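A short simulation sketch of the waiting-time result (\lambda and k chosen arbitrarily): W_k, built as a sum of k independent exponentials, has an empirical distribution matching SciPy's gamma distribution function.

import numpy as np
from scipy.stats import gamma

lam, k = 2.0, 5
rng = np.random.default_rng(3)
# W_k = sum of k independent exponential(1/lam) interarrival times
w = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

for t in (1.0, 2.5, 4.0):
    print(np.mean(w <= t), gamma.cdf(t, a=k, scale=1 / lam))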

    Problems

1. Let X_1 and X_2 be independent, and assume that X_1 is \chi^2(r_1) and Y = X_1 + X_2 is \chi^2(r), where r > r_1. Show that X_2 is \chi^2(r_2), where r_2 = r - r_1.

2. Let X_1 and X_2 be independent, with X_i gamma with parameters \alpha_i and \beta_i, i = 1, 2. If c_1 and c_2 are positive constants, find convenient sufficient conditions under which c_1X_1 + c_2X_2 will also have the gamma distribution.

3. If X_1, \ldots, X_n are independent random variables with moment-generating functions M_1, \ldots, M_n, and c_1, \ldots, c_n are constants, express the moment-generating function M of c_1X_1 + \cdots + c_nX_n in terms of the M_i.


4. If X_1, \ldots, X_n are independent, with X_i Poisson (\lambda_i), i = 1, \ldots, n, show that the sum Y = \sum_{i=1}^n X_i has the Poisson distribution with parameter \lambda = \sum_{i=1}^n \lambda_i.

5. An unbiased coin is tossed independently n_1 times and then again tossed independently n_2 times. Let X_1 be the number of heads in the first experiment, and X_2 the number of tails in the second experiment. Without using moment-generating functions, in fact without any calculation at all, find the distribution of X_1 + X_2.


    Lecture 4. Sampling From a Normal Population

    4.1 Definitions and Comments

Let X_1, \ldots, X_n be iid. The sample mean of the X_i is

    \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i

and the sample variance is

    S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.

If the X_i have mean \mu and variance \sigma^2, then

    E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}\,n\mu = \mu

and

    Var\,\bar{X} = \frac{1}{n^2}\sum_{i=1}^n Var\,X_i = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \to 0 \text{ as } n \to \infty.

Thus \bar{X} is a good estimate of \mu. (For large n, the variance of \bar{X} is small, so \bar{X} is concentrated near its mean.) The sample variance is an average squared deviation from the sample mean, but it is a biased estimate of the true variance \sigma^2:

    E[(X_i - \bar{X})^2] = E[(X_i - \mu) - (\bar{X} - \mu)]^2 = Var\,X_i + Var\,\bar{X} - 2E[(X_i - \mu)(\bar{X} - \mu)].

Notice the centralizing technique. We subtract and add back the mean of X_i, which will make the cross terms easier to handle when squaring. The above expression simplifies to

    \sigma^2 + \frac{\sigma^2}{n} - 2E\Big[(X_i - \mu)\,\frac{1}{n}\sum_{j=1}^n (X_j - \mu)\Big] = \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n}E[(X_i - \mu)^2].

Thus

    E[(X_i - \bar{X})^2] = \sigma^2\Big(1 + \frac{1}{n} - \frac{2}{n}\Big) = \frac{n-1}{n}\,\sigma^2.

Consequently, E(S^2) = (n-1)\sigma^2/n, not \sigma^2. Some books define the sample variance as

    \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n}{n-1}\,S^2

where S^2 is our sample variance. This adjusted estimate of the true variance is unbiased (its expectation is \sigma^2), but biased does not mean bad. If we measure performance by asking for a small mean square error, the biased estimate is better in the normal case, as we will see at the end of the lecture.
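A quick simulation sketch of the bias computation (normal data with arbitrary parameters): averaging S^2 over many samples gives roughly (n-1)\sigma^2/n, while the n/(n-1) version averages roughly \sigma^2.

import numpy as np

rng = np.random.default_rng(4)
n, mu, sigma = 5, 1.0, 2.0
x = rng.normal(mu, sigma, size=(200_000, n))

s2 = x.var(axis=1)                               # divides by n, matching the S^2 above
print(s2.mean(), (n - 1) / n * sigma**2)         # both near 3.2
print((n / (n - 1)) * s2.mean(), sigma**2)       # both near 4.0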


    4.2 The Normal Case

We now assume that the X_i are normally distributed, and find the distribution of S^2. Let y_1 = \bar{x} = (x_1 + \cdots + x_n)/n, y_2 = x_2 - \bar{x}, \ldots, y_n = x_n - \bar{x}. Then y_1 + y_2 = x_2, y_1 + y_3 = x_3, \ldots, y_1 + y_n = x_n. Add these equations to get (n-1)y_1 + y_2 + \cdots + y_n = x_2 + \cdots + x_n, or

    ny_1 + (y_2 + \cdots + y_n) = (x_2 + \cdots + x_n) + y_1.    (1)

But ny_1 = n\bar{x} = x_1 + \cdots + x_n, so by cancelling x_2, \ldots, x_n in (1), x_1 + (y_2 + \cdots + y_n) = y_1. Thus we can solve for the x's in terms of the y's:

    x_1 = y_1 - y_2 - \cdots - y_n
    x_2 = y_1 + y_2
    x_3 = y_1 + y_3                                              (2)
    \vdots
    x_n = y_1 + y_n

The Jacobian of the transformation is

    d_n = \frac{\partial(x_1, \ldots, x_n)}{\partial(y_1, \ldots, y_n)} = \begin{vmatrix} 1 & -1 & -1 & \cdots & -1 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{vmatrix}

To see the pattern, look at the 4 by 4 case and expand via the last row:

    \begin{vmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{vmatrix} = (-1)\begin{vmatrix} -1 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix} + \begin{vmatrix} 1 & -1 & -1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{vmatrix}

so d_4 = 1 + d_3. In general, d_n = 1 + d_{n-1}, and since d_2 = 2 by inspection, we have d_n = n for all n \ge 2. Now

    \sum_{i=1}^n (x_i - \mu)^2 = \sum (x_i - \bar{x} + \bar{x} - \mu)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2    (3)

because \sum (x_i - \bar{x}) = 0. By (2), x_1 - \bar{x} = x_1 - y_1 = -y_2 - \cdots - y_n and x_i - \bar{x} = x_i - y_1 = y_i for i = 2, \ldots, n. (Remember that y_1 = \bar{x}.) Thus

    \sum_{i=1}^n (x_i - \bar{x})^2 = (-y_2 - \cdots - y_n)^2 + \sum_{i=2}^n y_i^2    (4)

Now

    f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = n\,f_{X_1 \cdots X_n}(x_1, \ldots, x_n).


By (3) and (4), the right side becomes, in terms of the y_i's,

    n\left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n \exp\left[ -\frac{1}{2\sigma^2}\left( \Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2 + n(y_1 - \mu)^2 \right) \right].

The joint density of Y_1, \ldots, Y_n is a function of y_1 times a function of (y_2, \ldots, y_n), so Y_1 and (Y_2, \ldots, Y_n) are independent. Since \bar{X} = Y_1 and [by (4)] S^2 is a function of (Y_2, \ldots, Y_n),

    \bar{X} and S^2 are independent.

Dividing Equation (3) by \sigma^2 we have

    \sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma} \right)^2 = \frac{nS^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2.

But (X_i - \mu)/\sigma is normal (0,1) and

    \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \text{ is normal (0,1)}

so \chi^2(n) = (nS^2/\sigma^2) + \chi^2(1) with the two random variables on the right independent. If M(t) is the moment-generating function of nS^2/\sigma^2, then (1 - 2t)^{-n/2} = M(t)(1 - 2t)^{-1/2}. Therefore M(t) = (1 - 2t)^{-(n-1)/2}, i.e.,

    \frac{nS^2}{\sigma^2} \text{ is } \chi^2(n-1).

The random variable

    T = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}

is useful in situations where \mu is to be estimated but the true variance \sigma^2 is unknown. It turns out that T has a T distribution, which we study in the next lecture.

    4.3 Performance of Various Estimates

Let S^2 be the sample variance of iid normal (\mu, \sigma^2) random variables X_1, \ldots, X_n. We will look at estimates of \sigma^2 of the form cS^2, where c is a constant. Once again employing the centralizing technique, we write

    E[(cS^2 - \sigma^2)^2] = E[(cS^2 - cE(S^2) + cE(S^2) - \sigma^2)^2]

which simplifies to

    c^2\,Var\,S^2 + (cE(S^2) - \sigma^2)^2.


Since nS^2/\sigma^2 is \chi^2(n-1), which has variance 2(n-1), we have n^2(Var\,S^2)/\sigma^4 = 2(n-1). Also nE(S^2)/\sigma^2 is the mean of \chi^2(n-1), which is n-1. (Or we can recall from (4.1) that E(S^2) = (n-1)\sigma^2/n.) Thus the mean square error is

    c^2\,\frac{2\sigma^4(n-1)}{n^2} + \left( c\,\frac{(n-1)}{n}\,\sigma^2 - \sigma^2 \right)^2.

We can drop the \sigma^4 and use n^2 as a common denominator, which can also be dropped. We are then trying to minimize

    2c^2(n-1) + c^2(n-1)^2 - 2c(n-1)n + n^2.

Differentiate with respect to c and set the result equal to zero:

    4c(n-1) + 2c(n-1)^2 - 2(n-1)n = 0.

Dividing by 2(n-1), we have 2c + c(n-1) - n = 0, so c = n/(n+1). Thus the best estimate of the form cS^2 is

    \frac{1}{n+1}\sum_{i=1}^n (X_i - \bar{X})^2.

If we use S^2 then c = 1. If we use the unbiased version then c = n/(n-1). Since [n/(n+1)] < 1 < [n/(n-1)] and a quadratic function decreases as we move toward its minimum, we see that the biased estimate S^2 is better than the unbiased estimate nS^2/(n-1), but neither is optimal under the minimum mean square error criterion. Explicitly, when c = n/(n-1) we get a mean square error of 2\sigma^4/(n-1), and when c = 1 we get

    \frac{\sigma^4}{n^2}\left[ 2(n-1) + (n - 1 - n)^2 \right] = \frac{(2n-1)\sigma^4}{n^2}

which is always smaller, because [(2n-1)/n^2] < 2/(n-1) iff 2n^2 > 2n^2 - 3n + 1 iff 3n > 1, which is true for every positive integer n.

For large n all these estimates are good and the difference between their performance is small.
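For a concrete comparison, the three choices of c can be simulated (a sketch with arbitrary \sigma^2); the mean square errors come out in the order derived above, with c = n/(n+1) the smallest.

import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 5, 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(300_000, n))
s2 = x.var(axis=1)                          # S^2 with divisor n

for c in (n / (n + 1), 1.0, n / (n - 1)):
    mse = np.mean((c * s2 - sigma2) ** 2)
    print(round(c, 3), mse)
# Theoretical values: 2*sigma^4/(n+1), (2n-1)*sigma^4/n^2, 2*sigma^4/(n-1)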

    Problems

1. Let X_1, \ldots, X_n be iid, each normal (\mu, \sigma^2), and let \bar{X} be the sample mean. If c is a constant, we wish to make n large enough so that P\{\mu - c < \bar{X} < \mu + c\} \ge .954. Find the minimum value of n in terms of \sigma^2 and c. (It is independent of \mu.)

2. Let X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2} be independent random variables, with the X_i normal (\mu_1, \sigma_1^2) and the Y_i normal (\mu_2, \sigma_2^2). If \bar{X} is the sample mean of the X_i and \bar{Y} is the sample mean of the Y_i, explain how to compute the probability that \bar{X} > \bar{Y}.

3. Let X_1, \ldots, X_n be iid, each normal (\mu, \sigma^2), and let S^2 be the sample variance. Explain how to compute P\{a < S^2 < b\}.

4. Let S^2 be the sample variance of iid normal (\mu, \sigma^2) random variables X_i, i = 1, \ldots, n. Calculate the moment-generating function of S^2 and from this, deduce that S^2 has a gamma distribution.


    Lecture 5. The T and F Distributions

    5.1 Definition and Discussion

The T distribution is defined as follows. Let X_1 and X_2 be independent, with X_1 normal (0,1) and X_2 chi-square with r degrees of freedom. The random variable Y_1 = \sqrt{r}\,X_1/\sqrt{X_2} has the T distribution with r degrees of freedom.

To find the density of Y_1, let Y_2 = X_2. Then X_1 = Y_1\sqrt{Y_2}/\sqrt{r} and X_2 = Y_2. The transformation is one-to-one with -\infty < X_1 < \infty, X_2 > 0 corresponding to -\infty < Y_1 < \infty, Y_2 > 0. The Jacobian is given by

    \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \sqrt{y_2}/\sqrt{r} & y_1/(2\sqrt{r y_2}) \\ 0 & 1 \end{vmatrix} = \sqrt{y_2}/\sqrt{r}.

Thus f_{Y_1Y_2}(y_1, y_2) = f_{X_1X_2}(x_1, x_2)\sqrt{y_2}/\sqrt{r}, which upon substitution for x_1 and x_2 becomes

    \frac{1}{\sqrt{2\pi}}\exp[-y_1^2y_2/2r]\,\frac{1}{\Gamma(r/2)2^{r/2}}\,y_2^{(r/2)-1}e^{-y_2/2}\,\sqrt{y_2}/\sqrt{r}.

The density of Y_1 is

    \frac{1}{\sqrt{2\pi}\,\Gamma(r/2)2^{r/2}}\int_0^{\infty} y_2^{[(r+1)/2]-1}\exp[-(1 + (y_1^2/r))y_2/2]\,dy_2\,/\sqrt{r}.

With z = (1 + (y_1^2/r))y_2/2 and the observation that all factors of 2 cancel, this becomes (with y_1 replaced by t)

    \frac{\Gamma((r+1)/2)}{\sqrt{\pi r}\,\Gamma(r/2)}\,\frac{1}{(1 + (t^2/r))^{(r+1)/2}}, \quad -\infty < t < \infty,

the T density with r degrees of freedom.

In sampling from a normal population, (\bar{X} - \mu)/(\sigma/\sqrt{n}) is normal (0,1), and nS^2/\sigma^2 is \chi^2(n-1). Thus

    \sqrt{n-1}\,(\bar{X} - \mu)\sqrt{n}/\sigma \text{ divided by } \sqrt{n}\,S/\sigma \text{ is } T(n-1).

Since \sigma and \sqrt{n} disappear after cancellation, we have

    \frac{\bar{X} - \mu}{S/\sqrt{n-1}} \text{ is } T(n-1).

Advocates of defining the sample variance with n-1 in the denominator point out that one can simply replace \sigma by S in (\bar{X} - \mu)/(\sigma/\sqrt{n}) to get the T statistic.

Intuitively, we expect that for large n, (\bar{X} - \mu)/(S/\sqrt{n-1}) has approximately the same distribution as (\bar{X} - \mu)/(\sigma/\sqrt{n}), i.e., normal (0,1). This is in fact true, as suggested by the following computation:

    \left( 1 + \frac{t^2}{r} \right)^{-(r+1)/2} = \left[ \left( 1 + \frac{t^2}{r} \right)^r \left( 1 + \frac{t^2}{r} \right) \right]^{-1/2} \to \left[ e^{t^2}\cdot 1 \right]^{-1/2} = e^{-t^2/2}

as r \to \infty.
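A simulation sketch of the statistic (normal samples with arbitrary parameters), comparing (\bar{X} - \mu)/(S/\sqrt{n-1}) with SciPy's t distribution on n-1 degrees of freedom:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
n, mu, sigma = 8, 3.0, 2.0
x = rng.normal(mu, sigma, size=(200_000, n))
xbar = x.mean(axis=1)
s = x.std(axis=1)                          # divisor n, as in these notes
T = (xbar - mu) / (s / np.sqrt(n - 1))

for c in (1.0, 2.0, 3.0):
    print(np.mean(T <= c), t.cdf(c, df=n - 1))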


    5.2 A Preliminary Calculation

Before turning to the F distribution, we calculate the density of U = X_1/X_2 where X_1 and X_2 are independent, positive random variables. Let Y = X_2, so that X_1 = UY, X_2 = Y (X_1, X_2, U, Y are all greater than zero). The Jacobian is

    \frac{\partial(x_1, x_2)}{\partial(u, y)} = \begin{vmatrix} y & u \\ 0 & 1 \end{vmatrix} = y.

Thus f_{UY}(u, y) = f_{X_1X_2}(x_1, x_2)\,y = y\,f_{X_1}(uy)f_{X_2}(y), and the density of U is

    h(u) = \int_0^{\infty} y\,f_{X_1}(uy)f_{X_2}(y)\,dy.

Now we take X_1 to be \chi^2(m), and X_2 to be \chi^2(n). The density of X_1/X_2 is

    h(u) = \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)}\,u^{(m/2)-1}\int_0^{\infty} y^{[(m+n)/2]-1}e^{-y(1+u)/2}\,dy.

The substitution z = y(1 + u)/2 gives

    h(u) = \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)}\,u^{(m/2)-1}\int_0^{\infty} \frac{z^{[(m+n)/2]-1}}{[(1+u)/2]^{[(m+n)/2]-1}}\,e^{-z}\,\frac{2}{1+u}\,dz.

We abbreviate \Gamma(a)\Gamma(b)/\Gamma(a+b) by \beta(a, b). (We will have much more to say about this when we discuss the beta distribution later in the lecture.) The above formula simplifies to

    h(u) = \frac{1}{\beta(m/2, n/2)}\,\frac{u^{(m/2)-1}}{(1+u)^{(m+n)/2}}, \quad u \ge 0.

    5.3 Definition and Discussion

The F density is defined as follows. Let X_1 and X_2 be independent, with X_1 = \chi^2(m) and X_2 = \chi^2(n). With U as in (5.2), let

    W = \frac{X_1/m}{X_2/n} = \frac{n}{m}\,U

so that

    f_W(w) = f_U(u)\left| \frac{du}{dw} \right| = \frac{m}{n}\,f_U\left( \frac{m}{n}\,w \right).

Thus W has density

    \frac{(m/n)^{m/2}}{\beta(m/2, n/2)}\,\frac{w^{(m/2)-1}}{[1 + (m/n)w]^{(m+n)/2}}, \quad w \ge 0,

the F density with m and n degrees of freedom.


    5.4 Definitions and Calculations

The beta function is given by

    \beta(a, b) = \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx, \quad a, b > 0.

We will show that

    \beta(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}

which is consistent with our use of \beta(a, b) as an abbreviation in (5.2). We make the change of variable t = x^2 to get

    \Gamma(a) = \int_0^{\infty} t^{a-1}e^{-t}\,dt = 2\int_0^{\infty} x^{2a-1}e^{-x^2}\,dx.

We now use the familiar trick of writing \Gamma(a)\Gamma(b) as a double integral and switching to polar coordinates. Thus

    \Gamma(a)\Gamma(b) = 4\int_0^{\infty}\int_0^{\infty} x^{2a-1}y^{2b-1}e^{-(x^2+y^2)}\,dx\,dy
    = 4\int_0^{\pi/2} d\theta \int_0^{\infty} (\cos\theta)^{2a-1}(\sin\theta)^{2b-1}e^{-r^2}r^{2a+2b-1}\,dr.

The change of variable u = r^2 yields

    \int_0^{\infty} r^{2a+2b-1}e^{-r^2}\,dr = (1/2)\int_0^{\infty} u^{a+b-1}e^{-u}\,du = \Gamma(a + b)/2.

Thus

    \frac{\Gamma(a)\Gamma(b)}{2\Gamma(a + b)} = \int_0^{\pi/2} (\cos\theta)^{2a-1}(\sin\theta)^{2b-1}\,d\theta.

Let z = \cos^2\theta, 1 - z = \sin^2\theta, dz = -2\cos\theta\sin\theta\,d\theta = -2z^{1/2}(1 - z)^{1/2}\,d\theta. The above integral becomes

    -\frac{1}{2}\int_1^0 z^{a-1}(1 - z)^{b-1}\,dz = \frac{1}{2}\int_0^1 z^{a-1}(1 - z)^{b-1}\,dz = \frac{1}{2}\,\beta(a, b)

as claimed. The beta density is

    f(x) = \frac{1}{\beta(a, b)}\,x^{a-1}(1 - x)^{b-1}, \quad 0 \le x \le 1 \quad (a, b > 0).


    Problems

1. Let X have the beta distribution with parameters a and b. Find the mean and variance of X.

2. Let T have the T distribution with 15 degrees of freedom. Find the value of c which makes P\{-c \le T \le c\} = .95.

3. Let W have the F distribution with m and n degrees of freedom (abbreviated W = F(m, n)). Find the distribution of 1/W.

4. A typical table of the F distribution gives the values of c for which P\{W \le c\} = .9, .95, .975 and .99. Explain how to find the values of c for which P\{W \le c\} = .1, .05, .025 and .01. (Use the result of Problem 3.)

5. Let X have the T distribution with n degrees of freedom (abbreviated X = T(n)). Show that T^2(n) = F(1, n), in other words, T^2 has an F distribution with 1 and n degrees of freedom.

6. If X has the exponential density e^{-x}, x \ge 0, show that 2X is \chi^2(2). Deduce that the quotient of two exponential random variables is F(2, 2).


    Lecture 6. Order Statistics

    6.1 The Multinomial Formula

Suppose we pick a letter from {A, B, C}, with P(A) = p_1 = .3, P(B) = p_2 = .5, P(C) = p_3 = .2. If we do this independently 10 times, we will find the probability that the resulting sequence contains exactly 4 A's, 3 B's and 3 C's.

The probability of AAAABBBCCC, in that order, is p_1^4 p_2^3 p_3^3. To generate all favorable cases, select 4 positions out of 10 for the A's, then 3 positions out of the remaining 6 for the B's. The positions for the C's are then determined. One possibility is BCAABACCAB. The number of favorable cases is

    \binom{10}{4}\binom{6}{3} = \frac{10!}{4!\,6!}\,\frac{6!}{3!\,3!} = \frac{10!}{4!\,3!\,3!}.

Therefore the probability of exactly 4 A's, 3 B's and 3 C's is

    \frac{10!}{4!\,3!\,3!}\,(.3)^4(.5)^3(.2)^3.

In general, consider n independent trials such that on each trial, the result is exactly one of the events A_1, \ldots, A_r, with probabilities p_1, \ldots, p_r respectively. Then the probability that A_1 occurs exactly n_1 times, \ldots, A_r occurs exactly n_r times, is

    p_1^{n_1} \cdots p_r^{n_r}\,\binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3} \cdots \binom{n - n_1 - \cdots - n_{r-2}}{n_{r-1}}

which reduces to the multinomial formula

    \frac{n!}{n_1! \cdots n_r!}\,p_1^{n_1} \cdots p_r^{n_r}

where the p_i are nonnegative real numbers that sum to 1, and the n_i are nonnegative integers that sum to n.

Now let X_1, \ldots, X_n be iid, each with density f(x) and distribution function F(x). Let Y_1 < Y_2 < \cdots < Y_n be the X_i arranged in increasing order, the order statistics of the X_i. Then

    P\{Y_1 > x\} = P\{\text{all } X_i > x\} = \prod_{i=1}^n P\{X_i > x\} = [1 - F(x)]^n.


Therefore

    F_{Y_1}(x) = 1 - [1 - F(x)]^n \quad \text{and} \quad f_{Y_1}(x) = n[1 - F(x)]^{n-1}f(x).

We compute f_{Y_k}(x) by asking how it can happen that x \le Y_k \le x + dx (see Figure 6.1). There must be k-1 random variables less than x, one random variable between x and x + dx, and n-k random variables greater than x. (We are taking dx so small that the probability that more than one random variable falls in [x, x + dx] is negligible, and P\{X_i > x\} is essentially the same as P\{X_i > x + dx\}. Not everyone is comfortable with this reasoning, but the intuition is very strong and can be made precise.) By the multinomial formula,

    f_{Y_k}(x)\,dx = \frac{n!}{(k-1)!\,1!\,(n-k)!}\,[F(x)]^{k-1}f(x)\,dx\,[1 - F(x)]^{n-k}

so

    f_{Y_k}(x) = \frac{n!}{(k-1)!\,1!\,(n-k)!}\,[F(x)]^{k-1}[1 - F(x)]^{n-k}f(x).

Similar reasoning (see Figure 6.2) allows us to write down the joint density f_{Y_jY_k}(x, y) of Y_j and Y_k for j < k, namely

    \frac{n!}{(j-1)!\,(k-j-1)!\,(n-k)!}\,[F(x)]^{j-1}[F(y) - F(x)]^{k-j-1}[1 - F(y)]^{n-k}f(x)f(y)

for x < y, and 0 elsewhere. [We drop the term 1! (= 1), which we retained for emphasis in the formula for f_{Y_k}(x).]

[Figure 6.1: the real line divided at x and x + dx, with k-1 observations to the left of x, one in [x, x + dx], and n-k to the right.]

[Figure 6.2: the real line divided at x, x + dx, y, y + dy, with j-1 observations below x, one in [x, x + dx], k-j-1 between x + dx and y, one in [y, y + dy], and n-k above y + dy.]
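A simulation sketch of the f_{Y_k} formula for uniform observations; the empirical distribution of Y_k matches the beta distribution with parameters k and n-k+1 (compare Problem 3 below):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(7)
n, k = 7, 3
u = rng.uniform(size=(100_000, n))
yk = np.sort(u, axis=1)[:, k - 1]       # k-th smallest of each sample

for x in (0.2, 0.4, 0.6):
    print(np.mean(yk <= x), beta.cdf(x, k, n - k + 1))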

    Problems

1. Let Y_1 < Y_2 < Y_3 be the order statistics of X_1, X_2 and X_3, where the X_i are uniformly distributed between 0 and 1. Find the density of Z = Y_3 - Y_1.

2. The formulas derived in this lecture assume that we are in the continuous case (the distribution function F is continuous). The formulas do not apply if the X_i are discrete. Why not?


3. Consider order statistics where the X_i, i = 1, \ldots, n, are uniformly distributed between 0 and 1. Show that Y_k has a beta distribution, and express the parameters a and b in terms of k and n.

4. In Problem 3, let 0 < p < 1, and express P\{Y_k > p\} as the probability of an event associated with a sequence of n Bernoulli trials with probability of success p on a given trial. Write P\{Y_k > p\} as a finite sum involving n, p and k.


    Lecture 7. The Weak Law of Large Numbers

7.1 Chebyshev's Inequality

(a) If X \ge 0 and a > 0, then P\{X \ge a\} \le E(X)/a.
(b) If X is an arbitrary random variable, c any real number, and \epsilon > 0, m > 0, then P\{|X - c| \ge \epsilon\} \le E(|X - c|^m)/\epsilon^m.
(c) If X has finite mean \mu and finite variance \sigma^2, then P\{|X - \mu| \ge k\sigma\} \le 1/k^2.

This is a universal bound, but it may be quite weak in specific cases. For example, if X is normal (\mu, \sigma^2), abbreviated N(\mu, \sigma^2), then

    P\{|X - \mu| \ge 1.96\sigma\} = P\{|N(0, 1)| \ge 1.96\} = 2(1 - \Phi(1.96)) = .05

where \Phi is the distribution function of a normal (0,1) random variable. But the Chebyshev bound is 1/(1.96)^2 = .26.

Proof. (a) If X has density f, then

    E(X) = \int_0^{\infty} xf(x)\,dx = \int_0^a xf(x)\,dx + \int_a^{\infty} xf(x)\,dx

so

    E(X) \ge 0 + a\int_a^{\infty} f(x)\,dx = aP\{X \ge a\}.

(b) P\{|X - c| \ge \epsilon\} = P\{|X - c|^m \ge \epsilon^m\} \le E(|X - c|^m)/\epsilon^m by (a).
(c) By (b) with c = \mu, \epsilon = k\sigma, m = 2, we have

    P\{|X - \mu| \ge k\sigma\} \le \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{1}{k^2}.

    7.2 Weak Law of Large Numbers

Let X_1, \ldots, X_n be iid with finite mean \mu and finite variance \sigma^2. For large n, the arithmetic average of the observations is very likely to be very close to the true mean \mu. Formally, if S_n = X_1 + \cdots + X_n, then for any \epsilon > 0,

    P\left\{ \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right\} \to 0 \text{ as } n \to \infty.

Proof.

    P\left\{ \left| \frac{S_n}{n} - \mu \right| \ge \epsilon \right\} = P\{|S_n - n\mu| \ge n\epsilon\} \le \frac{E[(S_n - n\mu)^2]}{n^2\epsilon^2}

by Chebyshev (b). The term on the right is

    \frac{Var\,S_n}{n^2\epsilon^2} = \frac{n\sigma^2}{n^2\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0.
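A small numerical illustration of the weak law, and of how conservative the Chebyshev bound can be (a sketch with standard exponential data, so \mu = \sigma^2 = 1):

import numpy as np

rng = np.random.default_rng(8)
mu, eps = 1.0, 0.1
for n in (10, 100, 1000, 10_000):
    x = rng.exponential(size=(20_000, n))
    prob = np.mean(np.abs(x.mean(axis=1) - mu) >= eps)
    print(n, prob, 1.0 / (n * eps**2))   # empirical probability vs. Chebyshev bound sigma^2/(n eps^2)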


    7.3 Bernoulli Trials

Let X_i = 1 if there is a success on trial i, and X_i = 0 if there is a failure. Thus X_i is the indicator of a success on trial i, often written as I[Success on trial i]. Then S_n/n is the relative frequency of success, and for large n, this is very likely to be very close to the true probability p of success.

    7.4 Definitions and Comments

The convergence illustrated by the weak law of large numbers is called convergence in probability. Explicitly, S_n/n converges in probability to \mu. In general, X_n \xrightarrow{P} X means that for every \epsilon > 0, P\{|X_n - X| \ge \epsilon\} \to 0 as n \to \infty. Thus for large n, X_n is very likely to be very close to X. If X_n converges in probability to X, then X_n converges to X in distribution: if F_n is the distribution function of X_n and F is the distribution function of X, then F_n(x) \to F(x) at every x where F is continuous. To see that the continuity requirement is needed, look at Figure 7.1. In this example, X_n is uniformly distributed between 0 and 1/n, and X is identically 0. We have X_n \xrightarrow{P} 0 because P\{|X_n| \ge \epsilon\} is actually 0 for large n. However, F_n(x) \to F(x) for x \ne 0, but not at x = 0.

To prove that convergence in probability implies convergence in distribution:

    F_n(x) = P\{X_n \le x\} = P\{X_n \le x, X > x + \epsilon\} + P\{X_n \le x, X \le x + \epsilon\}
    \le P\{|X_n - X| \ge \epsilon\} + P\{X \le x + \epsilon\} = P\{|X_n - X| \ge \epsilon\} + F(x + \epsilon);

    F(x - \epsilon) = P\{X \le x - \epsilon\} = P\{X \le x - \epsilon, X_n > x\} + P\{X \le x - \epsilon, X_n \le x\}
    \le P\{|X_n - X| \ge \epsilon\} + P\{X_n \le x\} = P\{|X_n - X| \ge \epsilon\} + F_n(x).

Therefore

    F(x - \epsilon) - P\{|X_n - X| \ge \epsilon\} \le F_n(x) \le P\{|X_n - X| \ge \epsilon\} + F(x + \epsilon).

Since X_n converges in probability to X, we have P\{|X_n - X| \ge \epsilon\} \to 0 as n \to \infty. If F is continuous at x, then F(x - \epsilon) and F(x + \epsilon) approach F(x) as \epsilon \to 0. Thus F_n(x) is boxed between two quantities that can be made arbitrarily close to F(x), so F_n(x) \to F(x).

    7.5 Some Sufficient Conditions

In practice, P\{|X_n - X| \ge \epsilon\} may be difficult to compute, and it is useful to have sufficient conditions for convergence in probability that can often be easily checked.

(1) If E[(X_n - X)^2] \to 0 as n \to \infty, then X_n \xrightarrow{P} X.
(2) If E(X_n) \to E(X) and Var(X_n - X) \to 0, then X_n \xrightarrow{P} X.

Proof. The first statement follows from Chebyshev (b):

    P\{|X_n - X| \ge \epsilon\} \le \frac{E[(X_n - X)^2]}{\epsilon^2} \to 0.


To prove (2), note that

    E[(X_n - X)^2] = Var(X_n - X) + [E(X_n) - E(X)]^2 \to 0.

In this result, if X is identically equal to a constant c, then Var(X_n - X) is simply Var\,X_n. Condition (2) then becomes E(X_n) \to c and Var\,X_n \to 0, which implies that X_n converges in probability to c.

    7.6 An Application

In normal sampling, let S_n^2 be the sample variance based on n observations. Let's show that S_n^2 is a consistent estimate of the true variance \sigma^2, that is, S_n^2 \xrightarrow{P} \sigma^2. Since nS_n^2/\sigma^2 is \chi^2(n-1), we have E(nS_n^2/\sigma^2) = n-1 and Var(nS_n^2/\sigma^2) = 2(n-1). Thus E(S_n^2) = (n-1)\sigma^2/n \to \sigma^2 and Var(S_n^2) = 2(n-1)\sigma^4/n^2 \to 0, and the result follows.

[Figure 7.1: F_n(x), the distribution function of the uniform distribution on [0, 1/n], rising from 0 to 1 on [0, 1/n], and its limit F(x), the distribution function of the constant 0, which jumps from 0 to 1 at x = 0.]

    Problems

1. Let X_1, \ldots, X_n be independent, not necessarily identically distributed random variables. Assume that the X_i have finite means \mu_i and finite variances \sigma_i^2, and the variances are uniformly bounded, i.e., for some positive number M we have \sigma_i^2 \le M for all i. Show that (S_n - E(S_n))/n converges in probability to 0. This is a generalization of the weak law of large numbers. For if \mu_i = \mu and \sigma_i^2 = \sigma^2 for all i, then E(S_n) = n\mu, so (S_n/n) - \mu \xrightarrow{P} 0, i.e., S_n/n \xrightarrow{P} \mu.

2. Toss an unbiased coin once. If heads, write down the sequence 10101010\ldots, and if tails, write down the sequence 01010101\ldots. If X_n is the n-th term of the sequence and X = X_1, show that X_n converges to X in distribution but not in probability.


3. Let X_1, \ldots, X_n be iid with finite mean \mu and finite variance \sigma^2. Let \bar{X}_n be the sample mean (X_1 + \cdots + X_n)/n. Find the limiting distribution of \bar{X}_n, i.e., find a random variable X such that \bar{X}_n \xrightarrow{d} X.

4. Let X_n be uniformly distributed between n and n + 1. Show that X_n does not have a limiting distribution. Intuitively, the probability has run away to infinity.


    Lecture 8. The Central Limit Theorem

Intuitively, any random variable that can be regarded as the sum of a large number of small independent components is approximately normal. To formalize, we need the following result, stated without proof.

    8.1 Theorem

If Y_n has moment-generating function M_n, Y has moment-generating function M, and M_n(t) \to M(t) as n \to \infty for all t in some open interval containing the origin, then Y_n \xrightarrow{d} Y.

    8.2 Central Limit Theorem

Let X_1, X_2, \ldots be iid, each with finite mean \mu, finite variance \sigma^2, and moment-generating function M. Then

    Y_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}}

converges in distribution to a random variable that is normal (0,1). Thus for large n, \sum_{i=1}^n X_i is approximately normal.

We will give an informal sketch of the proof. The numerator of Y_n is \sum_{i=1}^n (X_i - \mu), and the random variables X_i - \mu are iid with mean 0 and variance \sigma^2. Thus we may assume without loss of generality that \mu = 0. We have

    M_{Y_n}(t) = E[e^{tY_n}] = E\left[ \exp\left( \frac{t}{\sigma\sqrt{n}}\sum_{i=1}^n X_i \right) \right].

The moment-generating function of \sum_{i=1}^n X_i is [M(t)]^n, so

    M_{Y_n}(t) = \left[ M\left( \frac{t}{\sigma\sqrt{n}} \right) \right]^n.

Now if the density of the X_i is f(x), then

    M\left( \frac{t}{\sigma\sqrt{n}} \right) = \int_{-\infty}^{\infty} \exp\left( \frac{tx}{\sigma\sqrt{n}} \right)f(x)\,dx
    = \int_{-\infty}^{\infty} \left( 1 + \frac{tx}{\sigma\sqrt{n}} + \frac{t^2x^2}{2!\,n\sigma^2} + \frac{t^3x^3}{3!\,n^{3/2}\sigma^3} + \cdots \right)f(x)\,dx
    = 1 + 0 + \frac{t^2}{2n} + \frac{t^3\mu_3}{3!\,n^{3/2}\sigma^3} + \frac{t^4\mu_4}{4!\,n^2\sigma^4} + \cdots

where \mu_k = E[(X_i - \mu)^k]. If we neglect the terms after t^2/2n we have, approximately,

    M_{Y_n}(t) = \left( 1 + \frac{t^2}{2n} \right)^n


which approaches the normal (0,1) moment-generating function e^{t^2/2} as n \to \infty. This argument is very loose but it can be made precise by some estimates based on Taylor's formula with remainder.
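A sketch of the theorem in action: standardized sums of iid uniform random variables (an arbitrary choice) are compared with the normal (0,1) distribution function.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 30
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and standard deviation of uniform(0,1)
x = rng.uniform(size=(200_000, n))
y = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

for c in (-1.0, 0.0, 1.0, 2.0):
    print(c, np.mean(y <= c), norm.cdf(c))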

We proved that if X_n converges in probability to X, then X_n converges in distribution to X. There is a partial converse.

    8.3 Theorem

If X_n converges in distribution to a constant c, then X_n converges in probability to c.

Proof. We estimate the probability that |X_n - c| \ge \epsilon, as follows.

    P\{|X_n - c| \ge \epsilon\} = P\{X_n \ge c + \epsilon\} + P\{X_n \le c - \epsilon\}
    = 1 - P\{X_n < c + \epsilon\} + P\{X_n \le c - \epsilon\}.

Now P\{X_n \le c + (\epsilon/2)\} \le P\{X_n < c + \epsilon\}, so

    P\{|X_n - c| \ge \epsilon\} \le 1 - P\{X_n \le c + (\epsilon/2)\} + P\{X_n \le c - \epsilon\}
    = 1 - F_n(c + (\epsilon/2)) + F_n(c - \epsilon)

where F_n is the distribution function of X_n. But as long as x \ne c, F_n(x) converges to the distribution function of the constant c, so F_n(x) \to 1 if x > c, and F_n(x) \to 0 if x < c. Therefore P\{|X_n - c| \ge \epsilon\} \to 1 - 1 + 0 = 0 as n \to \infty.

    8.4 Remarks

If Y is binomial (n, p), the normal approximation to the binomial allows us to regard Y as approximately normal with mean np and variance npq (with q = 1 - p). According to Box, Hunter and Hunter, Statistics for Experimenters, page 130, the approximation works well in practice if n > 5 and

    \frac{1}{\sqrt{n}}\left| \sqrt{\frac{q}{p}} - \sqrt{\frac{p}{q}} \right| < .3.

If, for example, we wish to estimate the probability that Y = 50 or 51 or 52, we may write this probability as P\{49.5 < Y < 52.5\}, and then evaluate as if Y were normal with mean np and variance np(1 - p). This turns out to be slightly more accurate in practice than using P\{50 \le Y \le 52\}.
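The example can be computed directly (a sketch; n = 100 and p = 1/2 are hypothetical values satisfying the rule of thumb above):

import numpy as np
from scipy.stats import norm, binom

n, p = 100, 0.5                      # hypothetical values, so np = 50
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.cdf(52, n, p) - binom.cdf(49, n, p)              # P{Y = 50, 51 or 52}
with_correction = norm.cdf((52.5 - mu) / sd) - norm.cdf((49.5 - mu) / sd)
without = norm.cdf((52 - mu) / sd) - norm.cdf((50 - mu) / sd)
print(exact, with_correction, without)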

    8.5 Simulation

Most computers can simulate a random variable that is uniformly distributed between 0 and 1. But what if we need a random variable with an arbitrary distribution function F? For example, how would we simulate the random variable with the distribution function of Figure 8.1? The basic idea is illustrated in Figure 8.2. If Y = F(X) where X has the


continuous distribution function F, then Y is uniformly distributed on [0,1]. (In Figure 8.2 we have, for 0 \le y \le 1, P\{Y \le y\} = P\{X \le x\} = F(x) = y.) Thus if X is uniformly distributed on [0,1] and we want Y to have distribution function F, we set X = F(Y), Y = F^{-1}(X).

In Figure 8.1 we must be more precise:

Case 1. 0 \le X \le .3. Let X = (3/70)Y + (15/70), Y = (70X - 15)/3.
Case 2. .3 \le X \le .8. Let Y = 4, so P\{Y = 4\} = .5 as required.
Case 3. .8 \le X \le 1. Let X = (1/10)Y + (4/10), Y = 10X - 4.

In Figure 8.1, replace the F(y)-axis by an x-axis to visualize X versus Y. If y = y_0 corresponds to x = x_0 [i.e., x_0 = F(y_0)], then

    P\{Y \le y_0\} = P\{X \le x_0\} = x_0 = F(y_0)

as desired.
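The same idea in code, for a distribution whose inverse has a closed form (the exponential, used here purely as an illustration): feeding uniform [0,1] samples through F^{-1} produces output with distribution function F.

import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(size=200_000)

# For the exponential, F(y) = 1 - e^{-y} (y >= 0), so F^{-1}(x) = -ln(1 - x).
y = -np.log(1 - x)

for y0 in (0.5, 1.0, 2.0):
    print(np.mean(y <= y0), 1 - np.exp(-y0))   # empirical vs. F(y0)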

[Figure 8.1: a distribution function F(y) that rises linearly as (3/70)y + (15/70) from 0 at y = -5 to .3 at y = 2, stays at .3 up to y = 4, jumps to .8 at y = 4, and then rises linearly as (1/10)y + (4/10) to 1 at y = 6.]

[Figure 8.2: the graph of Y = F(X); a point x on the horizontal axis corresponds to y = F(x) in [0, 1] on the vertical axis.]

    Problems

1. Let X_n be gamma (n, \beta), i.e., X_n has the gamma distribution with parameters n and \beta. Show that X_n is a sum of n independent exponential random variables, and from this derive the limiting distribution of X_n/n.

2. Show that \chi^2(n) is approximately normal for large n (with mean n and variance 2n).


3. Let X_1, \ldots, X_n be iid with density f. Let Y_n be the number of observations that fall into the interval (a, b). Indicate how to use a normal approximation to calculate probabilities involving Y_n.

4. If we have 3 observations 6.45, 3.14, 4.93, and we round off to the nearest integer, we get 6, 3, 5. The sum of the integers is 14, but the actual sum is 14.52. Let X_i, i = 1, \ldots, n be the round-off error of the i-th observation, and assume that the X_i are iid and uniformly distributed on (-1/2, 1/2). Indicate how to use a normal approximation to calculate probabilities involving the total round-off error Y_n = \sum_{i=1}^n X_i.

5. Let X_1, \ldots, X_n be iid with continuous distribution function F, and let Y_1 < \cdots < Y_n be the order statistics of the X_i. Then F(X_1), \ldots, F(X_n) are iid and uniformly distributed on [0,1] (see the discussion of simulation), with order statistics F(Y_1), \ldots, F(Y_n). Show that n(1 - F(Y_n)) converges in distribution to an exponential random variable.


    Lecture 9. Estimation

    9.1 Introduction

In effect the statistician plays a game against nature, who first chooses the state of nature \theta (a number or k-tuple of numbers in the usual case) and performs a random experiment. We do not know \theta but we are allowed to observe the value of a random variable (or random vector) X, called the observable, with density f_\theta(x).

After observing X = x we estimate \theta by \hat\theta(x), which is called a point estimate because it produces a single number which we hope is close to \theta. The main alternative is an interval estimate or confidence interval, which will be discussed in Lectures 10 and 11.

For a point estimate \hat\theta(x) to make sense physically, it must depend only on x, not on the unknown parameter \theta. There are many possible estimates, and there are no general rules for choosing a best estimate. Some practical considerations are:

(a) How much does it cost to collect the data?
(b) Is the performance of the estimate easy to measure, for example, can we compute P\{|\hat\theta(x) - \theta| < \epsilon\}?
(c) Are the advantages of the estimate appropriate for the problem at hand?

    We will study several estimation methods:

    1. Maximum likelihood estimates.

These estimates usually have highly desirable theoretical properties (consistency), and are frequently not difficult to compute.

2. Confidence intervals.

These estimates have a very useful practical feature. We construct an interval from the data, and we will know the probability that our (random) interval actually contains the unknown (but fixed) parameter.

3. Uniformly minimum variance unbiased estimates (UMVUEs).

Mathematical theory generates a large number of examples of these, but as we know, a biased estimate can sometimes be superior.

4. Bayes estimates.

These estimates are appropriate if it is reasonable to assume that the state of nature \theta is a random variable with a known density.

In general, statistical theory produces many reasonable candidates, and practical experience will dictate the choice in a given physical situation.

    9.2 Maximum Likelihood Estimates

We choose \hat\theta(x) = \hat\theta, a value of \theta that makes what we have observed as likely as possible. In other words, let \hat\theta maximize the likelihood function L(\theta) = f_\theta(x), with x fixed. This corresponds to basic statistical philosophy; if what we have observed is more likely under \theta_2 than under \theta_1, we prefer \theta_2 to \theta_1.


    9.3 Example

Let X be binomial (n, \theta). Then the probability that X = x when the true parameter is \theta is

    f_\theta(x) = \binom{n}{x}\theta^x(1 - \theta)^{n-x}, \quad x = 0, 1, \ldots, n.

Maximizing f_\theta(x) is equivalent to maximizing \ln f_\theta(x):

    \frac{\partial}{\partial\theta}\ln f_\theta(x) = \frac{\partial}{\partial\theta}[x\ln\theta + (n - x)\ln(1 - \theta)] = \frac{x}{\theta} - \frac{n - x}{1 - \theta} = 0.

Thus x - \theta x - \theta n + \theta x = 0, so \hat\theta = X/n, the relative frequency of success.

Notation: \hat\theta will be written in terms of random variables, in this case X/n rather than x/n. Thus \hat\theta is itself a random variable.

We have E(\hat\theta) = n\theta/n = \theta, so \hat\theta is unbiased. By the weak law of large numbers, \hat\theta \xrightarrow{P} \theta, i.e., \hat\theta is consistent.
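The calculus can be double-checked numerically (a sketch; the observed x and the value of n are hypothetical): the log-likelihood x\ln\theta + (n-x)\ln(1-\theta) is maximized over a grid, and the maximizer agrees with x/n.

import numpy as np

n, x = 20, 7                                     # hypothetical observed data
theta = np.linspace(0.001, 0.999, 9999)
loglik = x * np.log(theta) + (n - x) * np.log(1 - theta)
print(theta[np.argmax(loglik)], x / n)           # both close to 0.35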

    9.4 Example

Let X_1, \ldots, X_n be iid, normal (\mu, \sigma^2), \theta = (\mu, \sigma^2). Then, with x = (x_1, \ldots, x_n),

    f_\theta(x) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n \exp\left[ -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} \right]

and

    \ln f_\theta(x) = -\frac{n}{2}\ln 2\pi - n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2;

    \frac{\partial}{\partial\mu}\ln f_\theta(x) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0, \quad \sum_{i=1}^n x_i - n\mu = 0, \quad \hat\mu = \bar{x};

    \frac{\partial}{\partial\sigma}\ln f_\theta(x) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{\sigma^3}\left[ -\sigma^2 + \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 \right] = 0

with \mu = \bar{x}. Thus

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = s^2.

Case 1. \mu and \sigma^2 are both unknown. Then \hat\theta = (\bar{X}, S^2).

Case 2. \sigma^2 is known. Then \theta = \mu and \hat\theta = \bar{X} as above. (Differentiation with respect to \sigma is omitted.)


Case 3. \mu is known. Then \theta = \sigma^2 and the equation (\partial/\partial\sigma)\ln f_\theta(x) = 0 becomes

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2

so

    \hat\theta = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2.

The sample mean \bar{X} is an unbiased and (by the weak law of large numbers) consistent estimate of \mu. The sample variance S^2 is a biased but consistent estimate of \sigma^2 (see Lectures 4 and 7).

    Notation: We will abbreviate maximum likelihood estimate by MLE.

    9.5 The MLE of a Function of Theta

Suppose that for a fixed x, f_\theta(x) is a maximum when \theta = \theta_0. Then the value of \theta^2 when f_\theta(x) is a maximum is \theta_0^2. Thus to get the MLE of \theta^2, we simply square the MLE of \theta. In general, if h is any function, then \widehat{h(\theta)} = h(\hat\theta). If h is continuous, then consistency is preserved, in other words:

    If h is continuous and \hat\theta \xrightarrow{P} \theta, then h(\hat\theta) \xrightarrow{P} h(\theta).

Proof. Given \epsilon > 0, there exists \delta > 0 such that if |\hat\theta - \theta| < \delta, then |h(\hat\theta) - h(\theta)| < \epsilon. Consequently,

    P\{|h(\hat\theta) - h(\theta)| \ge \epsilon\} \le P\{|\hat\theta - \theta| \ge \delta\} \to 0 \text{ as } n \to \infty.

(To justify the above inequality, note that if the occurrence of an event A implies the occurrence of an event B, then P(A) \le P(B).)

    9.6 The Method of Moments

This is sometimes a quick way to obtain reasonable estimates. We set the observed k-th moment n^{-1}\sum_{i=1}^n x_i^k equal to the theoretical k-th moment E(X_i^k) (which will depend on the unknown parameter). Or we set the observed k-th central moment n^{-1}\sum_{i=1}^n (x_i - \bar{x})^k equal to the theoretical k-th central moment E[(X_i - \mu)^k]. For example, let X_1, \ldots, X_n be iid, gamma with \alpha = \theta_1, \beta = \theta_2, with \theta_1, \theta_2 > 0. Then E(X_i) = \mu = \theta_1\theta_2 and Var\,X_i = \sigma^2 = \theta_1\theta_2^2 (see Lecture 3). We set

    \bar{X} = \theta_1\theta_2, \quad S^2 = \theta_1\theta_2^2

and solve to get estimates \hat\theta_i of \theta_i, i = 1, 2, namely

    \hat\theta_2 = \frac{S^2}{\bar{X}}, \quad \hat\theta_1 = \frac{\bar{X}}{\hat\theta_2} = \frac{\bar{X}^2}{S^2}.
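A sketch of these method-of-moments estimates applied to simulated gamma data (true \theta_1, \theta_2 chosen arbitrarily); S^2 uses the divisor n, as in Lecture 4.

import numpy as np

rng = np.random.default_rng(11)
theta1, theta2 = 3.0, 2.0                    # true alpha and beta
x = rng.gamma(shape=theta1, scale=theta2, size=50_000)

xbar, s2 = x.mean(), x.var()                 # sample mean and sample variance (divisor n)
theta2_hat = s2 / xbar
theta1_hat = xbar**2 / s2
print(theta1_hat, theta2_hat)                # close to 3 and 2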


    Problems

1. In this problem, X_1, \ldots, X_n are iid with density f_\theta(x) or probability function p_\theta(x), and you are asked to find the MLE of \theta.

(a) Poisson (\theta), \theta > 0.

(b) f_\theta(x) = \theta x^{\theta - 1}, 0 < x < 1, where \theta > 0. The probability is concentrated near the origin when \theta < 1.

(c) Exponential with parameter \theta, i.e., f_\theta(x) = (1/\theta)e^{-x/\theta}, x > 0, where \theta > 0.

(d) f_\theta(x) = (1/2)e^{-|x - \theta|}, where \theta and x are arbitrary real numbers.

(e) Translated exponential, i.e., f_\theta(x) = e^{-(x - \theta)}, where \theta is an arbitrary real number and x \ge \theta.

2. Let X_1, \ldots, X_n be iid, each uniformly distributed between \theta - (1/2) and \theta + (1/2). Find more than one MLE of \theta (so MLEs are not necessarily unique).

3. In each part of Problem 1, calculate E(X_i) and derive an estimate based on the method of moments by setting the sample mean equal to the true mean. In each case, show that the estimate is consistent.

4. Let X be exponential with parameter \theta, as in Problem 1(c). If r > 0, find the MLE of P\{X > r\}.

5. If X is binomial (n, \theta) and a and b are integers with 0 \le a \le b \le n, find the MLE of P\{a \le X \le b\}.


    Lecture 10. Confidence Intervals

    10.1 Predicting an Election

There are two candidates A and B. If a voter is selected at random, the probability that the voter favors A is p, where p is fixed but unknown. We select n voters independently and ask their preference.

The number Y_n of A voters is binomial (n, p), which (for sufficiently large n) is approximately normal with \mu = np and \sigma^2 = np(1-p). The relative frequency of A voters is Y_n/n. We wish to estimate the minimum value of n such that we can predict A's percentage of the vote within 1 percent, with 95 percent confidence. Thus we want

    P\left\{ \left| \frac{Y_n}{n} - p \right| < .01 \right\} > .95.

Note that |(Y_n/n) - p| < .01 means that p is within .01 of Y_n/n. So this inequality can be written as

    \frac{Y_n}{n} - .01 < p < \frac{Y_n}{n} + .01.

Thus the probability that the random interval I_n = ((Y_n/n) - .01, (Y_n/n) + .01) contains the true probability p is greater than .95. We say that I_n is a 95 percent confidence interval for p.

In general, we find confidence intervals by calculating or estimating the probability of the event that is to occur with the desired level of confidence. In this case,

    P\left\{ \left| \frac{Y_n}{n} - p \right| < .01 \right\} = P\{|Y_n - np| < .01n\} = P\left\{ \frac{|Y_n - np|}{\sqrt{np(1-p)}} < \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} \right\}

and this is approximately

    \Phi\left( \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} \right) - \Phi\left( \frac{-.01\sqrt{n}}{\sqrt{p(1-p)}} \right) = 2\Phi\left( \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} \right) - 1 > .95

where \Phi is the normal (0,1) distribution function. Since 1.95/2 = .975 and \Phi(1.96) = .975, we have

    \frac{.01\sqrt{n}}{\sqrt{p(1-p)}} > 1.96, \quad n > (196)^2\,p(1-p).

But (by calculus) p(1-p) is maximized when 1 - 2p = 0, p = 1/2, p(1-p) = 1/4. Thus n > (196)^2/4 = (98)^2 = (100 - 2)^2 = 10000 - 400 + 4 = 9604.

If we want to get within one tenth of one percent (.001) of p with 99 percent confidence, we repeat the above analysis with .01 replaced by .001, 1.99/2 = .995 and \Phi(2.6) = .995. Thus

    \frac{.001\sqrt{n}}{\sqrt{p(1-p)}} > 2.6, \quad n > (2600)^2/4 = (1300)^2 = 1{,}690{,}000.


To get within 3 percent with 95 percent confidence, we have

.03√n/√(p(1 − p)) > 1.96,   n > (196/3)²(1/4) ≈ 1067.
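The three sample-size calculations above are easy to mechanize. The following Python sketch is my own (not from the notes); it uses scipy for the normal quantile and the worst case p(1 − p) = 1/4, so small differences from the text come from the text rounding z to 1.96 or 2.6.

import math
from scipy.stats import norm

def min_sample_size(margin, confidence):
    """Smallest n with 2*Phi(margin*sqrt(n)/sqrt(p(1-p))) - 1 > confidence, worst case p(1-p) = 1/4."""
    z = norm.ppf((1 + confidence) / 2)     # e.g. about 1.96 for 95 percent confidence
    return math.ceil((z / (2 * margin)) ** 2)

print(min_sample_size(0.01, 0.95))    # 9604
print(min_sample_size(0.001, 0.99))   # about 1.66 million (the text rounds z up to 2.6, giving 1,690,000)
print(min_sample_size(0.03, 0.95))    # 1068 (the text reports 1067 before rounding up)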

If the experiment is repeated independently a large number of times, it is very likely that our result will be within .03 of the true probability p at least 95 percent of the time. The usual statement "The margin of error of this poll is 3%" does not capture this idea.

Note that the accuracy of the prediction depends only on the number of voters polled and not on the total number of voters in the population. But the model assumes sampling with replacement. (Theoretically, the same voter can be polled more than once since the voters are selected independently.) In practice, sampling is done without replacement, but if the number n of voters polled is small relative to the population size N, the error is very small.

The normal approximation to the binomial (based on the central limit theorem) is quite reliable, and is used in practice even for modest values of n; see (8.4).

    10.2 Estimating the Mean of a Normal Population

Let X1, . . . , Xn be iid, each normal (θ, σ²). We will find a confidence interval for θ.

Case 1. The variance σ² is known. Then X̄ is normal (θ, σ²/n), so

(X̄ − θ)/(σ/√n) is normal (0,1),

hence

P{ −b < √n(X̄ − θ)/σ < b } = Φ(b) − Φ(−b) = 2Φ(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bσ/√n < θ < X̄ + bσ/√n.

We choose a symmetrical interval to minimize the length, because the normal density with zero mean is symmetric about 0. The desired confidence level determines b, which then determines the confidence interval.
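For concreteness, here is a minimal Python sketch of Case 1 (my own illustration; the function name, the data and the 95 percent level are arbitrary choices):

import math
from scipy.stats import norm

def mean_ci_known_sigma(x, sigma, confidence=0.95):
    """Confidence interval for the mean theta when the variance sigma^2 is known."""
    n = len(x)
    xbar = sum(x) / n
    b = norm.ppf((1 + confidence) / 2)        # 2*Phi(b) - 1 = confidence
    half = b * sigma / math.sqrt(n)
    return xbar - half, xbar + half

print(mean_ci_known_sigma([4.1, 5.3, 4.7, 5.0, 4.4], sigma=0.5))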

Case 2. The variance σ² is unknown. Recall from (5.1) that

(X̄ − θ)/(S/√(n − 1)) is T(n − 1),

hence

P{ −b < (X̄ − θ)/(S/√(n − 1)) < b } = 2F_T(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bS/√(n − 1) < θ < X̄ + bS/√(n − 1).
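A companion sketch for Case 2 (again my own illustration). The text's S² has n in the denominator, so the half-width is bS/√(n − 1); scipy's t.ppf supplies b.

import math
from scipy.stats import t

def mean_ci_unknown_sigma(x, confidence=0.95):
    """Confidence interval for the mean theta when the variance is unknown."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / n)   # S with n in the denominator
    b = t.ppf((1 + confidence) / 2, df=n - 1)            # 2*F_T(b) - 1 = confidence
    half = b * s / math.sqrt(n - 1)
    return xbar - half, xbar + half

print(mean_ci_unknown_sigma([4.1, 5.3, 4.7, 5.0, 4.4]))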


    10.3 A Correction Factor When Sampling Without Replacement

The following results will not be used and may be omitted, but it is interesting to measure quantitatively the effect of sampling without replacement. In the election prediction problem, let Xi be the indicator of success (i.e., selecting an A voter) on trial i. Then P{Xi = 1} = p and P{Xi = 0} = 1 − p. If sampling is done with replacement, then the Xi are independent and the total number X = X1 + · · · + Xn of A voters in the sample is binomial (n, p). Thus the variance of X is np(1 − p). However, if sampling is done without replacement, then in effect we are drawing n balls from an urn containing N balls (where N is the size of the population), with Np balls labeled A and N(1 − p) labeled B. Recall from basic probability theory that

Var X = Σ_{i=1}^n Var Xi + 2 Σ_{i<j} Cov(Xi, Xj).

Carrying out this computation for sampling without replacement gives Var X = np(1 − p)(N − n)/(N − 1), so the with-replacement variance np(1 − p) is multiplied by a correction factor (N − n)/(N − 1), which is close to 1 when n is small relative to N.


    Problems

1. In the normal case [see (10.2)], assume that σ² is known. Explain how to compute the length of the confidence interval for θ.

2. Continuing Problem 1, assume that σ² is unknown. Explain how to compute the length of the confidence interval for θ, in terms of the sample standard deviation S.

3. Continuing Problem 2, explain how to compute the expected length of the confidence interval for θ, in terms of the unknown standard deviation σ. (Note that when σ is unknown, we expect a larger interval since we have less information.)

4. Let X1, . . . , Xn be iid, each gamma with parameters α and β. If α is known, explain how to compute a confidence interval for the mean μ = αβ.

5. In the binomial case [see (10.1)], suppose we specify the level of confidence and the length of the confidence interval. Explain how to compute the minimum value of n.


    Lecture 11. More Confidence Intervals

    11.1 Differences of Means

Let X1, . . . , Xn be iid, each normal (μ1, σ²), and let Y1, . . . , Ym be iid, each normal (μ2, σ²). Assume that (X1, . . . , Xn) and (Y1, . . . , Ym) are independent. We will construct a confidence interval for μ1 − μ2. In practice, the interval is often used in the following way. If the interval lies entirely to the left of 0, we have reason to believe that μ1 < μ2.

Since Var(X̄ − Ȳ) = Var X̄ + Var Ȳ = (σ²/n) + (σ²/m),

(X̄ − Ȳ − (μ1 − μ2)) / (σ√(1/n + 1/m)) is normal (0,1).

Also, nS1²/σ² is χ²(n − 1) and mS2²/σ² is χ²(m − 1). But χ²(r) is the sum of squares of r independent normal (0,1) random variables, so

nS1²/σ² + mS2²/σ² is χ²(n + m − 2).

Thus if

R = √[ ((nS1² + mS2²)/(n + m − 2)) (1/n + 1/m) ]

then

T = (X̄ − Ȳ − (μ1 − μ2)) / R is T(n + m − 2).

Our assumption that both populations have the same variance σ² is crucial, because the unknown variance can be cancelled.

If P{−b < T < b} = .95 we get a 95 percent confidence interval for μ1 − μ2:

−b < (X̄ − Ȳ − (μ1 − μ2)) / R < b

or

(X̄ − Ȳ) − bR < μ1 − μ2 < (X̄ − Ȳ) + bR.

If the variances σ1² and σ2² are known but possibly unequal, then

(X̄ − Ȳ − (μ1 − μ2)) / √(σ1²/n + σ2²/m)

is normal (0,1). If R0 is the denominator of the above fraction, we can get a 95 percent confidence interval as before: Φ(b) − Φ(−b) = 2Φ(b) − 1 > .95,

(X̄ − Ȳ) − bR0 < μ1 − μ2 < (X̄ − Ȳ) + bR0.
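A minimal Python sketch of the pooled interval for μ1 − μ2 (my own illustration, with made-up data); it follows the text's convention that S1² and S2² have n and m in their denominators.

import math
from scipy.stats import t

def diff_of_means_ci(x, y, confidence=0.95):
    """Confidence interval for mu1 - mu2 under equal (unknown) variances."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    nS1sq = sum((v - xbar) ** 2 for v in x)      # n * S1^2
    mS2sq = sum((v - ybar) ** 2 for v in y)      # m * S2^2
    R = math.sqrt((nS1sq + mS2sq) / (n + m - 2) * (1 / n + 1 / m))
    b = t.ppf((1 + confidence) / 2, df=n + m - 2)
    d = xbar - ybar
    return d - b * R, d + b * R

print(diff_of_means_ci([5.1, 4.9, 5.4, 5.0], [4.2, 4.6, 4.4, 4.8, 4.5]))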


    11.2 Example

Let Y1 and Y2 be binomial (n1, p1) and (n2, p2) respectively. Then

Y1 = X1 + · · · + Xn1 and Y2 = Z1 + · · · + Zn2

where the Xi and Zj are indicators of success on trials i and j respectively. Assume that X1, . . . , Xn1, Z1, . . . , Zn2 are independent. Now E(Y1/n1) = p1 and Var(Y1/n1) = n1p1(1 − p1)/n1² = p1(1 − p1)/n1, with similar formulas for Y2/n2. Thus for large n,

Y1/n1 − Y2/n2 − (p1 − p2)

divided by

√( p1(1 − p1)/n1 + p2(1 − p2)/n2 )

is approximately normal (0,1). But this expression cannot be used to construct confidence intervals for p1 − p2 because the denominator involves the unknown quantities p1 and p2. However, Y1/n1 converges in probability to p1 and Y2/n2 converges in probability to p2, and this justifies replacing p1 by Y1/n1 and p2 by Y2/n2 in the denominator.
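A hedged sketch of the resulting approximate interval for p1 − p2 (my own code; the observed counts are made up):

import math
from scipy.stats import norm

def diff_of_proportions_ci(y1, n1, y2, n2, confidence=0.95):
    """Approximate CI for p1 - p2, with p1 and p2 replaced by Y1/n1 and Y2/n2 in the denominator."""
    p1hat, p2hat = y1 / n1, y2 / n2
    se = math.sqrt(p1hat * (1 - p1hat) / n1 + p2hat * (1 - p2hat) / n2)
    b = norm.ppf((1 + confidence) / 2)
    d = p1hat - p2hat
    return d - b * se, d + b * se

print(diff_of_proportions_ci(y1=520, n1=1000, y2=480, n2=1000))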

    11.3 The Variance

We will construct confidence intervals for the variance of a normal population. Let X1, . . . , Xn be iid, each normal (μ, σ²), so that nS²/σ² is χ²(n − 1). If h_{n−1} is the χ²(n − 1) density and a and b are chosen so that ∫_a^b h_{n−1}(x) dx = 1 − α, then

P{ a < nS²/σ² < b } = 1 − α.

But a < nS²/σ² < b if and only if nS²/b < σ² < nS²/a, so (nS²/b, nS²/a) is a confidence interval for σ² with confidence level 1 − α. If the mean μ is known, then Σ_{i=1}^n (Xi − μ)²/σ² is χ²(n).


So if

W = Σ_{i=1}^n (Xi − μ)²

and we choose a and b so that ∫_a^b h_n(x) dx = 1 − α, where h_n is the χ²(n) density, then P{a < W/σ² < b} = 1 − α. The inequality defining the confidence interval can be written as

W/b < σ² < W/a.
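Going back to the unknown-mean interval (nS²/b, nS²/a) derived at the start of this section, here is a minimal Python sketch (my own; a and b are chosen with equal tail areas, which is one possible choice, not the only one):

from scipy.stats import chi2

def variance_ci(x, confidence=0.95):
    """CI for sigma^2 from n*S^2/sigma^2 ~ chi-square(n - 1), with equal-tailed a and b."""
    n = len(x)
    xbar = sum(x) / n
    nS2 = sum((v - xbar) ** 2 for v in x)        # n * S^2
    alpha = 1 - confidence
    a = chi2.ppf(alpha / 2, df=n - 1)
    b = chi2.ppf(1 - alpha / 2, df=n - 1)
    return nS2 / b, nS2 / a

print(variance_ci([4.1, 5.3, 4.7, 5.0, 4.4, 5.2, 4.8]))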

Lecture 12. Hypothesis Testing

We test a null hypothesis H0 : θ ∈ A0 against an alternative H1 : θ ∈ A1, where θ is an unknown parameter and A0, A1 are disjoint sets of possible parameter values. For example, if θ measures the improvement produced by a new drug, then θ > 0 might mean that the drug is a significant improvement.

We observe x and make a decision via φ(x) = 0 or 1. There are two types of errors. A type 1 error occurs if H0 is true but φ(x) = 1, in other words, we declare that H1 is true. Thus in a type 1 error, we reject H0 when it is true.

A type 2 error occurs if H0 is false but φ(x) = 0, i.e., we declare that H0 is true. Thus in a type 2 error, we accept H0 when it is false.

If H0 [resp. H1] means that a patient does not have [resp. does have] a particular disease, then a type 1 error is also called a false positive, and a type 2 error is also called a false negative.

If φ(x) is always 0, then a type 1 error can never occur, but a type 2 error will always occur. Symmetrically, if φ(x) is always 1, then there will always be a type 1 error, but never an error of type 2. Thus by ignoring the data altogether we can reduce one of the error probabilities to zero. To get both error probabilities to be small, in practice we must increase the sample size.

We say that H0 [resp. H1] is simple if A0 [resp. A1] contains only one element, composite if A0 [resp. A1] contains more than one element. So in the case of simple hypothesis vs. simple alternative, we are testing θ = θ0 vs. θ = θ1. The standard example is to test the hypothesis that X has density f0 vs. the alternative that X has density f1.

    12.2 Likelihood Ratio Tests

In the case of simple hypothesis vs. simple alternative, if we require that the probability of a type 1 error be at most α and try to minimize the probability of a type 2 error, the optimal test turns out to be a likelihood ratio test (LRT), defined as follows. Let L(x), the likelihood ratio, be f1(x)/f0(x), and let λ be a constant. If L(x) > λ, reject H0; if L(x) < λ, accept H0; if L(x) = λ, do anything.

Intuitively, if what we have observed seems significantly more likely under H1, we will tend to reject H0. If H0 or H1 is composite, there is no general optimality result as there is in the simple vs. simple case. In this situation, we resort to basic statistical philosophy:

If, assuming that H0 is true, we witness a rare event, we tend to reject H0.

The statement that LRTs are optimal is the Neyman-Pearson lemma, to be proved at the end of the lecture. In many common examples (normal, Poisson, binomial, exponential), L(x1, . . . , xn) can be expressed as a function of the sum of the observations, or equivalently as a function of the sample mean. This motivates consideration of tests based on Σ_{i=1}^n Xi or on X̄.


    12.3 Example

Let X1, . . . , Xn be iid, each normal (θ, σ²). We will test H0 : θ ≤ θ0 vs. H1 : θ > θ0. Under H1, X̄ will tend to be larger, so let's reject H0 when X̄ > c. The power function of the test is defined by

K(θ) = Pθ{reject H0},

the probability of rejecting the null hypothesis when the true parameter is θ. In this case,

Pθ{X̄ > c} = Pθ{ (X̄ − θ)/(σ/√n) > (c − θ)/(σ/√n) } = 1 − Φ( (c − θ)/(σ/√n) )

(see Figure 12.1). Suppose that we specify the probability α of a type 1 error when θ = θ1, and the probability β of a type 2 error when θ = θ2. Then

K(θ1) = 1 − Φ( (c − θ1)/(σ/√n) ) = α

and

K(θ2) = 1 − Φ( (c − θ2)/(σ/√n) ) = 1 − β.

If α, β, σ, θ1 and θ2 are known, we have two equations that can be solved for c and n.

Figure 12.1 (the power function K(θ), increasing toward 1, with θ1, θ0 and θ2 marked on the θ-axis)

The critical region is the set of observations that lead to rejection. In this case, it is {(x1, . . . , xn) : (1/n) Σ_{i=1}^n xi > c}.

The significance level is the largest type 1 error probability. Here it is K(θ0), since K(θ) increases with θ.
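The two equations for c and n can be solved in closed form: subtracting them gives √n = σ(z_{1−α} + z_{1−β})/(θ2 − θ1), where Φ(z_q) = q. A minimal Python sketch (my own, with made-up values of α, β, σ, θ1, θ2):

import math
from scipy.stats import norm

def solve_c_and_n(alpha, beta, sigma, theta1, theta2):
    """Solve K(theta1) = alpha and K(theta2) = 1 - beta for the test that rejects when X-bar > c."""
    za = norm.ppf(1 - alpha)                     # Phi(za) = 1 - alpha
    zb = norm.ppf(1 - beta)                      # Phi(zb) = 1 - beta
    n = math.ceil((sigma * (za + zb) / (theta2 - theta1)) ** 2)
    c = theta1 + za * sigma / math.sqrt(n)       # from K(theta1) = alpha
    return c, n

# made-up values: alpha = .05 at theta1 = 0, beta = .10 at theta2 = 1, sigma = 2
print(solve_c_and_n(0.05, 0.10, 2.0, 0.0, 1.0))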

    12.4 Example

Let H0 : X is uniformly distributed on (0,1), so f0(x) = 1, 0 < x < 1, and 0 elsewhere. Let H1 : f1(x) = 3x², 0 < x < 1, and 0 elsewhere. We take only one observation, and reject H0 if x > c, where 0 < c < 1. Then

K(0) = P0{X > c} = 1 − c,   K(1) = P1{X > c} = ∫_c^1 3x² dx = 1 − c³.


If we specify the probability α of a type 1 error, then α = 1 − c, which determines c. If β is the probability of a type 2 error, then 1 − β = 1 − c³, so β = c³. Thus (see Figure 12.2)

β = (1 − α)³.

If α = .05 then β = (.95)³ ≈ .86, which indicates that you usually can't do too well with only one observation.

Figure 12.2 (β = (1 − α)³ as a function of α on (0,1))

    12.5 Tests Derived From Confidence Intervals

Let X1, . . . , Xn be iid, each normal (θ0, σ²). In Lecture 10, we found a confidence interval for θ0, assuming σ² unknown, via

P{ −b < (X̄ − θ0)/(S/√(n − 1)) < b } = 2F_T(b) − 1, where T = (X̄ − θ0)/(S/√(n − 1))

has the T distribution with n − 1 degrees of freedom.

Say 2F_T(b) − 1 = .95, so that

P{ |X̄ − θ0|/(S/√(n − 1)) ≥ b } = .05.

If θ actually equals θ0, we are witnessing an event of low probability. So it is natural to test θ = θ0 vs. θ ≠ θ0 by rejecting if

|X̄ − θ0|/(S/√(n − 1)) ≥ b,

in other words, if θ0 does not belong to the confidence interval. As the true mean θ moves away from θ0 in either direction, the probability of this event will increase, since X̄ − θ0 = (X̄ − θ) + (θ − θ0).

Tests of θ = θ0 vs. θ ≠ θ0 are called two-sided, as opposed to θ = θ0 vs. θ > θ0 (or θ = θ0 vs. θ < θ0), which are one-sided. In the present case, if we test θ = θ0 vs. θ > θ0, we reject if

(X̄ − θ0)/(S/√(n − 1)) ≥ b.

The power function K(θ) is difficult to compute for θ ≠ θ0, because (X̄ − θ0)/(σ/√n) no longer has mean zero. The noncentral T distribution becomes involved.
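A minimal sketch of the resulting two-sided test (my own code); it is equivalent to checking whether θ0 lies outside the confidence interval.

import math
from scipy.stats import t

def two_sided_t_test(x, theta0, level=0.05):
    """Reject theta = theta0 if |X-bar - theta0| / (S / sqrt(n - 1)) >= b, with 2*F_T(b) - 1 = 1 - level."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / n)   # S with n in the denominator
    stat = abs(xbar - theta0) / (s / math.sqrt(n - 1))
    b = t.ppf(1 - level / 2, df=n - 1)
    return stat >= b                                      # True means reject H0

print(two_sided_t_test([5.1, 4.9, 5.6, 5.3, 5.4, 5.2], theta0=5.0))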


    12.6 The Neyman-Pearson Lemma

Assume that we are testing the simple hypothesis that X has density f0 vs. the simple alternative that X has density f1. Let φ be an LRT with parameter λ (a nonnegative constant); in other words, φ(x) is the probability of rejecting H0 when x is observed, and

φ(x) = 1 if L(x) > λ,
φ(x) = 0 if L(x) < λ,
φ(x) = anything if L(x) = λ.

Suppose that the probability of a type 1 error using φ is α, and the probability of a type 2 error is β. Let φ* be an arbitrary test with error probabilities α* and β*. If α* ≤ α then β* ≥ β. In other words, the LRT has maximum power among all tests at significance level α.

Proof. We are going to assume that f0 and f1 are one-dimensional, but the argument works equally well when X = (X1, . . . , Xn) and the fi are n-dimensional joint densities. We recall from basic probability theory the theorem of total probability, which says that if X has density f, then for any event A,

P(A) = ∫ P(A | X = x) f(x) dx.

A companion theorem which we will also use later is the theorem of total expectation, which says that if X has density f, then for any random variable Y,

E(Y) = ∫ E(Y | X = x) f(x) dx.

By the theorem of total probability,

α = ∫ φ(x)f0(x) dx,   1 − β = ∫ φ(x)f1(x) dx

and similarly

α* = ∫ φ*(x)f0(x) dx,   1 − β* = ∫ φ*(x)f1(x) dx.

We claim that for all x,

[φ(x) − φ*(x)][f1(x) − λf0(x)] ≥ 0.

For if f1(x) > λf0(x) then L(x) > λ, so φ(x) = 1 ≥ φ*(x), and if f1(x) < λf0(x) then L(x) < λ, so φ(x) = 0 ≤ φ*(x), proving the assertion. Now if a function is always nonnegative, its integral must be nonnegative, so

∫ [φ(x) − φ*(x)][f1(x) − λf0(x)] dx ≥ 0.


The terms involving f0 translate to statements about type 1 errors, and the terms involving f1 translate to statements about type 2 errors. Thus

(1 − β) − (1 − β*) − λ(α − α*) ≥ 0,

which says that β* − β ≥ λ(α − α*) ≥ 0, completing the proof.
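To make the lemma concrete, here is a small Monte Carlo check (entirely my own) for the simple-vs-simple setup of Example 12.4, where f0 is uniform on (0,1) and f1(x) = 3x²; rejecting when L(x) = 3x² > λ is the same as rejecting when x > √(λ/3).

import numpy as np

rng = np.random.default_rng(1)
lam = 3 * 0.95 ** 2                        # cutoff c = sqrt(lam / 3) = .95, so alpha = .05
x0 = rng.uniform(0, 1, 100_000)            # samples under H0 (uniform on (0,1))
x1 = rng.uniform(0, 1, 100_000) ** (1/3)   # samples under H1: F1(x) = x^3, inverse transform

alpha_hat = np.mean(3 * x0 ** 2 > lam)     # estimated P(reject H0 | H0 true)
beta_hat = np.mean(3 * x1 ** 2 <= lam)     # estimated P(accept H0 | H1 true)
print(alpha_hat, beta_hat)                 # roughly .05 and (.95)^3 = .857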

    12.7 Randomization

If L(x) = λ, then "do anything" means that randomization is possible, e.g., we can flip a possibly biased coin to decide whether or not to accept H0. (This may be significant in the discrete case, where L(x) = λ may have positive probability.) Statisticians tend to frown on this practice because two statisticians can look at exactly the same data and come to different conclusions. It is possible to adjust the significance level (by replacing "do anything" by a definite choice of either H0 or H1) to avoid randomization.

    Problems

1. Consider the problem of testing θ = θ0 vs. θ > θ0, where θ is the mean of a normal population with known variance. Assume that the sample size n is fixed. Show that the test given in Example 12.3 (reject H0 if X̄ > c) is uniformly most powerful. In other words, if we test θ = θ0 vs. θ = θ1 for any given θ1 > θ0, and we specify the probability of a type 1 error, then the probability of a type 2 error is minimized.

2. It is desired to test the null hypothesis that a die is unbiased vs. the alternative that the die is loaded, with faces 1 and 2 having probability 1/4 and faces 3, 4, 5 and 6 having probability 1/8. The die is to be tossed once. Find a most powerful test at level α = .1, and find the type 2 error probability β.

3. We wish to test a binomial random variable X with n = 400 and H0 : p = 1/2 vs. H1 : p > 1/2. The random variable Y = (X − np)/√(np(1 − p)) = (X − 200)/10 is approximately normal (0,1), and we will reject H0 if Y > c. If we specify α = .05, then c = 1.645. Thus the critical region is X > 216.45. Suppose the actual result is X = 220, so that H0 is rejected. Find the minimum value of α (sometimes called the p-value) at which the given data still lead to rejection; for smaller values of α we would reach the opposite conclusion (acceptance of H0).


    Lecture 13. Chi-Square Tests

    13.1 Introduction

Let X1, . . . , Xk be multinomial, i.e., Xi is the number of occurrences of the event Ai in n generalized Bernoulli trials (Lecture 6). Then

P{X1 = n1, . . . , Xk = nk} = [n!/(n1! · · · nk!)] p1^{n1} · · · pk^{nk}

where the ni are nonnegative integers whose sum is n. Consider k = 2. Then X1 is binomial (n, p1) and (X1 − np1)/√(np1(1 − p1)) is approximately normal (0,1). Consequently, the random variable (X1 − np1)²/np1(1 − p1) is approximately χ²(1). But

(X1 − np1)²/[np1(1 − p1)] = [(X1 − np1)²/n][1/p1 + 1/(1 − p1)] = (X1 − np1)²/np1 + (X2 − np2)²/np2.

(Note that since k = 2 we have p2 = 1 − p1 and X1 − np1 = n − X2 − np1 = np2 − X2 = −(X2 − np2), and the outer minus sign disappears when squaring.) Therefore [(X1 − np1)²/np1] + [(X2 − np2)²/np2] is approximately χ²(1). More generally, it can be shown that

Q = Σ_{i=1}^k (Xi − npi)²/npi is approximately χ²(k − 1),

where

(Xi − npi)²/npi = (observed frequency − expected frequency)²/(expected frequency).

    We will consider three types of chi-square tests.

    13.2 Goodness of Fit

We ask whether X has a specified distribution (normal, Poisson, etc.). The null hypothesis is that the multinomial probabilities are p = (p1, . . . , pk), and the alternative is that p ≠ (p1, . . . , pk).

Suppose that P{χ²(k − 1) > c} is at the desired level of significance (for example, .05). If Q > c we will reject H0. The idea is that if H0 is in fact true, we have witnessed a rare event, so rejection is reasonable. If H0 is false, it is reasonable to expect that some of the Xi will be far from npi, so Q will be large.

Some practical considerations: Take n large enough so that each npi ≥ 5. Each time a parameter is estimated from the sample, reduce the number of degrees of freedom by 1. (A typical case: The null hypothesis is that X is Poisson (λ), but the mean λ is unknown, and is estimated by the sample mean.)
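For reference, a minimal sketch of the goodness-of-fit computation (my own code; the counts are made up, for a die hypothesized to be fair):

from scipy.stats import chi2

def goodness_of_fit(observed, probs, alpha=0.05):
    """Compute Q = sum (observed - expected)^2 / expected and compare with chi-square(k - 1)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    Q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    c = chi2.ppf(1 - alpha, df=len(observed) - 1)
    return Q, c, Q > c                      # Q > c means reject H0

# made-up counts for 120 tosses of a die hypothesized to be fair
print(goodness_of_fit([18, 25, 16, 21, 26, 14], [1/6] * 6))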


    13.3 Equality of Distributions

We ask whether two or more samples come from the same underlying distribution. The observed results are displayed in a contingency table. This is an h × k matrix whose rows are the samples and whose columns are the attributes to be observed. For example, row i might be (7, 11, 15, 13, 4), with the interpretation that in a class of 50 students taught by method of instruction i, there were 7 grades of A, 11 of B, 15 of C, 13 of D and 4 of F. The null hypothesis H0 is that there is no difference between the various methods of instruction, i.e., P(A) is the same for each group, and similarly for the probabilities of the other grades. We estimate P(A) from the sample by adding all entries in column A and dividing by the total number of observations in the entire experiment. We estimate P(B), P(C), P(D) and P(F) in a similar fashion. The expected frequencies in row i are found by multiplying the grade probabilities by the number of entries in row i.

If there are h groups (samples), each with k attributes, then each group generates a chi-square (k − 1), and k − 1 probabilities are estimated from the sample (the last probability is determined). The number of degrees of freedom is h(k − 1) − (k − 1) = (h − 1)(k − 1), call it r. If P{χ²(r) > c} is the desired significance level, we reject H0 if the chi-square statistic is greater than c.

    13.4 Testing For Independence

Again we have a contingency table with h rows corresponding to the possible values xi of a random variable X, and k columns corresponding to the possible values yj of a random variable Y. We are testing the null hypothesis that X and Y are independent.

Let Ri be the sum of the entries in row i, and let Cj be the sum of the entries in column j. Then the sum of all observations is T = Σ_i Ri = Σ_j Cj. We estimate P{X = xi} by Ri/T, and P{Y = yj} by Cj/T. Under the independence hypothesis H0,

P{X = xi, Y = yj} = P{X = xi}P{Y = yj} = RiCj/T².

Thus the expected frequency of (xi, yj) is RiCj/T. (This gives another way to calculate the expected frequencies in (13.3). In that case, we estimated the j-th column probability by Cj/T, and multiplied by the sum of the entries in row i, namely Ri.)

In an h × k contingency table, the number of degrees of freedom is hk − 1 minus the number of estimated parameters:

hk − 1 − (h − 1 + k − 1) = hk − h − k + 1 = (h − 1)(k − 1).

The chi-square statistic is calculated as in (13.3). Similarly, if there are 3 attributes to be tested for independence and we form an h × k × m contingency table, the number of degrees of freedom is

hkm − 1 − [(h − 1) + (k − 1) + (m − 1)] = hkm − h − k − m + 2.
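A minimal sketch of the two-way independence test (my own code); the expected frequencies are RiCj/T and Q is compared with the χ²((h − 1)(k − 1)) critical value.

from scipy.stats import chi2

def independence_test(table, alpha=0.05):
    """Chi-square test of independence for an h x k contingency table of counts."""
    h, k = len(table), len(table[0])
    R = [sum(row) for row in table]                             # row sums R_i
    C = [sum(table[i][j] for i in range(h)) for j in range(k)]  # column sums C_j
    T = sum(R)
    Q = sum((table[i][j] - R[i] * C[j] / T) ** 2 / (R[i] * C[j] / T)
            for i in range(h) for j in range(k))
    c = chi2.ppf(1 - alpha, df=(h - 1) * (k - 1))
    return Q, c, Q > c                                          # Q > c means reject independence

print(independence_test([[20, 30, 25], [30, 20, 25]]))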

    Problems

1. Use a chi-square procedure to test the null hypothesis that a random variable X has the following distribution:

P{X = 1} = .5,   P{X = 2} = .3,   P{X = 3} = .2.


We take 100 independent observations of X, and it is observed that 1 occurs 40 times, 2 occurs 33 times, and 3 occurs 27 times. Determine whether or not we will reject the null hypothesis at significance level .05.

2. Use a chi-square test to decide (at significance level .05) whether the two samples corresponding to the rows of the contingency table below came from the same underlying distribution.

              A     B     C
  Sample 1   33   147   114
  Sample 2   67   153    86

3. Suppose we are testing for independence in a 2 × 2 contingency table

  a  b
  c  d

Show that the chi-square statistic is

(ad − bc)²(a + b + c + d) / [(a + b)(c + d)(a + c)(b + d)].

(The number of degrees of freedom is 1 × 1 = 1.)


    Lecture 14. Sufficient Statistics

    14.1 Definitions and Comments

Let X1, . . . , Xn be iid with P{Xi = 1} = θ and P{Xi = 0} = 1 − θ, so P{Xi = x} = θ^x(1 − θ)^{1−x}, x = 0, 1. Let Y be a statistic for θ, i.e., a function of the observables X1, . . . , Xn. In this case we take Y = X1 + · · · + Xn, the total number of successes in n Bernoulli trials with probability of success θ on a given trial.

We claim that the conditional distribution of X1, . . . , Xn given Y is free of θ, in other words, does not depend on θ. We say that Y is sufficient for θ.

To prove this, note that

P{X1 = x1, . . . , Xn = xn | Y = y} = P{X1 = x1, . . . , Xn = xn, Y = y} / P{Y = y}.

This is 0 unless y = x1 + · · · + xn, in which case we get

θ^y(1 − θ)^{n−y} / [(n choose y) θ^y(1 − θ)^{n−y}] = 1/(n choose y).

For example, if we know that there were 3 heads in 5 tosses, the probability that the actual tosses were HTHHT is 1/(5 choose 3) = 1/10.

    14.2 The Key Idea

For the purpose of making a statistical decision, we can ignore the individual random variables Xi and base the decision entirely on X1 + · · · + Xn.

Suppose that statistician A observes X1, . . . , Xn and makes a decision. Statistician B observes Y = X1 + · · · + Xn only, and constructs X'1, . . . , X'n according to the conditional distribution of X1, . . . , Xn given Y, i.e.,

P{X'1 = x1, . . . , X'n = xn | Y = y} = 1/(n choose y).

This construction is possible because the conditional distribution does not depend on the unknown parameter θ. We will show that under θ, (X'1, . . . , X'n) and (X1, . . . , Xn) have exactly the same distribution, so anything A can do, B can do at least as well, even though B has less information.

Given x1, . . . , xn, let y = x1 + · · · + xn. The only way we can have X'1 = x1, . . . , X'n = xn is if Y = y and then B's experiment produces X'1 = x1, . . . , X'n = xn given y. Thus

P{X'1 = x1, . . . , X'n = xn} = P{Y = y} P{X'1 = x1, . . . , X'n = xn | Y = y}

= (n choose y) θ^y(1 − θ)^{n−y} · [1/(n choose y)] = θ^y(1 − θ)^{n−y} = P{X1 = x1, . . . , Xn = xn}.
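A small simulation (my own) of statistician B's construction in the Bernoulli case: given Y = y, B scatters the y successes uniformly at random among the n positions, so every arrangement has probability 1/(n choose y), and the reconstructed vector has the same distribution as the original data.

import numpy as np

rng = np.random.default_rng(2)

def statistician_B(y, n):
    """Reconstruct a Bernoulli sample given only Y = y: all C(n, y) arrangements are equally likely."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=y, replace=False)] = 1
    return x

theta, n = 0.3, 5
original = rng.binomial(1, theta, size=n)        # what statistician A sees
reconstructed = statistician_B(original.sum(), n)
print(original, reconstructed)                   # same joint distribution under theta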


    14.3 The Factorization Theorem

Let Y = u(X) be a statistic for θ (X can be (X1, . . . , Xn), and usually is). Then Y is sufficient for θ if and only if the density fθ(x) of X under θ can be factored as fθ(x) = g(θ, u(x))h(x).

[In the Bernoulli case, fθ(x1, . . . , xn) = θ^y(1 − θ)^{n−y} where y = u(x) = Σ_{i=1}^n xi and h(x) = 1.]

Proof. (Discrete case). If Y is sufficient, then

Pθ{X = x} = Pθ{X = x, Y = u(x)} = Pθ{Y = u(x)} P{X = x | Y = u(x)} = g(θ, u(x))h(x).

Conversely, assume fθ(x) = g(θ, u(x))h(x). Then

Pθ{X = x | Y = y} = Pθ{X = x, Y = y} / Pθ{Y = y}.

This is 0 unless y = u(x), in which case it becomes

Pθ{X = x} / Pθ{Y = y} = g(θ, u(x))h(x) / Σ_{z: u(z)=y} g(θ, u(z))h(z).

The g terms in both numerator and denominator are g(θ, y), which can be cancelled to obtain

P{X = x | Y = y} = h(x) / Σ_{z: u(z)=y} h(z),

which is free of θ.

    14.4 Example

Let X1, . . . , Xn be iid, each normal (μ, σ²), so that

f(x1, . . . , xn) = (1/(σ√(2π)))^n exp[ −(1/(2σ²)) Σ_{i=1}^n (xi − μ)² ].

Take θ = (μ, σ²) and let x̄ = (1/n) Σ_{i=1}^n xi, s² = (1/n) Σ_{i=1}^n (xi − x̄)². Then

xi − x̄ = (xi − μ) − (x̄ − μ)

and

s² = (1/n)[ Σ_{i=1}^n (xi − μ)² − 2(x̄ − μ) Σ_{i=1}^n (xi − μ) + n(x̄ − μ)² ].


Thus

s² = (1/n) Σ_{i=1}^n (xi − μ)² − (x̄ − μ)².

The joint density is given by

f(x1, . . . , xn) = (2πσ²)^{−n/2} e^{−ns²/2σ²} e^{−n(x̄−μ)²/2σ²}.

If μ and σ² are both unknown then (X̄, S²) is sufficient (take h(x) = 1). If σ² is known then we can take h(x) = (2πσ²)^{−n/2} e^{−ns²/2σ²}, θ = μ, and X̄ is sufficient. If μ is known then (with h(x) = 1) θ = σ² and Σ_{i=1}^n (Xi − μ)² is sufficient.

    Problems

In Problems 1-6, show that the given statistic u(X) = u(X1, . . . , Xn) is sufficient for θ and find appropriate functions g and h for the factorization theorem to apply.

1. The Xi are Poisson (θ) and u(X) = X1 + · · · + Xn.

2. The Xi have density A(θ)B(xi), 0 < xi < θ (and 0 elsewhere), where θ is a positive real number; u(X) = max Xi. As a special case, the Xi are uniformly distributed between 0 and θ, and A(θ) = 1/θ, B(xi) = 1 on (0, θ).

3. The Xi are geometric with parameter θ, i.e., if θ is the probability of success on a given Bernoulli trial, then P{Xi = x} = (1 − θ)^x θ is the probability that there will be x failures followed by the first success; u(X) = Σ_{i=1}^n Xi.

4. The Xi have the exponential density (1/θ)e^{−x/θ}, x > 0, and u(X) = Σ_{i=1}^n Xi.

5. The Xi have the beta density with parameters a = θ and b = 2, and u(X) = Π_{i=1}^n Xi.

6. The Xi have the gamma density with parameters α = θ and β, an arbitrary positive number, and u(X) = Π_{i=1}^n Xi.

7. Show that the result in (14.2), that statistician B can do at least as well as statistician A, holds in the general case of arbitrary iid random variables Xi.


    Lecture 15. Rao-Blackwell Theorem

    15.1 Background From Basic Probability

To better understand the steps leading to the Rao-Blackwell theorem, consider a typical two stage experiment:

Step 1. Observe a random variable X with density (1/2)x²e^{−x}, x > 0.

Step 2. If X = x, let Y be uniformly distributed on (0, x).

Find E(Y).

Method 1, via the joint density:

f(x, y) = fX(x)fY(y|x) = (1/2)x²e^{−x} (1/x) = (1/2)xe^{−x},   0 < y < x.

In general, E[g(X, Y)] = ∫∫ g(x, y)f(x, y) dx dy. In this case, g(x, y) = y and

E(Y) = ∫_{x=0}^∞ ∫_{y=0}^x y (1/2)xe^{−x} dy dx = ∫_0^∞ (x³/4)e^{−x} dx = 3!/4 = 3/2.

Method 2, via the theorem of total expectation:

E(Y) = ∫ fX(x) E(Y | X = x) dx.

Method 2 works well when the conditional expectation is easy to compute. In this case it is x/2 by inspection. Thus

E(Y) = ∫_0^∞ (1/2)x²e^{−x} (x/2) dx = 3/2, as before.
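A quick Monte Carlo sanity check of E(Y) = 3/2 (my own sketch); the density (1/2)x²e^{−x} is the gamma density with α = 3, β = 1.

import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(shape=3.0, scale=1.0, size=200_000)  # density (1/2) x^2 e^{-x}, x > 0
y = rng.uniform(0.0, x)                            # given X = x, Y is uniform on (0, x)
print(y.mean())                                    # close to 3/2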

    15.2 Comment On Notation

If, for example, it turns out that E(Y | X = x) = x² + 3x + 4, we can write E(Y | X) = X² + 3X + 4. Thus E(Y | X) is a function g(X) of the random variable X. When X = x we have g(x) = E(Y | X = x).

    We now proceed to the Rao-Blackwell theorem via several preliminary lemmas.

    15.3 Lemma

E[E(X2 | X1)] = E(X2).

    Proof. Let g(X1) = E(X2|X1). Then

E[g(X1)] = ∫ g(x)f1(x) dx = ∫ E(X2 | X1 = x)f1(x) dx = E(X2)

    by the theorem of total expectation.


    15.4 Lemma

If μi = E(Xi), i = 1, 2, then E[{X2 − E(X2|X1)}{E(X2|X1) − μ2}] = 0.

Proof. The expectation is

∫∫ [x2 − E(X2|X1 = x1)][E(X2|X1 = x1) − μ2] f1(x1)f2(x2|x1) dx1 dx2

= ∫ f1(x1)[E(X2|X1 = x1) − μ2] { ∫ [x2 − E(X2|X1 = x1)] f2(x2|x1) dx2 } dx1.

The inner integral (with respect to x2) is E(X2|X1 = x1) − E(X2|X1 = x1) = 0, and the result follows.

    15.5 Lemma

Var X2 ≥ Var[E(X2|X1)].

Proof. We have

Var X2 = E[(X2 − μ2)²] = E[ ({X2 − E(X2|X1)} + {E(X2|X1) − μ2})² ]