ash - 2007 - lectures on statistics.pdf
TRANSCRIPT
-
5/19/2018 Ash - 2007 - Lectures On Statistics.pdf
1/113
1
Lectures On Statistics
Robert B. Ash
Preface
These notes are based on a course that I gave at UIUC in 1996 and again in 1997. No
prior knowledge of statistics is assumed. A standard first course in probability is a pre-
requisite, but the first 8 lectures review results from basic probability that are important
in statistics. Some exposure to matrix algebra is needed to cope with the multivariate
normal distribution in Lecture 21, and there is a linear algebra review in Lecture 19. Here
are the lecture titles:
1. Transformation of Random Variables
2. Jacobians
3. Moment-generating functions
4. Sampling from a normal population
5. The T and F distributions
6. Order statistics
7. The weak law of large numbers
8. The central limit theorem
9. Estimation
10. Confidence intervals
11. More confidence intervals
12. Hypothesis testing
13. Chi square tests
14. Sufficient statistics
15. Rao-Blackwell theorem
16. Lehmann-Scheffé theorem
17. Complete sufficient statistics for the exponential class
18. Bayes estimates
19. Linear algebra review
20. Correlation
21. The multivariate normal distribution
22. The bivariate normal distribution
23. Cramer-Rao inequality
24. Nonparametric statistics
25. The Wilcoxon test
Copyright © 2007 by Robert B. Ash. Paper or electronic copies for personal use may be
made freely without explicit permission of the author. All other rights are reserved.
Lecture 1. Transformation of Random Variables
Suppose we are given a random variable X with density f_X(x). We apply a function g to produce a random variable Y = g(X). We can think of X as the input to a black box, and Y as the output. We wish to find the density or distribution function of Y. We illustrate the technique for the example in Figure 1.1.
[Figure 1.1: X has density f_X(x) = 1/2 on -1 \le x \le 0 and f_X(x) = (1/2)e^{-x} for x > 0; the transformation is Y = X^2, so \{Y \le y\} corresponds to -\sqrt{y} \le X \le \sqrt{y}.]
The distribution function method finds F_Y directly, and then f_Y by differentiation. We have F_Y(y) = 0 for y < 0.
Case 1. 0 \le y \le 1, so that -\sqrt{y} falls in [-1, 0] (Figure 1.2). Then

F_Y(y) = \frac{1}{2}\sqrt{y} + \int_0^{\sqrt{y}} \frac{1}{2}e^{-x}\,dx = \frac{1}{2}\sqrt{y} + \frac{1}{2}(1 - e^{-\sqrt{y}}).

Case 2. y > 1, so that -\sqrt{y} < -1 (Figure 1.3). Then

F_Y(y) = \frac{1}{2} + \int_0^{\sqrt{y}} \frac{1}{2}e^{-x}\,dx = \frac{1}{2} + \frac{1}{2}(1 - e^{-\sqrt{y}}).

The density of Y is 0 for y < 0.
[Figure 1.3: for y > 1 the interval [-\sqrt{y}, \sqrt{y}] covers all of [-1, 0], so only the right tail of f_X contributes.]
Differentiating,

f_Y(y) = \frac{1}{4\sqrt{y}}\,(1 + e^{-\sqrt{y}}), \quad 0 < y \le 1,

and f_Y(y) = \frac{1}{4\sqrt{y}}\,e^{-\sqrt{y}} for y > 1. See Figure 1.4 for a sketch of f_Y and F_Y. (You can take f_Y(y) to be anything you like at y = 1 because \{Y = 1\} has probability zero.)
[Figure 1.4: sketches of f_Y(y) and F_Y(y); F_Y(y) = \frac{1}{2}\sqrt{y} + \frac{1}{2}(1 - e^{-\sqrt{y}}) on [0, 1] and \frac{1}{2} + \frac{1}{2}(1 - e^{-\sqrt{y}}) for y > 1.]
The density function method finds f_Y directly, and then F_Y by integration; see Figure 1.5. We have f_Y(y)\,|dy| = f_X(\sqrt{y})\,dx + f_X(-\sqrt{y})\,dx; we write |dy| because probabilities are never negative. Thus

f_Y(y) = \frac{f_X(\sqrt{y})}{|dy/dx|_{x=\sqrt{y}}} + \frac{f_X(-\sqrt{y})}{|dy/dx|_{x=-\sqrt{y}}}

with y = x^2, dy/dx = 2x, so

f_Y(y) = \frac{f_X(\sqrt{y})}{2\sqrt{y}} + \frac{f_X(-\sqrt{y})}{2\sqrt{y}}.

(Note that |-2\sqrt{y}| = 2\sqrt{y}.) We have f_Y(y) = 0 for y < 0.
Case 1. 0 < y \le 1 (see Figure 1.2).

f_Y(y) = \frac{(1/2)e^{-\sqrt{y}}}{2\sqrt{y}} + \frac{1/2}{2\sqrt{y}} = \frac{1}{4\sqrt{y}}\,(1 + e^{-\sqrt{y}}).
Case 2. y > 1 (see Figure 1.3).

f_Y(y) = \frac{(1/2)e^{-\sqrt{y}}}{2\sqrt{y}} + 0 = \frac{1}{4\sqrt{y}}\,e^{-\sqrt{y}}

as before.
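As a quick numerical cross-check (a sketch, not part of the original notes; the mixed uniform/exponential density of X and the two-branch distribution function are exactly the ones derived above), a short simulation compares the empirical distribution function of Y = X^2 with F_Y:

```python
import math
import random

random.seed(0)

def sample_x():
    # X has density 1/2 on [-1, 0] and (1/2)e^{-x} for x > 0
    if random.random() < 0.5:
        return random.uniform(-1.0, 0.0)
    return random.expovariate(1.0)

def F_Y(y):
    # distribution function of Y = X^2, two cases as derived above
    if y <= 0:
        return 0.0
    s = math.sqrt(y)
    if y <= 1:
        return 0.5 * s + 0.5 * (1.0 - math.exp(-s))
    return 0.5 + 0.5 * (1.0 - math.exp(-s))

N = 200_000
ys = [sample_x() ** 2 for _ in range(N)]
for y0 in (0.25, 0.5, 1.5, 4.0):
    emp = sum(1 for y in ys if y <= y0) / N
    print(y0, round(emp, 4), round(F_Y(y0), 4))
```

The empirical and exact values agree to about two decimal places, and the two branches of F_Y match at y = 1, as they must.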
[Figure 1.5: Y = X^2 maps both x = -\sqrt{y} and x = \sqrt{y} to y, so both points contribute to f_Y(y).]
The distribution function method generalizes to situations where we have a single output but more than one input. For example, let X and Y be independent, each uniformly distributed on [0, 1]. The distribution function of Z = X + Y is

F_Z(z) = P\{X + Y \le z\} = \iint_{x+y \le z} f_{XY}(x, y)\,dx\,dy

with f_{XY}(x, y) = f_X(x)f_Y(y) by independence. Now F_Z(z) = 0 for z < 0 and F_Z(z) = 1 for z > 2 (because 0 \le Z \le 2).
Case 1. If 0 \le z \le 1, then F_Z(z) is the shaded area in Figure 1.6, which is z^2/2.
Case 2. If 1 \le z \le 2, then F_Z(z) is the shaded area in Figure 1.7, which is 1 - [(2-z)^2/2].
Thus (see Figure 1.8)

f_Z(z) = \begin{cases} z, & 0 \le z \le 1 \\ 2 - z, & 1 \le z \le 2 \\ 0 & \text{elsewhere.} \end{cases}
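The triangular density just derived is easy to confirm by simulation; here is a minimal sketch using only the standard library (F_Z below is the two-case formula above):

```python
import random

random.seed(1)

def F_Z(z):
    # distribution function of Z = X + Y, X and Y independent uniform on [0, 1]
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2
    if z <= 2:
        return 1 - (2 - z) ** 2 / 2
    return 1.0

N = 200_000
zs = [random.random() + random.random() for _ in range(N)]
for z0 in (0.5, 1.0, 1.5):
    emp = sum(1 for z in zs if z <= z0) / N
    print(z0, round(emp, 4), F_Z(z0))
```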
Problems
1. Let X, Y, Z be independent, identically distributed (from now on, abbreviated iid) random variables, each with density f(x) = 6x^5 for 0 \le x \le 1, and 0 elsewhere. Find the distribution and density functions of the maximum of X, Y and Z.
2. Let X and Y be independent, each with density e^{-x}, x \ge 0. Find the distribution (from now on, an abbreviation for "find the distribution or density function") of Z = Y/X.
3. A discrete random variable X takes values x_1, \dots, x_n, each with probability 1/n. Let Y = g(X) where g is an arbitrary real-valued function. Express the probability function of Y (p_Y(y) = P\{Y = y\}) in terms of g and the x_i.
[Figures 1.6 and 1.7: the region \{x + y \le z\} inside the unit square; for 0 \le z \le 1 it is a triangle with legs z, and for 1 \le z \le 2 it is the square minus a triangle with legs 2 - z.]
[Figure 1.8: the triangular density f_Z(z) on [0, 2], peaking at z = 1.]
4. A random variable X has density f(x) = ax^2 on the interval [0, b]. Find the density of Y = X^3.
5. The Cauchy density is given by f(y) = 1/[\pi(1 + y^2)] for all real y. Show that one way to produce this density is to take the tangent of a random variable X that is uniformly distributed between -\pi/2 and \pi/2.
Lecture 2. Jacobians
We need this idea to generalize the density function method to problems where there are k inputs and k outputs, with k \ge 2. However, if there are k inputs and j < k outputs, often extra outputs can be introduced, as we will see later in the lecture.
2.1 The Setup
Let X = X(U, V), Y = Y(U, V). Assume a one-to-one transformation, so that we can solve for U and V. Thus U = U(X, Y), V = V(X, Y). Look at Figure 2.1. If u changes by du then x changes by (\partial x/\partial u)\,du and y changes by (\partial y/\partial u)\,du. Similarly, if v changes by dv then x changes by (\partial x/\partial v)\,dv and y changes by (\partial y/\partial v)\,dv. The small rectangle in the u-v plane corresponds to a small parallelogram in the x-y plane (Figure 2.2), with A = (\partial x/\partial u, \partial y/\partial u, 0)\,du and B = (\partial x/\partial v, \partial y/\partial v, 0)\,dv. The area of the parallelogram is |A \times B|, and

A \times B = \begin{vmatrix} I & J & K \\ \partial x/\partial u & \partial y/\partial u & 0 \\ \partial x/\partial v & \partial y/\partial v & 0 \end{vmatrix} du\,dv = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} du\,dv\,K.

(A determinant is unchanged if we transpose the matrix, i.e., interchange rows and columns.)
[Figure 2.1: a du by dv rectangle R in the u-v plane. Figure 2.2: its image S in the x-y plane, a parallelogram spanned by the vectors A and B.]
2.2 Definition and Discussion
The Jacobian of the transformation is

J = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}, \quad \text{written as } \frac{\partial(x, y)}{\partial(u, v)}.
Thus |A \times B| = |J|\,du\,dv. Now P\{(X, Y) \in S\} = P\{(U, V) \in R\}; in other words, f_{XY}(x, y) times the area of S is f_{UV}(u, v) times the area of R. Thus

f_{XY}(x, y)\,|J|\,du\,dv = f_{UV}(u, v)\,du\,dv

and

f_{UV}(u, v) = f_{XY}(x, y)\left|\frac{\partial(x, y)}{\partial(u, v)}\right|.

The absolute value of the Jacobian \partial(x, y)/\partial(u, v) gives a magnification factor for area in going from u-v coordinates to x-y coordinates. The magnification factor going the other way is |\partial(u, v)/\partial(x, y)|. But the overall magnification factor from u-v to u-v is 1, so

f_{UV}(u, v) = \frac{f_{XY}(x, y)}{|\partial(u, v)/\partial(x, y)|}.

In this formula, we must substitute x = x(u, v), y = y(u, v) to express the final result in terms of u and v.
In three dimensions, a small rectangular box with volume du\,dv\,dw corresponds to a parallelepiped in xyz space, determined by the vectors

A = \left(\frac{\partial x}{\partial u}, \frac{\partial y}{\partial u}, \frac{\partial z}{\partial u}\right) du, \quad B = \left(\frac{\partial x}{\partial v}, \frac{\partial y}{\partial v}, \frac{\partial z}{\partial v}\right) dv, \quad C = \left(\frac{\partial x}{\partial w}, \frac{\partial y}{\partial w}, \frac{\partial z}{\partial w}\right) dw.

The volume of the parallelepiped is the absolute value of the dot product of A with B \times C, and the dot product can be written as a determinant with rows (or columns) A, B, C. This determinant is the Jacobian of x, y, z with respect to u, v, w [written \partial(x, y, z)/\partial(u, v, w)], times du\,dv\,dw. The volume magnification from uvw to xyz space is |\partial(x, y, z)/\partial(u, v, w)|, and we have

f_{UVW}(u, v, w) = \frac{f_{XYZ}(x, y, z)}{|\partial(u, v, w)/\partial(x, y, z)|}

with x = x(u, v, w), y = y(u, v, w), z = z(u, v, w).
The Jacobian technique extends to higher dimensions. The transformation formula is a natural generalization of the two- and three-dimensional cases:

f_{Y_1 \cdots Y_n}(y_1, \dots, y_n) = \frac{f_{X_1 \cdots X_n}(x_1, \dots, x_n)}{|\partial(y_1, \dots, y_n)/\partial(x_1, \dots, x_n)|}

where

\frac{\partial(y_1, \dots, y_n)}{\partial(x_1, \dots, x_n)} = \begin{vmatrix} \partial y_1/\partial x_1 & \cdots & \partial y_1/\partial x_n \\ \vdots & & \vdots \\ \partial y_n/\partial x_1 & \cdots & \partial y_n/\partial x_n \end{vmatrix}.

To help you remember the formula, think f(y)\,dy = f(x)\,dx.
2.3 A Typical Application
Let X and Y be independent, positive random variables with densities f_X and f_Y, and let Z = XY. We find the density of Z by introducing a new random variable W, as follows:

Z = XY, \quad W = Y

(W = X would be equally good). The transformation is one-to-one because we can solve for X, Y in terms of Z, W by X = Z/W, Y = W. In a problem of this type, we must always pay attention to the range of the variables: x > 0, y > 0 is equivalent to z > 0, w > 0. Now

f_{ZW}(z, w) = \frac{f_{XY}(x, y)}{|\partial(z, w)/\partial(x, y)|}\bigg|_{x=z/w,\ y=w}

with

\frac{\partial(z, w)}{\partial(x, y)} = \begin{vmatrix} \partial z/\partial x & \partial z/\partial y \\ \partial w/\partial x & \partial w/\partial y \end{vmatrix} = \begin{vmatrix} y & x \\ 0 & 1 \end{vmatrix} = y.
Thus

f_{ZW}(z, w) = \frac{f_X(x)f_Y(y)}{w} = \frac{f_X(z/w)f_Y(w)}{w}

and we are left with the problem of finding the marginal density from a joint density:

f_Z(z) = \int_{-\infty}^{\infty} f_{ZW}(z, w)\,dw = \int_0^{\infty} \frac{1}{w}\,f_X(z/w)f_Y(w)\,dw.
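As a concrete instance of the marginal formula (the choice f_X = f_Y = e^{-x}, exponential with mean 1, is an assumption made only for this example), the following sketch compares a Monte Carlo density estimate of Z = XY with a direct numerical evaluation of the integral:

```python
import math
import random

random.seed(2)

def f_Z(z, steps=4000, wmax=30.0):
    # numerical evaluation of f_Z(z) = integral_0^inf (1/w) f_X(z/w) f_Y(w) dw
    # with f_X(x) = f_Y(x) = e^{-x}, x > 0
    h = wmax / steps
    total = 0.0
    for i in range(1, steps + 1):
        w = i * h
        total += (1.0 / w) * math.exp(-z / w) * math.exp(-w) * h
    return total

# Monte Carlo estimate of the density near z0 via P{z0 - d < Z < z0 + d}/(2d)
N = 400_000
d = 0.05
zs = [random.expovariate(1.0) * random.expovariate(1.0) for _ in range(N)]
for z0 in (0.5, 1.0, 2.0):
    emp = sum(1 for z in zs if z0 - d < z < z0 + d) / (N * 2 * d)
    print(z0, round(emp, 3), round(f_Z(z0), 3))
```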
Problems
1. The joint density of two random variables X_1 and X_2 is f(x_1, x_2) = 2e^{-x_1}e^{-x_2}, where 0 < x_1 < x_2 < \infty. Find the joint density of Y_1 = 2X_1 and Y_2 = X_2 - X_1, and show that Y_1 and Y_2 are independent.
3. The transformation equations are given by Y_1 = X_1/(X_1 + X_2), Y_2 = (X_1 + X_2)/(X_1 + X_2 + X_3), Y_3 = X_1 + X_2 + X_3. As before, find the joint density of the Y_i and show that Y_1, Y_2 and Y_3 are independent.
Comments on the Problem Set
In Problem 3, notice that Y_1Y_2Y_3 = X_1 and Y_2Y_3 = X_1 + X_2, so X_2 = Y_2Y_3 - Y_1Y_2Y_3 and X_3 = (X_1 + X_2 + X_3) - (X_1 + X_2) = Y_3 - Y_2Y_3.
If f_{XY}(x, y) = g(x)h(y) for all x, y, then X and Y are independent, because

f(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{g(x)h(y)}{g(x)\int_{-\infty}^{\infty} h(y)\,dy}
which does not depend on x. The set of points where g(x) = 0 (equivalently f_X(x) = 0) can be ignored because it has probability zero. It is important to realize that in this argument, "for all x, y" means that x and y must be allowed to vary independently of each other, so the set of possible x and y must be of the rectangular form a < x < b, c < y < d. (The constants a, b, c, d can be infinite.) For example, if f_{XY}(x, y) = 2e^{-x}e^{-y} for 0 < y < x, and 0 elsewhere, then X and Y are not independent. Knowing x forces 0 < y < x, so the conditional distribution of Y given X = x certainly depends on x. Note that f_{XY}(x, y) is not a function of x alone times a function of y alone. We have

f_{XY}(x, y) = 2e^{-x}e^{-y}\,I[0 < y < x]

where the indicator I is 1 for 0 < y < x and 0 elsewhere.
In Jacobian problems, pay close attention to the range of the variables. For example, in Problem 1 we have y_1 = 2x_1, y_2 = x_2 - x_1, so x_1 = y_1/2, x_2 = (y_1/2) + y_2. From these equations it follows that 0 < x_1 < x_2 < \infty is equivalent to y_1 > 0, y_2 > 0.
Lecture 3. Moment-Generating Functions
3.1 Definition
The moment-generating function of a random variable X is defined by

M(t) = M_X(t) = E[e^{tX}]

where t is a real number. To see the reason for the terminology, note that M(t) is the expectation of 1 + tX + t^2X^2/2! + t^3X^3/3! + \cdots. If \mu_n = E(X^n), the n-th moment of X, and we can take the expectation term by term, then

M(t) = 1 + \mu_1 t + \frac{\mu_2 t^2}{2!} + \cdots + \frac{\mu_n t^n}{n!} + \cdots.

Since the coefficient of t^n in the Taylor expansion is M^{(n)}(0)/n!, where M^{(n)} is the n-th derivative of M, we have

\mu_n = M^{(n)}(0).
3.2 The Key Theorem
If Y = \sum_{i=1}^n X_i where X_1, \dots, X_n are independent, then M_Y(t) = \prod_{i=1}^n M_{X_i}(t).

Proof. First note that if X and Y are independent, then

E[g(X)h(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y)f_{XY}(x, y)\,dx\,dy.

Since f_{XY}(x, y) = f_X(x)f_Y(y), the double integral becomes

\int_{-\infty}^{\infty} g(x)f_X(x)\,dx \int_{-\infty}^{\infty} h(y)f_Y(y)\,dy = E[g(X)]\,E[h(Y)]

and similarly for more than two random variables. Now if Y = X_1 + \cdots + X_n with the X_i independent, we have

M_Y(t) = E[e^{tY}] = E[e^{tX_1} \cdots e^{tX_n}] = E[e^{tX_1}] \cdots E[e^{tX_n}] = M_{X_1}(t) \cdots M_{X_n}(t).
3.3 The Main Application
Given independent random variables X_1, \dots, X_n with densities f_1, \dots, f_n respectively, find the density of Y = \sum_{i=1}^n X_i.
Step 1. Compute M_i(t), the moment-generating function of X_i, for each i.
Step 2. Compute M_Y(t) = \prod_{i=1}^n M_i(t).
Step 3. From M_Y(t) find f_Y(y).
This technique is known as a transform method. Notice that the moment-generating function and the density of a random variable are related by

M(t) = \int_{-\infty}^{\infty} e^{tx}f(x)\,dx.

With t replaced by -s we have a Laplace transform, and with t replaced by it we have a Fourier transform. The strategy works because at Step 3, the moment-generating function determines the density uniquely. (This is a theorem from Laplace or Fourier transform theory.)
3.4 Examples
1. Bernoulli Trials. Let X be the number of successes in n trials with probability of success p on a given trial. Then X = X_1 + \cdots + X_n, where X_i = 1 if there is a success on trial i and X_i = 0 if there is a failure on trial i. Thus

M_i(t) = E[e^{tX_i}] = P\{X_i = 1\}e^{t \cdot 1} + P\{X_i = 0\}e^{t \cdot 0} = pe^t + q

with p + q = 1. The moment-generating function of X is

M_X(t) = (pe^t + q)^n = \sum_{k=0}^n \binom{n}{k} p^k q^{n-k} e^{tk}.

This could have been derived directly:

M_X(t) = E[e^{tX}] = \sum_{k=0}^n P\{X = k\}\,e^{tk} = \sum_{k=0}^n \binom{n}{k} p^k q^{n-k} e^{tk} = (pe^t + q)^n

by the binomial theorem.
2. Poisson. We have P\{X = k\} = e^{-\lambda}\lambda^k/k!, \ k = 0, 1, 2, \dots. Thus

M(t) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}\,e^{tk} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = \exp(-\lambda)\exp(\lambda e^t) = \exp[\lambda(e^t - 1)].

We can compute the mean and variance from the moment-generating function:

E(X) = M'(0) = \left[\exp(\lambda(e^t - 1))\,\lambda e^t\right]_{t=0} = \lambda.

Let h(\lambda, t) = \exp[\lambda(e^t - 1)]. Then

E(X^2) = M''(0) = \left[h(\lambda, t)\,\lambda e^t + \lambda e^t\,h(\lambda, t)\,\lambda e^t\right]_{t=0} = \lambda + \lambda^2

hence

\text{Var}\,X = E(X^2) - [E(X)]^2 = \lambda + \lambda^2 - \lambda^2 = \lambda.
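A simulation is consistent with E(X) = \text{Var}\,X = \lambda. The sampler below uses Knuth's multiplication method, which is not part of these notes but gives a self-contained Poisson generator:

```python
import math
import random

random.seed(3)

def poisson(lam):
    # Knuth's multiplication method: multiply uniforms until the
    # product drops below e^{-lam}; the number of factors minus 1
    # is a Poisson(lam) variate
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

lam = 4.0
N = 100_000
xs = [poisson(lam) for _ in range(N)]
mean = sum(xs) / N
var = sum((x - mean) ** 2 for x in xs) / N
print(mean, var)  # both should be close to lam = 4
```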
3. Normal (0, 1). The moment-generating function is

M(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.

Now -(x^2/2) + tx = -(1/2)(x^2 - 2tx + t^2 - t^2) = -(1/2)(x - t)^2 + (1/2)t^2, so

M(t) = e^{t^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,\exp[-(x - t)^2/2]\,dx.

The integral is the area under a normal density (mean t, variance 1), which is 1. Consequently,

M(t) = e^{t^2/2}.
4. Normal (\mu, \sigma^2). If X is normal (\mu, \sigma^2), then Y = (X - \mu)/\sigma is normal (0, 1). This is a good application of the density function method from Lecture 1:

f_Y(y) = \frac{f_X(x)}{|dy/dx|}\bigg|_{x = \mu + \sigma y} = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}.

We have X = \mu + \sigma Y, so

M_X(t) = E[e^{tX}] = e^{t\mu}E[e^{t\sigma Y}] = e^{t\mu}M_Y(\sigma t).

Thus

M_X(t) = e^{t\mu}e^{t^2\sigma^2/2}.

Remember this technique, which is especially useful when Y = aX + b and the moment-generating function of X is known.
3.5 Theorem
If X is normal (\mu, \sigma^2) and Y = aX + b, then Y is normal (a\mu + b, a^2\sigma^2).

Proof. We compute

M_Y(t) = E[e^{tY}] = E[e^{t(aX+b)}] = e^{bt}M_X(at) = e^{bt}e^{a\mu t}e^{a^2t^2\sigma^2/2}.

Thus

M_Y(t) = \exp[t(a\mu + b)]\exp(t^2a^2\sigma^2/2).
Here is another basic result.
3.6 Theorem
Let X_1, \dots, X_n be independent, with X_i normal (\mu_i, \sigma_i^2). Then Y = \sum_{i=1}^n X_i is normal with mean \mu = \sum_{i=1}^n \mu_i and variance \sigma^2 = \sum_{i=1}^n \sigma_i^2.

Proof. The moment-generating function of Y is

M_Y(t) = \prod_{i=1}^n \exp(t\mu_i + t^2\sigma_i^2/2) = \exp(t\mu + t^2\sigma^2/2).
A similar argument works for the Poisson distribution; see Problem 4.
3.7 The Gamma Distribution
First, we define the gamma function

\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1}e^{-y}\,dy, \quad \alpha > 0.

We need three properties:
(a) \Gamma(\alpha + 1) = \alpha\Gamma(\alpha), the recursion formula;
(b) \Gamma(n + 1) = n!, \ n = 0, 1, 2, \dots;
(c) \Gamma(1/2) = \sqrt{\pi}.
To prove (a), integrate by parts: \Gamma(\alpha) = \int_0^{\infty} e^{-y}\,d(y^{\alpha}/\alpha). Part (b) is a special case of (a). For (c) we make the change of variable y = z^2/2 and compute

\Gamma(1/2) = \int_0^{\infty} y^{-1/2}e^{-y}\,dy = \int_0^{\infty} \sqrt{2}\,z^{-1}e^{-z^2/2}\,z\,dz.

The second integral is \sqrt{2}\,\sqrt{2\pi} times half the area under the normal (0, 1) density; that is,

\Gamma(1/2) = \sqrt{2}\,\sqrt{2\pi}\,\tfrac{1}{2} = \sqrt{\pi}.
The gamma density is

f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha - 1}e^{-x/\beta}, \quad x > 0,

where \alpha and \beta are positive constants. The moment-generating function is

M(t) = \int_0^{\infty} [\Gamma(\alpha)\beta^{\alpha}]^{-1}x^{\alpha - 1}e^{tx}e^{-x/\beta}\,dx.

Change variables via y = (-t + (1/\beta))x to get

\int_0^{\infty} [\Gamma(\alpha)\beta^{\alpha}]^{-1}\left(\frac{y}{-t + (1/\beta)}\right)^{\alpha - 1}e^{-y}\,\frac{dy}{-t + (1/\beta)}

which reduces to

\frac{1}{\beta^{\alpha}}\,\frac{1}{(-t + (1/\beta))^{\alpha}} = (1 - \beta t)^{-\alpha}.

In this argument, t must be less than 1/\beta so that the integrals will be finite.
Since M(0) = \int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} f(x)\,dx in this case, with f \ge 0, M(0) = 1 implies that we have a legal probability density. As before, moments can be calculated efficiently from the moment-generating function:

E(X) = M'(0) = \alpha\beta(1 - \beta t)^{-\alpha - 1}\big|_{t=0} = \alpha\beta;

E(X^2) = M''(0) = \alpha(\alpha + 1)\beta^2(1 - \beta t)^{-\alpha - 2}\big|_{t=0} = \alpha(\alpha + 1)\beta^2.

Thus

\text{Var}\,X = E(X^2) - [E(X)]^2 = \alpha\beta^2.
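These moments can be checked against simulated gamma variates; random.gammavariate in the Python standard library uses the same (\alpha, \beta) parametrization as the density above, with \beta a scale parameter:

```python
import random

random.seed(4)

alpha, beta = 3.0, 2.0
N = 200_000
xs = [random.gammavariate(alpha, beta) for _ in range(N)]
mean = sum(xs) / N
var = sum((x - mean) ** 2 for x in xs) / N
print(mean, var)  # E(X) = alpha*beta = 6, Var X = alpha*beta^2 = 12
```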
3.8 Special Cases
The exponential density is a gamma density with \alpha = 1: f(x) = (1/\beta)e^{-x/\beta}, \ x \ge 0, with E(X) = \beta, E(X^2) = 2\beta^2, \text{Var}\,X = \beta^2.
A random variable X has the chi-square density with r degrees of freedom (X = \chi^2(r) for short, where r is a positive integer) if its density is gamma with \alpha = r/2 and \beta = 2. Thus

f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}\,x^{(r/2) - 1}e^{-x/2}, \quad x \ge 0,

and

M(t) = \frac{1}{(1 - 2t)^{r/2}}, \quad t < 1/2.

Therefore E[\chi^2(r)] = \alpha\beta = r and \text{Var}[\chi^2(r)] = \alpha\beta^2 = 2r.
3.9 Lemma
If X is normal (0, 1) then X^2 is \chi^2(1).

Proof. We compute the moment-generating function of X^2 directly:

M_{X^2}(t) = E[e^{tX^2}] = \int_{-\infty}^{\infty} e^{tx^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.

Let y = \sqrt{1 - 2t}\,x; the integral becomes

\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,\frac{dy}{\sqrt{1 - 2t}} = (1 - 2t)^{-1/2}

which is the moment-generating function of \chi^2(1).
3.10 Theorem
If X_1, \dots, X_n are independent, each normal (0, 1), then Y = \sum_{i=1}^n X_i^2 is \chi^2(n).

Proof. By (3.9), each X_i^2 is \chi^2(1) with moment-generating function (1 - 2t)^{-1/2}. Thus

M_Y(t) = (1 - 2t)^{-n/2} \quad \text{for } t < 1/2,

which is the moment-generating function of \chi^2(n).
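A simulation consistent with (3.10): the sum of n squared standard normals should have mean n and variance 2n, the \chi^2(n) moments computed in (3.8):

```python
import random

random.seed(5)

n = 5
N = 100_000
ys = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(N)]
mean = sum(ys) / N
var = sum((y - mean) ** 2 for y in ys) / N
print(mean, var)  # chi^2(n) has mean n = 5 and variance 2n = 10
```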
3.12 The Poisson Process
This process occurs in many physical situations, and provides an application of the gamma distribution. For example, particles can arrive at a counting device, customers at a serving counter, airplanes at an airport, or phone calls at a telephone exchange. Divide the time interval [0, t] into a large number n of small subintervals of length dt, so that n\,dt = t. If I_i, i = 1, \dots, n, is one of the small subintervals, we make the following assumptions:
(1) The probability of exactly one arrival in I_i is \lambda\,dt, where \lambda is a constant.
(2) The probability of no arrivals in I_i is 1 - \lambda\,dt.
(3) The probability of more than one arrival in I_i is zero.
(4) If A_i is the event of an arrival in I_i, then the A_i, i = 1, \dots, n, are independent.
As a consequence of these assumptions, we have n = t/dt Bernoulli trials with probability of success p = \lambda\,dt on a given trial. As dt \to 0 we have n \to \infty and p \to 0, with np = \lambda t. We conclude that the number N[0, t] of arrivals in [0, t] is Poisson (\lambda t):

P\{N[0, t] = k\} = e^{-\lambda t}(\lambda t)^k/k!, \quad k = 0, 1, 2, \dots.

Since E(N[0, t]) = \lambda t, we may interpret \lambda as the average number of arrivals per unit time.
Now let W_1 be the waiting time for the first arrival. Then

P\{W_1 > t\} = P\{\text{no arrival in } [0, t]\} = P\{N[0, t] = 0\} = e^{-\lambda t}, \quad t \ge 0.

Thus F_{W_1}(t) = 1 - e^{-\lambda t} and f_{W_1}(t) = \lambda e^{-\lambda t}, \ t \ge 0. From the formulas for the mean and variance of an exponential random variable we have E(W_1) = 1/\lambda and \text{Var}\,W_1 = 1/\lambda^2.
Let W_k be the (total) waiting time for the k-th arrival. Then W_k is the waiting time for the first arrival, plus the time after the first up to the second arrival, plus \cdots plus the time after arrival k-1 up to the k-th arrival. Thus W_k is the sum of k independent exponential random variables, and

M_{W_k}(t) = \frac{1}{(1 - (t/\lambda))^k}

so W_k is gamma with \alpha = k, \beta = 1/\lambda. Therefore

f_{W_k}(t) = \frac{1}{(k-1)!}\,\lambda^k t^{k-1}e^{-\lambda t}, \quad t \ge 0.
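Since W_k is a sum of k independent exponentials, the gamma moments E(W_k) = \alpha\beta = k/\lambda and \text{Var}\,W_k = \alpha\beta^2 = k/\lambda^2 can be verified by simulation (\lambda = 2 and k = 3 are arbitrary choices made here for illustration):

```python
import random

random.seed(6)

lam = 2.0   # arrival rate, an illustrative assumption
k = 3
N = 100_000
# W_k = sum of k independent exponential interarrival times, each with rate lam
ws = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(N)]
mean = sum(ws) / N
var = sum((w - mean) ** 2 for w in ws) / N
print(mean, var)  # gamma(alpha=k, beta=1/lam): mean k/lam = 1.5, variance k/lam^2 = 0.75
```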
Problems
1. Let X_1 and X_2 be independent, and assume that X_1 is \chi^2(r_1) and Y = X_1 + X_2 is \chi^2(r), where r > r_1. Show that X_2 is \chi^2(r_2), where r_2 = r - r_1.
2. Let X_1 and X_2 be independent, with X_i gamma with parameters \alpha_i and \beta_i, i = 1, 2. If c_1 and c_2 are positive constants, find convenient sufficient conditions under which c_1X_1 + c_2X_2 will also have the gamma distribution.
3. If X_1, \dots, X_n are independent random variables with moment-generating functions M_1, \dots, M_n, and c_1, \dots, c_n are constants, express the moment-generating function M of c_1X_1 + \cdots + c_nX_n in terms of the M_i.
4. If X_1, \dots, X_n are independent, with X_i Poisson (\lambda_i), i = 1, \dots, n, show that the sum Y = \sum_{i=1}^n X_i has the Poisson distribution with parameter \lambda = \sum_{i=1}^n \lambda_i.
5. An unbiased coin is tossed independently n_1 times and then again tossed independently n_2 times. Let X_1 be the number of heads in the first experiment, and X_2 the number of tails in the second experiment. Without using moment-generating functions, in fact without any calculation at all, find the distribution of X_1 + X_2.
Lecture 4. Sampling From a Normal Population
4.1 Definitions and Comments
Let X_1, \dots, X_n be iid. The sample mean of the X_i is

\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i

and the sample variance is

S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2.

If the X_i have mean \mu and variance \sigma^2, then

E(\overline{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}\,n\mu = \mu

and

\text{Var}\,\overline{X} = \frac{1}{n^2}\sum_{i=1}^n \text{Var}\,X_i = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \to 0 \quad \text{as } n \to \infty.
Thus \overline{X} is a good estimate of \mu. (For large n, the variance of \overline{X} is small, so \overline{X} is concentrated near its mean.) The sample variance is an average squared deviation from the sample mean, but it is a biased estimate of the true variance \sigma^2:

E[(X_i - \overline{X})^2] = E[(X_i - \mu) - (\overline{X} - \mu)]^2 = \text{Var}\,X_i + \text{Var}\,\overline{X} - 2E[(X_i - \mu)(\overline{X} - \mu)].

Notice the centralizing technique. We subtract and add back the mean of X_i, which will make the cross terms easier to handle when squaring. The above expression simplifies to

\sigma^2 + \frac{\sigma^2}{n} - 2E\Big[(X_i - \mu)\,\frac{1}{n}\sum_{j=1}^n (X_j - \mu)\Big] = \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n}E[(X_i - \mu)^2].

Thus

E[(X_i - \overline{X})^2] = \sigma^2\Big(1 + \frac{1}{n} - \frac{2}{n}\Big) = \frac{n-1}{n}\,\sigma^2.

Consequently, E(S^2) = (n-1)\sigma^2/n, not \sigma^2. Some books define the sample variance as

\frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2 = \frac{n}{n-1}\,S^2

where S^2 is our sample variance. This adjusted estimate of the true variance is unbiased (its expectation is \sigma^2), but biased does not mean bad. If we measure performance by asking for a small mean square error, the biased estimate is better in the normal case, as we will see at the end of the lecture.
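The bias E(S^2) = (n-1)\sigma^2/n shows up clearly in simulation; here is a minimal sketch (normal samples are used, although the bias computation above did not require normality):

```python
import random

random.seed(7)

mu, sigma, n = 0.0, 1.0, 5
M = 100_000
total = 0.0
for _ in range(M):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n  # S^2 with divisor n
print(total / M)  # E(S^2) = (n-1) sigma^2 / n = 0.8, not 1
```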
4.2 The Normal Case
We now assume that the X_i are normally distributed, and find the distribution of S^2. Let y_1 = \bar{x} = (x_1 + \cdots + x_n)/n, \ y_2 = x_2 - \bar{x}, \dots, y_n = x_n - \bar{x}. Then y_1 + y_2 = x_2, \ y_1 + y_3 = x_3, \dots, y_1 + y_n = x_n. Add these equations to get (n-1)y_1 + y_2 + \cdots + y_n = x_2 + \cdots + x_n, or

ny_1 + (y_2 + \cdots + y_n) = (x_2 + \cdots + x_n) + y_1. \tag{1}

But ny_1 = n\bar{x} = x_1 + \cdots + x_n, so by cancelling x_2, \dots, x_n in (1), x_1 + (y_2 + \cdots + y_n) = y_1. Thus we can solve for the x's in terms of the y's:

x_1 = y_1 - y_2 - \cdots - y_n
x_2 = y_1 + y_2
x_3 = y_1 + y_3 \tag{2}
\vdots
x_n = y_1 + y_n
The Jacobian of the transformation is

d_n = \frac{\partial(x_1, \dots, x_n)}{\partial(y_1, \dots, y_n)} = \begin{vmatrix} 1 & -1 & -1 & \cdots & -1 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \\ 1 & 0 & 0 & \cdots & 1 \end{vmatrix}

To see the pattern, look at the 4 by 4 case and expand via the last row:

\begin{vmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{vmatrix} = (-1)\begin{vmatrix} -1 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{vmatrix} + \begin{vmatrix} 1 & -1 & -1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{vmatrix}

so d_4 = 1 + d_3. In general, d_n = 1 + d_{n-1}, and since d_2 = 2 by inspection, we have d_n = n for all n \ge 2.
Now

\sum_{i=1}^n (x_i - \mu)^2 = \sum (x_i - \bar{x} + \bar{x} - \mu)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \tag{3}

because \sum (x_i - \bar{x}) = 0. By (2), x_1 - \bar{x} = x_1 - y_1 = -y_2 - \cdots - y_n and x_i - \bar{x} = x_i - y_1 = y_i for i = 2, \dots, n. (Remember that y_1 = \bar{x}.) Thus

\sum_{i=1}^n (x_i - \bar{x})^2 = (y_2 + \cdots + y_n)^2 + \sum_{i=2}^n y_i^2. \tag{4}

Now

f_{Y_1 \cdots Y_n}(y_1, \dots, y_n) = n\,f_{X_1 \cdots X_n}(x_1, \dots, x_n).
By (3) and (4), the right side becomes, in terms of the y_i's,

n\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\Big[\Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2 + n(y_1 - \mu)^2\Big]\right\}.

The joint density of Y_1, \dots, Y_n is a function of y_1 times a function of (y_2, \dots, y_n), so Y_1 and (Y_2, \dots, Y_n) are independent. Since \overline{X} = Y_1 and [by (4)] S^2 is a function of (Y_2, \dots, Y_n),

\overline{X} \text{ and } S^2 \text{ are independent.}
Dividing Equation (3) by \sigma^2 we have

\sum_{i=1}^n \left(\frac{X_i - \mu}{\sigma}\right)^2 = \frac{nS^2}{\sigma^2} + \left(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\right)^2.

But (X_i - \mu)/\sigma is normal (0, 1) and

\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \text{ is normal (0, 1)}

so \chi^2(n) = (nS^2/\sigma^2) + \chi^2(1) with the two random variables on the right independent. If M(t) is the moment-generating function of nS^2/\sigma^2, then (1 - 2t)^{-n/2} = M(t)(1 - 2t)^{-1/2}. Therefore M(t) = (1 - 2t)^{-(n-1)/2}, i.e.,

\frac{nS^2}{\sigma^2} \text{ is } \chi^2(n-1).
The random variable

T = \frac{\overline{X} - \mu}{S/\sqrt{n-1}}

is useful in situations where \mu is to be estimated but the true variance \sigma^2 is unknown. It turns out that T has a T distribution, which we study in the next lecture.
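Both boxed facts lend themselves to a numerical check: nS^2/\sigma^2 should have the \chi^2(n-1) mean n-1 and variance 2(n-1), and the sample correlation between \overline{X} and S^2 should be near zero, consistent with independence (zero correlation alone does not prove independence, of course):

```python
import random

random.seed(8)

mu, sigma, n = 1.0, 2.0, 6
M = 100_000
chis, xbars = [], []
for _ in range(M):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    xbars.append(xbar)
    chis.append(n * s2 / sigma ** 2)
mean = sum(chis) / M
var = sum((v - mean) ** 2 for v in chis) / M
mx = sum(xbars) / M
vx = sum((a - mx) ** 2 for a in xbars) / M
cov = sum((a - mx) * (b - mean) for a, b in zip(xbars, chis)) / M
corr = cov / (vx * var) ** 0.5
print(mean, var, corr)  # roughly n-1 = 5, 2(n-1) = 10, and 0
```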
4.3 Performance of Various Estimates
Let S^2 be the sample variance of iid normal (\mu, \sigma^2) random variables X_1, \dots, X_n. We will look at estimates of \sigma^2 of the form cS^2, where c is a constant. Once again employing the centralizing technique, we write

E[(cS^2 - \sigma^2)^2] = E[(cS^2 - cE(S^2) + cE(S^2) - \sigma^2)^2]

which simplifies to

c^2\,\text{Var}\,S^2 + (cE(S^2) - \sigma^2)^2.
Since nS^2/\sigma^2 is \chi^2(n-1), which has variance 2(n-1), we have n^2(\text{Var}\,S^2)/\sigma^4 = 2(n-1). Also nE(S^2)/\sigma^2 is the mean of \chi^2(n-1), which is n-1. (Or we can recall from (4.1) that E(S^2) = (n-1)\sigma^2/n.) Thus the mean square error is

\frac{2c^2\sigma^4(n-1)}{n^2} + \left(\frac{c(n-1)}{n} - 1\right)^2\sigma^4.

We can drop the \sigma^4 and use n^2 as a common denominator, which can also be dropped. We are then trying to minimize

2c^2(n-1) + c^2(n-1)^2 - 2c(n-1)n + n^2.

Differentiate with respect to c and set the result equal to zero:

4c(n-1) + 2c(n-1)^2 - 2(n-1)n = 0.

Dividing by 2(n-1), we have 2c + c(n-1) - n = 0, so c = n/(n+1). Thus the best estimate of the form cS^2 is

\frac{1}{n+1}\sum_{i=1}^n (X_i - \overline{X})^2.

If we use S^2 then c = 1. If we use the unbiased version then c = n/(n-1). Since [n/(n+1)] < 1 < [n/(n-1)] and a quadratic function decreases as we move toward its minimum, we see that the biased estimate S^2 is better than the unbiased estimate nS^2/(n-1), but neither is optimal under the minimum mean square error criterion. Explicitly, when c = n/(n-1) we get a mean square error of 2\sigma^4/(n-1), and when c = 1 we get

\frac{\sigma^4}{n^2}\left[2(n-1) + (n-1-n)^2\right] = \frac{(2n-1)\sigma^4}{n^2}

which is always smaller, because (2n-1)/n^2 < 2/(n-1) iff 2n^2 > 2n^2 - 3n + 1 iff 3n > 1, which is true for every positive integer n.
For large n all these estimates are good and the difference between their performance is small.
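The comparison of the three choices of c can be tabulated exactly from the mean square error expression above; no simulation is needed:

```python
def mse(c, n, sigma2=1.0):
    # E[(c S^2 - sigma^2)^2] = c^2 Var S^2 + (c E(S^2) - sigma^2)^2
    #                        = [2 c^2 (n-1) + (c(n-1) - n)^2] sigma^4 / n^2
    s4 = sigma2 ** 2
    return (2 * c * c * (n - 1) + (c * (n - 1) - n) ** 2) * s4 / n ** 2

n = 10
for label, c in [("best, c = n/(n+1)", n / (n + 1)),
                 ("S^2, c = 1", 1.0),
                 ("unbiased, c = n/(n-1)", n / (n - 1))]:
    print(label, round(mse(c, n), 5))
```

For n = 10 (with \sigma^4 = 1) the three mean square errors come out to 2/(n+1) \approx 0.1818, (2n-1)/n^2 = 0.19 and 2/(n-1) \approx 0.2222, in exactly the order proved above.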
Problems
1. Let X_1, \dots, X_n be iid, each normal (\mu, \sigma^2), and let \overline{X} be the sample mean. If c is a constant, we wish to make n large enough so that P\{\mu - c < \overline{X} < \mu + c\} \ge .954. Find the minimum value of n in terms of \sigma^2 and c. (It is independent of \mu.)
2. Let X_1, \dots, X_{n_1}, Y_1, \dots, Y_{n_2} be independent random variables, with the X_i normal (\mu_1, \sigma_1^2) and the Y_i normal (\mu_2, \sigma_2^2). If \overline{X} is the sample mean of the X_i and \overline{Y} is the sample mean of the Y_i, explain how to compute the probability that \overline{X} > \overline{Y}.
3. Let X_1, \dots, X_n be iid, each normal (\mu, \sigma^2), and let S^2 be the sample variance. Explain how to compute P\{a < S^2 < b\}.
4. Let S^2 be the sample variance of iid normal (\mu, \sigma^2) random variables X_i, i = 1, \dots, n. Calculate the moment-generating function of S^2 and from this, deduce that S^2 has a gamma distribution.
Lecture 5. The T and F Distributions
5.1 Definition and Discussion
The T distribution is defined as follows. Let X_1 and X_2 be independent, with X_1 normal (0, 1) and X_2 chi-square with r degrees of freedom. The random variable

Y_1 = \frac{\sqrt{r}\,X_1}{\sqrt{X_2}}

has the T distribution with r degrees of freedom.
To find the density of Y_1, let Y_2 = X_2. Then X_1 = Y_1\sqrt{Y_2}/\sqrt{r} and X_2 = Y_2. The transformation is one-to-one with -\infty < X_1 < \infty, X_2 > 0 corresponding to -\infty < Y_1 < \infty, Y_2 > 0. The Jacobian is given by

\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \sqrt{y_2/r} & y_1/(2\sqrt{ry_2}) \\ 0 & 1 \end{vmatrix} = \sqrt{y_2/r}.

Thus f_{Y_1Y_2}(y_1, y_2) = f_{X_1X_2}(x_1, x_2)\sqrt{y_2/r}, which upon substitution for x_1 and x_2 becomes

\frac{1}{\sqrt{2\pi}}\exp[-y_1^2y_2/2r]\,\frac{1}{\Gamma(r/2)2^{r/2}}\,y_2^{(r/2)-1}e^{-y_2/2}\,\sqrt{y_2/r}.

The density of Y_1 is

\frac{1}{\sqrt{2\pi}\,\Gamma(r/2)2^{r/2}\sqrt{r}}\int_0^{\infty} y_2^{[(r+1)/2]-1}\exp[-(1 + (y_1^2/r))y_2/2]\,dy_2.

With z = (1 + (y_1^2/r))y_2/2 and the observation that all factors of 2 cancel, this becomes (with y_1 replaced by t)

\frac{\Gamma((r+1)/2)}{\sqrt{\pi r}\,\Gamma(r/2)}\,\frac{1}{(1 + (t^2/r))^{(r+1)/2}}, \quad -\infty < t < \infty,

the T density with r degrees of freedom.
In sampling from a normal population, (\overline{X} - \mu)/(\sigma/\sqrt{n}) is normal (0, 1), and nS^2/\sigma^2 is \chi^2(n-1). Thus

\sqrt{n-1}\,\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \ \text{divided by} \ \sqrt{\frac{nS^2}{\sigma^2}} \ \text{is } T(n-1).

Since \sigma and \sqrt{n} disappear after cancellation, we have

\frac{\overline{X} - \mu}{S/\sqrt{n-1}} \ \text{is } T(n-1).

Advocates of defining the sample variance with n-1 in the denominator point out that one can simply replace \sigma by S in (\overline{X} - \mu)/(\sigma/\sqrt{n}) to get the T statistic.
Intuitively, we expect that for large n, (\overline{X} - \mu)/(S/\sqrt{n-1}) has approximately the same distribution as (\overline{X} - \mu)/(\sigma/\sqrt{n}), i.e., normal (0, 1). This is in fact true, as suggested by the following computation:

\left(1 + \frac{t^2}{r}\right)^{-(r+1)/2} = \left[\left(1 + \frac{t^2}{r}\right)^r\right]^{-1/2}\left(1 + \frac{t^2}{r}\right)^{-1/2} \to (e^{t^2})^{-1/2}\cdot 1 = e^{-t^2/2}

as r \to \infty.
5.2 A Preliminary Calculation
Before turning to the F distribution, we calculate the density of U = X_1/X_2, where X_1 and X_2 are independent, positive random variables. Let Y = X_2, so that X_1 = UY, X_2 = Y (X_1, X_2, U, Y are all greater than zero). The Jacobian is

\frac{\partial(x_1, x_2)}{\partial(u, y)} = \begin{vmatrix} y & u \\ 0 & 1 \end{vmatrix} = y.

Thus f_{UY}(u, y) = f_{X_1X_2}(x_1, x_2)\,y = f_{X_1}(uy)f_{X_2}(y)\,y, and the density of U is

h(u) = \int_0^{\infty} y\,f_{X_1}(uy)f_{X_2}(y)\,dy.
Now we take X_1 to be \chi^2(m), and X_2 to be \chi^2(n). The density of X_1/X_2 is

h(u) = \frac{1}{2^{(m+n)/2}\Gamma(m/2)\Gamma(n/2)}\,u^{(m/2)-1}\int_0^{\infty} y^{[(m+n)/2]-1}e^{-y(1+u)/2}\,dy.

The substitution z = y(1+u)/2 gives

h(u) = \frac{1}{2^{(m+n)/2}\Gamma(m/2)\Gamma(n/2)}\,u^{(m/2)-1}\int_0^{\infty} \frac{z^{[(m+n)/2]-1}}{[(1+u)/2]^{[(m+n)/2]-1}}\,e^{-z}\,\frac{2}{1+u}\,dz.

We abbreviate \Gamma(a)\Gamma(b)/\Gamma(a+b) by \beta(a, b). (We will have much more to say about this when we discuss the beta distribution later in the lecture.) The above formula simplifies to

h(u) = \frac{1}{\beta(m/2, n/2)}\,\frac{u^{(m/2)-1}}{(1+u)^{(m+n)/2}}, \quad u \ge 0.
5.3 Definition and Discussion
The F density is defined as follows. Let X_1 and X_2 be independent, with X_1 = \chi^2(m) and X_2 = \chi^2(n). With U as in (5.2), let

W = \frac{X_1/m}{X_2/n} = \frac{n}{m}\,U

so that

f_W(w) = f_U(u)\left|\frac{du}{dw}\right| = \frac{m}{n}\,f_U\left(\frac{m}{n}\,w\right).

Thus W has density

\frac{(m/n)^{m/2}}{\beta(m/2, n/2)}\,\frac{w^{(m/2)-1}}{[1 + (m/n)w]^{(m+n)/2}}, \quad w \ge 0,

the F density with m and n degrees of freedom.
5.4 Definitions and Calculations
The beta function is given by

\beta(a, b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx, \quad a, b > 0.

We will show that

\beta(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

which is consistent with our use of \beta(a, b) as an abbreviation in (5.2). We make the change of variable t = x^2 to get

\Gamma(a) = \int_0^{\infty} t^{a-1}e^{-t}\,dt = 2\int_0^{\infty} x^{2a-1}e^{-x^2}\,dx.

We now use the familiar trick of writing \Gamma(a)\Gamma(b) as a double integral and switching to polar coordinates. Thus

\Gamma(a)\Gamma(b) = 4\int_0^{\infty}\int_0^{\infty} x^{2a-1}y^{2b-1}e^{-(x^2+y^2)}\,dx\,dy = 4\int_0^{\pi/2}d\theta\int_0^{\infty}(\cos\theta)^{2a-1}(\sin\theta)^{2b-1}e^{-r^2}r^{2a+2b-1}\,dr.

The change of variable u = r^2 yields

\int_0^{\infty} r^{2a+2b-1}e^{-r^2}\,dr = \frac{1}{2}\int_0^{\infty} u^{a+b-1}e^{-u}\,du = \Gamma(a+b)/2.

Thus

\frac{\Gamma(a)\Gamma(b)}{2\Gamma(a+b)} = \int_0^{\pi/2}(\cos\theta)^{2a-1}(\sin\theta)^{2b-1}\,d\theta.

Let z = \cos^2\theta, \ 1 - z = \sin^2\theta, \ dz = -2\cos\theta\sin\theta\,d\theta = -2z^{1/2}(1-z)^{1/2}\,d\theta. The above integral becomes

-\frac{1}{2}\int_1^0 z^{a-1}(1-z)^{b-1}\,dz = \frac{1}{2}\int_0^1 z^{a-1}(1-z)^{b-1}\,dz = \frac{1}{2}\,\beta(a, b)

as claimed. The beta density is

f(x) = \frac{1}{\beta(a, b)}\,x^{a-1}(1-x)^{b-1}, \quad 0 \le x \le 1 \quad (a, b > 0).
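The identity \beta(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b) can be checked numerically by comparing a midpoint-rule evaluation of the defining integral with the standard library's math.gamma:

```python
import math

def beta_integral(a, b, steps=200_000):
    # midpoint-rule evaluation of the integral of x^{a-1} (1-x)^{b-1} over [0, 1]
    h = 1.0 / steps
    return sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
               for i in range(steps)) * h

for a, b in [(2.0, 3.0), (2.5, 1.5), (0.5, 0.5)]:
    exact = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    print(a, b, round(beta_integral(a, b), 4), round(exact, 4))
```

The case a = b = 1/2 gives \beta(1/2, 1/2) = \Gamma(1/2)^2/\Gamma(1) = \pi, tying this back to property (c) of the gamma function.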
Problems
1. Let X have the beta distribution with parameters a and b. Find the mean and variance of X.
2. Let T have the T distribution with 15 degrees of freedom. Find the value of c which makes P\{-c \le T \le c\} = .95.
3. Let W have the F distribution with m and n degrees of freedom (abbreviated W = F(m, n)). Find the distribution of 1/W.
4. A typical table of the F distribution gives the values of c for which P\{W \le c\} = .9, .95, .975 and .99. Explain how to find the values of c for which P\{W \le c\} = .1, .05, .025 and .01. (Use the result of Problem 3.)
5. Let X have the T distribution with n degrees of freedom (abbreviated X = T(n)). Show that T^2(n) = F(1, n); in other words, T^2 has an F distribution with 1 and n degrees of freedom.
6. If X has the exponential density e^{-x}, x \ge 0, show that 2X is \chi^2(2). Deduce that the quotient of two exponential random variables is F(2, 2).
Lecture 6. Order Statistics
6.1 The Multinomial Formula
Suppose we pick a letter from \{A, B, C\}, with P(A) = p_1 = .3, P(B) = p_2 = .5, P(C) = p_3 = .2. If we do this independently 10 times, we will find the probability that the resulting sequence contains exactly 4 A's, 3 B's and 3 C's.
The probability of AAAABBBCCC, in that order, is p_1^4 p_2^3 p_3^3. To generate all favorable cases, select 4 positions out of 10 for the A's, then 3 positions out of the remaining 6 for the B's. The positions for the C's are then determined. One possibility is BCAABACCAB. The number of favorable cases is

\binom{10}{4}\binom{6}{3} = \frac{10!}{4!\,6!}\,\frac{6!}{3!\,3!} = \frac{10!}{4!\,3!\,3!}.

Therefore the probability of exactly 4 A's, 3 B's and 3 C's is

\frac{10!}{4!\,3!\,3!}\,(.3)^4(.5)^3(.2)^3.
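The arithmetic can be carried out directly:

```python
import math

# number of arrangements of 4 A's, 3 B's and 3 C's, and the probability above
arrangements = math.factorial(10) // (math.factorial(4) * math.factorial(3) * math.factorial(3))
p = arrangements * 0.3 ** 4 * 0.5 ** 3 * 0.2 ** 3
print(arrangements, round(p, 5))  # 4200 arrangements, probability about 0.03402
```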
In general, consider n independent trials such that on each trial, the result is exactly one of the events A_1, \dots, A_r, with probabilities p_1, \dots, p_r respectively. Then the probability that A_1 occurs exactly n_1 times, \dots, A_r occurs exactly n_r times, is

p_1^{n_1}\cdots p_r^{n_r}\binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3}\cdots\binom{n - n_1 - \cdots - n_{r-1}}{n_r}

which reduces to the multinomial formula

\frac{n!}{n_1!\cdots n_r!}\,p_1^{n_1}\cdots p_r^{n_r}

where the p_i are nonnegative real numbers that sum to 1, and the n_i are nonnegative integers that sum to n.
Now let X_1, \dots, X_n be iid, each with density f(x) and distribution function F(x). Let Y_1 < Y_2 < \cdots < Y_n be the X_i arranged in increasing order; Y_k is the k-th order statistic. (Since the distribution is continuous, ties occur with probability zero.) First consider the minimum Y_1. Since Y_1 > x if and only if X_i > x for every i, we have, by independence,

P\{Y_1 > x\} = \prod_{i=1}^n P\{X_i > x\} = [1 - F(x)]^n.
Therefore

F_{Y_1}(x) = 1 - [1 - F(x)]^n \quad \text{and} \quad f_{Y_1}(x) = n[1 - F(x)]^{n-1}f(x).

We compute f_{Y_k}(x) by asking how it can happen that x \le Y_k \le x + dx (see Figure 6.1). There must be k-1 random variables less than x, one random variable between x and x + dx, and n-k random variables greater than x. (We are taking dx so small that the probability that more than one random variable falls in [x, x + dx] is negligible, and P\{X_i > x\} is essentially the same as P\{X_i > x + dx\}. Not everyone is comfortable with this reasoning, but the intuition is very strong and can be made precise.) By the multinomial formula,

f_{Y_k}(x)\,dx = \frac{n!}{(k-1)!\,1!\,(n-k)!}\,[F(x)]^{k-1}f(x)\,dx\,[1 - F(x)]^{n-k}

so

f_{Y_k}(x) = \frac{n!}{(k-1)!\,1!\,(n-k)!}\,[F(x)]^{k-1}[1 - F(x)]^{n-k}f(x).

Similar reasoning (see Figure 6.2) allows us to write down the joint density f_{Y_jY_k}(x, y) of Y_j and Y_k for j < k, namely

\frac{n!}{(j-1)!\,(k-j-1)!\,(n-k)!}\,[F(x)]^{j-1}[F(y) - F(x)]^{k-j-1}[1 - F(y)]^{n-k}f(x)f(y)

for x < y, and 0 elsewhere. [We drop the term 1! (= 1), which we retained for emphasis in the formula for f_{Y_k}(x).]
[Figure 6.1: k-1 observations below x, one in [x, x + dx], n-k above. Figure 6.2: j-1 observations below x, one in [x, x + dx], k-j-1 between x + dx and y, one in [y, y + dy], n-k above.]
Problems
1. Let Y_1 < Y_2 < Y_3 be the order statistics of X_1, X_2 and X_3, where the X_i are uniformly distributed between 0 and 1. Find the density of Z = Y_3 - Y_1.
2. The formulas derived in this lecture assume that we are in the continuous case (the distribution function F is continuous). The formulas do not apply if the X_i are discrete. Why not?
3. Consider order statistics where the Xi, i = 1, . . . , n, are uniformly distributed between 0 and 1. Show that Yk has a beta distribution, and express the parameters α and β in terms of k and n.
4. In Problem 3, let 0 < p < 1, and express P{Yk > p} as the probability of an event associated with a sequence of n Bernoulli trials with probability of success p on a given trial. Write P{Yk > p} as a finite sum involving n, p and k.
Lecture 7. The Weak Law of Large Numbers
7.1 Chebyshev's Inequality

(a) If X ≥ 0 and a > 0, then P{X ≥ a} ≤ E(X)/a.
(b) If X is an arbitrary random variable, c any real number, and ε > 0, m > 0, then P{|X − c| ≥ ε} ≤ E(|X − c|^m)/ε^m.
(c) If X has finite mean μ and finite variance σ², then P{|X − μ| ≥ kσ} ≤ 1/k².

This is a universal bound, but it may be quite weak in a specific case. For example, if X is normal (μ, σ²), abbreviated N(μ, σ²), then

P{|X − μ| ≥ 1.96σ} = P{|N(0, 1)| ≥ 1.96} = 2(1 − Φ(1.96)) = .05

where Φ is the distribution function of a normal (0,1) random variable. But the Chebyshev bound is 1/(1.96)² ≈ .26.
Proof. (a) If X has density f, then

E(X) = ∫_0^∞ x f(x) dx = ∫_0^a x f(x) dx + ∫_a^∞ x f(x) dx

so

E(X) ≥ 0 + a ∫_a^∞ f(x) dx = a P{X ≥ a}.

(b) P{|X − c| ≥ ε} = P{|X − c|^m ≥ ε^m} ≤ E(|X − c|^m)/ε^m by (a).
(c) By (b) with c = μ, ε = kσ, m = 2, we have

P{|X − μ| ≥ kσ} ≤ E[(X − μ)²]/(k²σ²) = 1/k².
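The normal comparison above can be reproduced in a few lines (an illustrative sketch, with Φ written via the error function):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

k = 1.96
exact = 2.0 * (1.0 - Phi(k))      # P{|N(0,1)| >= 1.96}, about .05
chebyshev = 1.0 / k ** 2          # the universal bound from (c)
print(round(exact, 3), round(chebyshev, 2))
```

The exact probability is about .05 while the Chebyshev bound is .26, illustrating how conservative the universal bound can be.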
7.2 Weak Law of Large Numbers
Let X1, . . . , Xn be iid with finite mean μ and finite variance σ². For large n, the arithmetic average of the observations is very likely to be very close to the true mean μ. Formally, if Sn = X1 + · · · + Xn, then for any ε > 0,

P{|Sn/n − μ| ≥ ε} → 0 as n → ∞.

Proof.

P{|Sn/n − μ| ≥ ε} = P{|Sn − nμ| ≥ nε} ≤ E[(Sn − nμ)²]/(n²ε²)

by Chebyshev (b). The term on the right is

Var Sn/(n²ε²) = nσ²/(n²ε²) = σ²/(nε²) → 0.
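The theorem can be watched in action. A small simulation sketch (uniform (0,1) observations, so μ = .5 and σ² = 1/12; the sample sizes and ε = .05 are arbitrary choices, not from the lecture) estimates P{|Sn/n − μ| ≥ ε} for increasing n:

```python
import random

random.seed(1)

def sample_mean(n):
    # Sn/n for n iid uniform (0,1) observations; mu = 0.5
    return sum(random.random() for _ in range(n)) / n

eps, reps = 0.05, 500
freqs = {}
for n in (10, 100, 2000):
    hits = sum(abs(sample_mean(n) - 0.5) >= eps for _ in range(reps))
    freqs[n] = hits / reps          # estimate of P{|Sn/n - mu| >= eps}
print(freqs)
```

The estimated probabilities shrink toward 0 as n grows, as the weak law predicts; the Chebyshev bound σ²/(nε²) shrinks at rate 1/n.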
7.3 Bernoulli Trials
Let Xi = 1 if there is a success on trial i, and Xi = 0 if there is a failure. Thus Xi is the indicator of a success on trial i, often written as I[Success on trial i]. Then Sn/n is the relative frequency of success, and for large n, this is very likely to be very close to the true probability p of success.
7.4 Definitions and Comments
The convergence illustrated by the weak law of large numbers is called convergence in probability. Explicitly, Sn/n converges in probability to μ. In general, Xn →P X means that for every ε > 0, P{|Xn − X| ≥ ε} → 0 as n → ∞. Thus for large n, Xn is very likely to be very close to X. If Xn converges in probability to X, then Xn converges to X in distribution: if Fn is the distribution function of Xn and F is the distribution function of X, then Fn(x) → F(x) at every x where F is continuous. To see that the continuity requirement is needed, look at Figure 7.1. In this example, Xn is uniformly distributed between 0 and 1/n, and X is identically 0. We have Xn →P 0 because P{|Xn| ≥ ε} is actually 0 for large n. However, Fn(x) → F(x) for x ≠ 0, but not at x = 0.

To prove that convergence in probability implies convergence in distribution:

Fn(x) = P{Xn ≤ x} = P{Xn ≤ x, X > x + ε} + P{Xn ≤ x, X ≤ x + ε} ≤ P{|Xn − X| ≥ ε} + F(x + ε);

F(x − ε) = P{X ≤ x − ε} = P{X ≤ x − ε, Xn > x} + P{X ≤ x − ε, Xn ≤ x} ≤ P{|Xn − X| ≥ ε} + Fn(x).

Therefore

F(x − ε) − P{|Xn − X| ≥ ε} ≤ Fn(x) ≤ P{|Xn − X| ≥ ε} + F(x + ε).

Since Xn converges in probability to X, we have P{|Xn − X| ≥ ε} → 0 as n → ∞. If F is continuous at x, then F(x − ε) and F(x + ε) approach F(x) as ε → 0. Thus Fn(x) is boxed between two quantities that can be made arbitrarily close to F(x), so Fn(x) → F(x).
7.5 Some Sufficient Conditions
In practice, P{|Xn − X| ≥ ε} may be difficult to compute, and it is useful to have sufficient conditions for convergence in probability that can often be easily checked.

(1) If E[(Xn − X)²] → 0 as n → ∞, then Xn →P X.
(2) If E(Xn) → E(X) and Var(Xn − X) → 0, then Xn →P X.

Proof. The first statement follows from Chebyshev (b):

P{|Xn − X| ≥ ε} ≤ E[(Xn − X)²]/ε² → 0.
To prove (2), note that

E[(Xn − X)²] = Var(Xn − X) + [E(Xn) − E(X)]² → 0.

In this result, if X is identically equal to a constant c, then Var(Xn − X) is simply Var Xn. Condition (2) then becomes E(Xn) → c and Var Xn → 0, which implies that Xn converges in probability to c.

7.6 An Application

In normal sampling, let Sn² be the sample variance based on n observations. Let's show that Sn² is a consistent estimate of the true variance σ², that is, Sn² →P σ². Since nSn²/σ² is χ²(n − 1), we have E(nSn²/σ²) = n − 1 and Var(nSn²/σ²) = 2(n − 1). Thus E(Sn²) = (n − 1)σ²/n → σ² and Var(Sn²) = 2(n − 1)σ⁴/n² → 0, and the result follows.
[Figure 7.1: Fn(x) rises linearly from 0 at x = 0 to 1 at x = 1/n; the limiting distribution function F(x) jumps from 0 to 1 at x = 0. Thus Fn(x) → F(x) everywhere except at x = 0, where Fn(0) = 0 but F(0) = 1.]
Problems
1. Let X1, . . . , Xn be independent, not necessarily identically distributed random variables. Assume that the Xi have finite means μi and finite variances σi², and the variances are uniformly bounded, i.e., for some positive number M we have σi² ≤ M for all i. Show that (Sn − E(Sn))/n converges in probability to 0. This is a generalization of the weak law of large numbers. For if μi = μ and σi² = σ² for all i, then E(Sn) = nμ, so (Sn/n) − μ →P 0, i.e., Sn/n →P μ.
2. Toss an unbiased coin once. If heads, write down the sequence 10101010 . . . , and if tails, write down the sequence 01010101 . . . . If Xn is the n-th term of the sequence and X = X1, show that Xn converges to X in distribution but not in probability.
3. Let X1, . . . , Xn be iid with finite mean μ and finite variance σ². Let X̄n be the sample mean (X1 + · · · + Xn)/n. Find the limiting distribution of X̄n, i.e., find a random variable X such that X̄n →d X.
4. Let Xn be uniformly distributed between n and n + 1. Show that Xn does not have a limiting distribution. Intuitively, the probability has run away to infinity.
Lecture 8. The Central Limit Theorem
Intuitively, any random variable that can be regarded as the sum of a large number of small independent components is approximately normal. To formalize, we need the following result, stated without proof.

8.1 Theorem

If Yn has moment-generating function Mn, Y has moment-generating function M, and Mn(t) → M(t) as n → ∞ for all t in some open interval containing the origin, then Yn →d Y.

8.2 Central Limit Theorem

Let X1, X2, . . . be iid, each with finite mean μ, finite variance σ², and moment-generating function M. Then

Yn = (Σ_{i=1}^n Xi − nμ)/(σ√n)

converges in distribution to a random variable that is normal (0,1). Thus for large n, Σ_{i=1}^n Xi is approximately normal.
We will give an informal sketch of the proof. The numerator of Yn is Σ_{i=1}^n (Xi − μ), and the random variables Xi − μ are iid with mean 0 and variance σ². Thus we may assume without loss of generality that μ = 0. We have

M_Yn(t) = E[e^{tYn}] = E[exp((t/(σ√n)) Σ_{i=1}^n Xi)].

The moment-generating function of Σ_{i=1}^n Xi is [M(t)]^n, so

M_Yn(t) = [M(t/(σ√n))]^n.

Now if the density of the Xi is f(x), then

M(t/(σ√n)) = ∫ exp(tx/(σ√n)) f(x) dx

= ∫ [1 + tx/(σ√n) + t²x²/(2! nσ²) + t³x³/(3! n^{3/2} σ³) + · · · ] f(x) dx

= 1 + 0 + t²/(2n) + t³μ₃/(3! n^{3/2} σ³) + t⁴μ₄/(4! n² σ⁴) + · · ·

where μk = E[(Xi)^k]. If we neglect the terms after t²/2n we have, approximately,

M_Yn(t) = (1 + t²/(2n))^n
which approaches the normal (0,1) moment-generating function e^{t²/2} as n → ∞. This argument is very loose but it can be made precise by some estimates based on Taylor's formula with remainder.

We proved that if Xn converges in probability to X, then Xn converges in distribution to X. There is a partial converse.
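A simulation sketch makes the theorem concrete. With exponential summands (mean μ = 1, σ = 1, an arbitrary choice not tied to the lecture), the standardized sum Yn already tracks Φ closely at n = 50:

```python
import random
from math import erf, sqrt

random.seed(2)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Exponential(1) summands: mu = 1, sigma = 1, so Yn = (sum - n) / sqrt(n)
n, reps = 50, 20_000
count = 0
for _ in range(reps):
    s = sum(random.expovariate(1.0) for _ in range(n))
    if (s - n) / sqrt(n) <= 1.0:
        count += 1
empirical = count / reps
print(empirical, Phi(1.0))   # both near 0.84
```

Even though the summands are quite skewed, the empirical value of P{Yn ≤ 1} is close to Φ(1) ≈ .8413.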
8.3 Theorem
If Xn converges in distribution to a constant c, then Xn converges in probability to c.

Proof. We estimate the probability that |Xn − c| ≥ ε, as follows.

P{|Xn − c| ≥ ε} = P{Xn ≥ c + ε} + P{Xn ≤ c − ε} = 1 − P{Xn < c + ε} + P{Xn ≤ c − ε}.

Now P{Xn ≤ c + (ε/2)} ≤ P{Xn < c + ε}, so

P{|Xn − c| ≥ ε} ≤ 1 − Fn(c + (ε/2)) + Fn(c − ε)

where Fn is the distribution function of Xn. But as long as x ≠ c, Fn(x) converges to the distribution function of the constant c, so Fn(x) → 1 if x > c, and Fn(x) → 0 if x < c. Therefore P{|Xn − c| ≥ ε} → 1 − 1 + 0 = 0 as n → ∞.
8.4 Remarks
If Y is binomial (n, p), the normal approximation to the binomial allows us to regard Y as approximately normal with mean np and variance npq (with q = 1 − p). According to Box, Hunter and Hunter, Statistics for Experimenters, page 130, the approximation works well in practice if n > 5 and

(1/√n) |√(q/p) − √(p/q)| < .3.

If, for example, we wish to estimate the probability that Y = 50 or 51 or 52, we may write this probability as P{49.5 < Y < 52.5}, and then evaluate as if Y were normal with mean np and variance np(1 − p). This turns out to be slightly more accurate in practice than using P{50 ≤ Y ≤ 52}.
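The continuity correction is easy to check numerically. The sketch below uses hypothetical values n = 100 and p = .5 (the lecture leaves n and p unspecified) and compares the exact binomial probability of {50, 51, 52} with the corrected and uncorrected normal approximations:

```python
from math import comb, erf, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p = 100, 0.5            # hypothetical values, not from the text
mu, sd = n * p, sqrt(n * p * (1 - p))

exact = sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in (50, 51, 52))
corrected = Phi((52.5 - mu) / sd) - Phi((49.5 - mu) / sd)      # P{49.5 < Y < 52.5}
uncorrected = Phi((52 - mu) / sd) - Phi((50 - mu) / sd)        # no correction
print(round(exact, 4), round(corrected, 4), round(uncorrected, 4))
```

With these numbers the corrected approximation agrees with the exact probability to about three decimal places, while the uncorrected version is off substantially.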
8.5 Simulation
Most computers can simulate a random variable that is uniformly distributed between 0 and 1. But what if we need a random variable with an arbitrary distribution function F? For example, how would we simulate the random variable with the distribution function of Figure 8.1? The basic idea is illustrated in Figure 8.2. If Y = F(X) where X has the
continuous distribution function F, then Y is uniformly distributed on [0,1]. (In Figure 8.2 we have, for 0 ≤ y ≤ 1, P{Y ≤ y} = P{X ≤ x} = F(x) = y.)

Thus if X is uniformly distributed on [0,1] and we want Y to have distribution function F, we set X = F(Y), Y = F⁻¹(X).

In Figure 8.1 we must be more precise:

Case 1. 0 ≤ X ≤ .3. Let X = (3/70)Y + (15/70), Y = (70X − 15)/3.
Case 2. .3 ≤ X ≤ .8. Let Y = 4, so P{Y = 4} = .5 as required.
Case 3. .8 ≤ X ≤ 1. Let X = (1/10)Y + (4/10), Y = 10X − 4.

In Figure 8.1, replace the F(y)-axis by an x-axis to visualize X versus Y. If y = y₀ corresponds to x = x₀ [i.e., x₀ = F(y₀)], then

P{Y ≤ y₀} = P{X ≤ x₀} = x₀ = F(y₀)

as desired.
[Figure 8.1: the distribution function F(y), with F(y) = (3y + 15)/70 for −5 ≤ y ≤ 2 (rising from 0 to .3), a jump from .3 to .8 at y = 4, and F(y) = (y + 4)/10 for 4 ≤ y ≤ 6 (rising from .8 to 1).]

[Figure 8.2: the curve Y = F(X); a point x on the horizontal axis corresponds to y = F(x) in [0,1] on the vertical axis.]
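The three cases can be coded directly. The sketch below inverts the distribution function of Figure 8.1 and checks that the atom at Y = 4 receives probability about .5:

```python
import random

def simulate_from_figure_8_1(u):
    """Invert the distribution function of Figure 8.1 at the uniform value u."""
    if u <= 0.3:                 # Case 1: linear piece F(y) = (3y + 15)/70 on [-5, 2]
        return (70 * u - 15) / 3
    if u <= 0.8:                 # Case 2: the jump at y = 4 carries probability .5
        return 4.0
    return 10 * u - 4            # Case 3: linear piece F(y) = (y + 4)/10 on [4, 6]

random.seed(3)
draws = [simulate_from_figure_8_1(random.random()) for _ in range(100_000)]
prop_at_4 = sum(1 for y in draws if y == 4.0) / len(draws)
print(round(prop_at_4, 2))      # should be near P{Y = 4} = .5
```

All simulated values land in [−5, 6], and the flat part of F between .3 and .8 shows up as a point mass at 4, exactly as Case 2 requires.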
Problems
1. Let Xn be gamma (n, β), i.e., Xn has the gamma distribution with parameters n and β. Show that Xn is a sum of n independent exponential random variables, and from this derive the limiting distribution of Xn/n.
2. Show that χ²(n) is approximately normal for large n (with mean n and variance 2n).
3. Let X1, . . . , Xn be iid with density f. Let Yn be the number of observations that fall into the interval (a, b). Indicate how to use a normal approximation to calculate probabilities involving Yn.
4. If we have 3 observations 6.45, 3.14, 4.93, and we round off to the nearest integer, we get 6, 3, 5. The sum of the integers is 14, but the actual sum is 14.52. Let Xi, i = 1, . . . , n be the round-off error of the i-th observation, and assume that the Xi are iid and uniformly distributed on (−1/2, 1/2). Indicate how to use a normal approximation to calculate probabilities involving the total round-off error Yn = Σ_{i=1}^n Xi.
5. Let X1, . . . , Xn be iid with continuous distribution function F, and let Y1 < · · · < Yn be the order statistics of the Xi. Then F(X1), . . . , F(Xn) are iid and uniformly distributed on [0,1] (see the discussion of simulation), with order statistics F(Y1), . . . , F(Yn). Show that n(1 − F(Yn)) converges in distribution to an exponential random variable.
Lecture 9. Estimation
9.1 Introduction
In effect the statistician plays a game against nature, who first chooses the state of nature θ (a number or k-tuple of numbers in the usual case) and performs a random experiment. We do not know θ but we are allowed to observe the value of a random variable (or random vector) X, called the observable, with density f_θ(x).

After observing X = x we estimate θ by θ̂(x), which is called a point estimate because it produces a single number which we hope is close to θ. The main alternative is an interval estimate or confidence interval, which will be discussed in Lectures 10 and 11.

For a point estimate θ̂(x) to make sense physically, it must depend only on x, not on the unknown parameter θ. There are many possible estimates, and there are no general rules for choosing a best estimate. Some practical considerations are:

(a) How much does it cost to collect the data?
(b) Is the performance of the estimate easy to measure, for example, can we compute P{|θ̂(x) − θ| < ε}?
(c) Are the advantages of the estimate appropriate for the problem at hand?
We will study several estimation methods:
1. Maximum likelihood estimates.
These estimates usually have highly desirable theoretical properties (consistency), and are frequently not difficult to compute.

2. Confidence intervals.

These estimates have a very useful practical feature. We construct an interval from the data, and we will know the probability that our (random) interval actually contains the unknown (but fixed) parameter.

3. Uniformly minimum variance unbiased estimates (UMVUEs).

Mathematical theory generates a large number of examples of these, but as we know, a biased estimate can sometimes be superior.

4. Bayes estimates.

These estimates are appropriate if it is reasonable to assume that the state of nature θ is a random variable with a known density.

In general, statistical theory produces many reasonable candidates, and practical experience will dictate the choice in a given physical situation.

9.2 Maximum Likelihood Estimates

We choose θ̂(x) = θ̂, a value of θ that makes what we have observed as likely as possible. In other words, let θ̂ maximize the likelihood function L(θ) = f_θ(x), with x fixed. This corresponds to basic statistical philosophy; if what we have observed is more likely under θ₂ than under θ₁, we prefer θ₂ to θ₁.
9.3 Example
Let X be binomial (n, θ). Then the probability that X = x when the true parameter is θ is

f_θ(x) = C(n, x) θ^x (1 − θ)^(n−x), x = 0, 1, . . . , n.

Maximizing f_θ(x) is equivalent to maximizing ln f_θ(x):

(∂/∂θ) ln f_θ(x) = (∂/∂θ)[x ln θ + (n − x) ln(1 − θ)] = x/θ − (n − x)/(1 − θ) = 0.

Thus x − xθ − nθ + xθ = 0, so θ̂ = X/n, the relative frequency of success.

Notation: θ̂ will be written in terms of random variables, in this case X/n rather than x/n. Thus θ̂ is itself a random variable.

We have E(θ̂) = nθ/n = θ, so θ̂ is unbiased. By the weak law of large numbers, θ̂ →P θ, i.e., θ̂ is consistent.
9.4 Example
Let X1, . . . , Xn be iid, normal (μ, σ²), θ = (μ, σ²). Then, with x = (x1, . . . , xn),

f_θ(x) = (1/(σ√(2π)))^n exp[−Σ_{i=1}^n (xi − μ)²/(2σ²)]

and

ln f_θ(x) = −(n/2) ln 2π − n ln σ − (1/(2σ²)) Σ_{i=1}^n (xi − μ)²;

(∂/∂μ) ln f_θ(x) = (1/σ²) Σ_{i=1}^n (xi − μ) = 0, Σ_{i=1}^n xi − nμ = 0, μ̂ = x̄;

(∂/∂σ) ln f_θ(x) = −(n/σ) + (1/σ³) Σ_{i=1}^n (xi − μ)² = (n/σ³)[−σ² + (1/n) Σ_{i=1}^n (xi − μ)²] = 0

with μ = x̄. Thus

σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)² = s².

Case 1. μ and σ are both unknown. Then θ̂ = (X̄, S²).

Case 2. σ² is known. Then θ = μ and θ̂ = X̄ as above. (Differentiation with respect to σ is omitted.)
Case 3. μ is known. Then θ = σ² and the equation (∂/∂σ) ln f_θ(x) = 0 becomes

σ̂² = (1/n) Σ_{i=1}^n (xi − μ)²

so

θ̂ = (1/n) Σ_{i=1}^n (Xi − μ)².

The sample mean X̄ is an unbiased and (by the weak law of large numbers) consistent estimate of μ. The sample variance S² is a biased but consistent estimate of σ² (see Lectures 4 and 7).
Notation: We will abbreviate maximum likelihood estimate by MLE.
9.5 The MLE of a Function of Theta
Suppose that for a fixed x, f_θ(x) is a maximum when θ = θ₀. Then the value of θ² when f_θ(x) is a maximum is θ₀². Thus to get the MLE of θ², we simply square the MLE of θ. In general, if h is any function, then the MLE of h(θ) is h(θ̂). If h is continuous, then consistency is preserved, in other words:

If h is continuous and θ̂ →P θ, then h(θ̂) →P h(θ).

Proof. Given ε > 0, there exists δ > 0 such that if |θ̂ − θ| < δ, then |h(θ̂) − h(θ)| < ε. Consequently,

P{|h(θ̂) − h(θ)| ≥ ε} ≤ P{|θ̂ − θ| ≥ δ} → 0 as n → ∞.

(To justify the above inequality, note that if the occurrence of an event A implies the occurrence of an event B, then P(A) ≤ P(B).)
9.6 The Method of Moments
This is sometimes a quick way to obtain reasonable estimates. We set the observed k-th moment n⁻¹ Σ_{i=1}^n xi^k equal to the theoretical k-th moment E(Xi^k) (which will depend on the unknown parameter θ). Or we set the observed k-th central moment n⁻¹ Σ_{i=1}^n (xi − x̄)^k equal to the theoretical k-th central moment E[(Xi − μ)^k]. For example, let X1, . . . , Xn be iid, gamma with α = θ₁, β = θ₂, with θ₁, θ₂ > 0. Then E(Xi) = μ = θ₁θ₂ and Var Xi = σ² = θ₁θ₂² (see Lecture 3). We set

X̄ = θ₁θ₂, S² = θ₁θ₂²

and solve to get estimates θ̂i of θi, i = 1, 2, namely

θ̂₂ = S²/X̄, θ̂₁ = X̄/θ̂₂ = X̄²/S².
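The gamma example can be coded in a few lines; the sketch below draws a large sample at assumed true values θ₁ = 3, θ₂ = 2 (chosen only for illustration) and recovers them from the sample moments:

```python
import random

def method_of_moments_gamma(xs):
    """Estimate (theta1, theta2) for a gamma(theta1, theta2) sample by matching
    the sample mean Xbar = theta1*theta2 and sample variance S^2 = theta1*theta2^2."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    theta2_hat = s2 / xbar
    theta1_hat = xbar / theta2_hat        # equals xbar**2 / s2
    return theta1_hat, theta2_hat

random.seed(4)
theta1, theta2 = 3.0, 2.0                 # hypothetical true parameters
xs = [random.gammavariate(theta1, theta2) for _ in range(50_000)]
t1, t2 = method_of_moments_gamma(xs)
print(round(t1, 1), round(t2, 1))         # estimates near 3 and 2
```

With 50,000 observations the moment estimates land close to the true values, consistent with the weak law of large numbers.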
Problems
1. In this problem, X1, . . . , Xn are iid with density f_θ(x) or probability function p_θ(x), and you are asked to find the MLE of θ.

(a) Poisson (θ), θ > 0.
(b) f_θ(x) = θx^(θ−1), 0 < x < 1, where θ > 0. The probability is concentrated near the origin when θ < 1.
(c) Exponential with parameter θ, i.e., f_θ(x) = (1/θ)e^(−x/θ), x > 0, where θ > 0.
(d) f_θ(x) = (1/2)e^(−|x−θ|), where θ and x are arbitrary real numbers.
(e) Translated exponential, i.e., f_θ(x) = e^(−(x−θ)), where θ is an arbitrary real number and x ≥ θ.

2. Let X1, . . . , Xn be iid, each uniformly distributed between θ − (1/2) and θ + (1/2). Find more than one MLE of θ (so MLEs are not necessarily unique).
3. In each part of Problem 1, calculate E(Xi) and derive an estimate based on the method of moments by setting the sample mean equal to the true mean. In each case, show that the estimate is consistent.
4. Let X be exponential with parameter θ, as in Problem 1(c). If r > 0, find the MLE of P{X > r}.
5. If X is binomial (n, θ) and a and b are integers with 0 ≤ a ≤ b ≤ n, find the MLE of P{a ≤ X ≤ b}.
Lecture 10. Confidence Intervals
10.1 Predicting an Election
There are two candidates A and B. If a voter is selected at random, the probability that the voter favors A is p, where p is fixed but unknown. We select n voters independently and ask their preference.

The number Yn of A voters is binomial (n, p), which (for sufficiently large n) is approximately normal with μ = np and σ² = np(1 − p). The relative frequency of A voters is Yn/n. We wish to estimate the minimum value of n such that we can predict A's percentage of the vote within 1 percent, with 95 percent confidence. Thus we want

P{|Yn/n − p| < .01} > .95.

Note that |(Yn/n) − p| < .01 means that p is within .01 of Yn/n. So this inequality can be written as

(Yn/n) − .01 < p < (Yn/n) + .01.

Thus the probability that the random interval In = ((Yn/n) − .01, (Yn/n) + .01) contains the true probability p is greater than .95. We say that In is a 95 percent confidence interval for p.
In general, we find confidence intervals by calculating or estimating the probability of the event that is to occur with the desired level of confidence. In this case,

P{|Yn/n − p| < .01} = P{|Yn − np| < .01n} = P{|Yn − np|/√(np(1 − p)) < .01√n/√(p(1 − p))}

and this is approximately

Φ(.01√n/√(p(1 − p))) − Φ(−.01√n/√(p(1 − p))) = 2Φ(.01√n/√(p(1 − p))) − 1 > .95

where Φ is the normal (0,1) distribution function. Since 1.95/2 = .975 and Φ(1.96) = .975, we have

.01√n/√(p(1 − p)) > 1.96, n > (196)² p(1 − p).

But (by calculus) p(1 − p) is maximized when 1 − 2p = 0, p = 1/2, p(1 − p) = 1/4. Thus n > (196)²/4 = (98)² = (100 − 2)² = 10000 − 400 + 4 = 9604.
If we want to get within one tenth of one percent (.001) of p with 99 percent confidence, we repeat the above analysis with .01 replaced by .001, 1.99/2 = .995 and Φ(2.6) ≈ .995. Thus

.001√n/√(p(1 − p)) > 2.6, n > (2600)²/4 = (1300)² = 1,690,000.
To get within 3 percent with 95 percent confidence, we have

.03√n/√(p(1 − p)) > 1.96, n > (196/3)²(1/4) ≈ 1067.

If the experiment is repeated independently a large number of times, it is very likely that our result will be within .03 of the true probability p at least 95 percent of the time. The usual statement "The margin of error of this poll is ±3%" does not capture this idea.

Note that the accuracy of the prediction depends only on the number of voters polled and not on the total number of voters in the population. But the model assumes sampling with replacement. (Theoretically, the same voter can be polled more than once since the voters are selected independently.) In practice, sampling is done without replacement, but if the number n of voters polled is small relative to the population size N, the error is very small.

The normal approximation to the binomial (based on the central limit theorem) is quite reliable, and is used in practice even for modest values of n; see (8.4).
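The worst-case sample-size calculation can be packaged as a small function (a sketch using the inverse normal distribution function from Python's statistics module; with exact quantiles the 3-percent case gives 1068 rather than the 1067 obtained above with the rounded value 1.96):

```python
from math import ceil
from statistics import NormalDist

def poll_sample_size(margin, confidence):
    """Smallest n with 2*Phi(margin*sqrt(n)/sqrt(p(1-p))) - 1 > confidence,
    using the worst case p(1-p) = 1/4."""
    b = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95 percent
    return ceil((b / (2 * margin)) ** 2)

print(poll_sample_size(0.01, 0.95))   # 9604, matching the text
print(poll_sample_size(0.03, 0.95))   # 1068
```

Halving the margin quadruples the required sample size, since n grows like 1/margin².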
10.2 Estimating the Mean of a Normal Population
Let X1, . . . , Xn be iid, each normal (μ, σ²). We will find a confidence interval for μ.

Case 1. The variance σ² is known. Then X̄ is normal (μ, σ²/n), so

(X̄ − μ)/(σ/√n) is normal (0,1),

hence

P{−b < √n (X̄ − μ)/σ < b} = Φ(b) − Φ(−b) = 2Φ(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bσ/√n < μ < X̄ + bσ/√n.

We choose a symmetrical interval to minimize the length, because the normal density with zero mean is symmetric about 0. The desired confidence level determines b, which then determines the confidence interval.

Case 2. The variance σ² is unknown. Recall from (5.1) that

(X̄ − μ)/(S/√(n − 1)) is T(n − 1),

hence

P{−b < (X̄ − μ)/(S/√(n − 1)) < b} = 2F_T(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bS/√(n − 1) < μ < X̄ + bS/√(n − 1).
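Case 1 translates directly into code. The sketch below (hypothetical data and a hypothetical known σ = .2, neither from the lecture) computes the symmetric 95 percent interval X̄ ± bσ/√n with 2Φ(b) − 1 = .95:

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_known_sigma(xs, sigma, confidence=0.95):
    """Case 1 interval: Xbar +/- b*sigma/sqrt(n), where 2*Phi(b) - 1 = confidence."""
    n = len(xs)
    xbar = sum(xs) / n
    b = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95 percent
    half = b * sigma / sqrt(n)
    return xbar - half, xbar + half

lo, hi = mean_ci_known_sigma([4.9, 5.3, 5.1, 4.7, 5.0], sigma=0.2)
print(round(lo, 3), round(hi, 3))
```

For these made-up observations (X̄ = 5.0) the interval is about (4.825, 5.175); its length 2bσ/√n depends only on σ, n and the confidence level, not on the data.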
10.3 A Correction Factor When Sampling Without Replacement
The following results will not be used and may be omitted, but it is interesting to measure quantitatively the effect of sampling without replacement. In the election prediction problem, let Xi be the indicator of success (i.e., selecting an A voter) on trial i. Then P{Xi = 1} = p and P{Xi = 0} = 1 − p. If sampling is done with replacement, then the Xi are independent and the total number X = X1 + · · · + Xn of A voters in the sample is binomial (n, p). Thus the variance of X is np(1 − p). However, if sampling is done without replacement, then in effect we are drawing n balls from an urn containing N balls (where N is the size of the population), with Np balls labeled A and N(1 − p) labeled B. Recall from basic probability theory that

Var X = Σ_{i=1}^n Var Xi + 2 Σ_{i<j} Cov(Xi, Xj)
Problems
1. In the normal case [see (10.2)], assume that σ² is known. Explain how to compute the length of the confidence interval for μ.
2. Continuing Problem 1, assume that σ² is unknown. Explain how to compute the length of the confidence interval for μ, in terms of the sample standard deviation S.
3. Continuing Problem 2, explain how to compute the expected length of the confidence interval for μ, in terms of the unknown standard deviation σ. (Note that when σ is unknown, we expect a larger interval since we have less information.)
4. Let X1, . . . , Xn be iid, each gamma with parameters α and β. If α is known, explain how to compute a confidence interval for the mean μ = αβ.
5. In the binomial case [see (10.1)], suppose we specify the level of confidence and the length of the confidence interval. Explain how to compute the minimum value of n.
Lecture 11. More Confidence Intervals
11.1 Differences of Means
Let X1, . . . , Xn be iid, each normal (μ₁, σ²), and let Y1, . . . , Ym be iid, each normal (μ₂, σ²). Assume that (X1, . . . , Xn) and (Y1, . . . , Ym) are independent. We will construct a confidence interval for μ₁ − μ₂. In practice, the interval is often used in the following way. If the interval lies entirely to the left of 0, we have reason to believe that μ₁ < μ₂.

Since Var(X̄ − Ȳ) = Var X̄ + Var Ȳ = (σ²/n) + (σ²/m),

(X̄ − Ȳ − (μ₁ − μ₂))/(σ√(1/n + 1/m)) is normal (0,1).

Also, nS₁²/σ² is χ²(n − 1) and mS₂²/σ² is χ²(m − 1). But χ²(r) is the sum of squares of r independent normal (0,1) random variables, so

(nS₁²/σ²) + (mS₂²/σ²) is χ²(n + m − 2).

Thus if

R = √[ ((nS₁² + mS₂²)/(n + m − 2)) (1/n + 1/m) ]

then

T = (X̄ − Ȳ − (μ₁ − μ₂))/R is T(n + m − 2).

Our assumption that both populations have the same variance is crucial, because the unknown variance σ² can be cancelled.

If P{−b < T < b} = .95 we get a 95 percent confidence interval for μ₁ − μ₂:

−b < (X̄ − Ȳ − (μ₁ − μ₂))/R < b

or

(X̄ − Ȳ) − bR < μ₁ − μ₂ < (X̄ − Ȳ) + bR.

If the variances σ₁² and σ₂² are known but possibly unequal, then

(X̄ − Ȳ − (μ₁ − μ₂))/√(σ₁²/n + σ₂²/m) is normal (0,1).

If R₀ is the denominator of the above fraction, we can get a 95 percent confidence interval as before: Φ(b) − Φ(−b) = 2Φ(b) − 1 > .95,

(X̄ − Ȳ) − bR₀ < μ₁ − μ₂ < (X̄ − Ȳ) + bR₀.
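The statistic T and its degrees of freedom can be computed as follows (a sketch with made-up data; note the lecture's convention S² = (1/n)Σ(Xi − X̄)², so nS₁² is just the sum of squared deviations):

```python
from math import sqrt

def pooled_t_statistic(xs, ys, delta=0.0):
    """T = (Xbar - Ybar - delta)/R with R as defined above, using the
    convention n*S1^2 = sum of squared deviations about the sample mean."""
    n, m = len(xs), len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / m
    ns1 = sum((x - xbar) ** 2 for x in xs)          # n * S1^2
    ms2 = sum((y - ybar) ** 2 for y in ys)          # m * S2^2
    r = sqrt((ns1 + ms2) / (n + m - 2) * (1 / n + 1 / m))
    return (xbar - ybar - delta) / r, n + m - 2     # statistic, degrees of freedom

t, df = pooled_t_statistic([5.1, 4.8, 5.4, 5.0], [4.2, 4.6, 4.4])
print(round(t, 2), df)
```

Comparing t with the T(n + m − 2) critical value b then gives the confidence interval (X̄ − Ȳ) ± bR directly.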
11.2 Example
Let Y1 and Y2 be binomial (n₁, p₁) and (n₂, p₂) respectively. Then

Y1 = X1 + · · · + X_{n₁} and Y2 = Z1 + · · · + Z_{n₂}

where the Xi and Zj are indicators of success on trials i and j respectively. Assume that X1, . . . , X_{n₁}, Z1, . . . , Z_{n₂} are independent. Now E(Y1/n₁) = p₁ and Var(Y1/n₁) = n₁p₁(1 − p₁)/n₁² = p₁(1 − p₁)/n₁, with similar formulas for Y2/n₂. Thus for large n,

(Y1/n₁) − (Y2/n₂) − (p₁ − p₂)

divided by

√( p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂ )

is approximately normal (0,1). But this expression cannot be used to construct confidence intervals for p₁ − p₂ because the denominator involves the unknown quantities p₁ and p₂. However, Y1/n₁ converges in probability to p₁ and Y2/n₂ converges in probability to p₂, and this justifies replacing p₁ by Y1/n₁ and p₂ by Y2/n₂ in the denominator.
11.3 The Variance
We will construct confidence intervals for the variance of a normal population. Let X1, . . . , Xn be iid, each normal (μ, σ²), so that nS²/σ² is χ²(n − 1). If h_{n−1} is the χ²(n − 1) density and a and b are chosen so that ∫_a^b h_{n−1}(x) dx = 1 − α, then

P{a < nS²/σ² < b} = 1 − α.

But a < nS²/σ² < b is equivalent to nS²/b < σ² < nS²/a, which gives a confidence interval for σ² at level 1 − α. If the mean μ is known, then Σ_{i=1}^n (Xi − μ)²/σ² is χ²(n),
so if

W = Σ_{i=1}^n (Xi − μ)²

and we choose a and b so that ∫_a^b h_n(x) dx = 1 − α, then P{a < W/σ² < b} = 1 − α. The inequality defining the confidence interval can be written as

W/b < σ² < W/a.

Lecture 12. Hypothesis Testing

12.1 Introduction

The observable X has density f_θ(x), where the unknown parameter θ lies in one of two disjoint sets A0 and A1. We test the null hypothesis H0: θ ∈ A0 against the alternative H1: θ ∈ A1. For example, θ might be the average improvement produced by a new drug, and θ > 0 might mean that the drug is a significant improvement.
We observe x and make a decision via φ(x) = 0 or 1. There are two types of errors. A type 1 error occurs if H0 is true but φ(x) = 1, in other words, we declare that H1 is true. Thus in a type 1 error, we reject H0 when it is true.

A type 2 error occurs if H0 is false but φ(x) = 0, i.e., we declare that H0 is true. Thus in a type 2 error, we accept H0 when it is false.

If H0 [resp. H1] means that a patient does not have [resp. does have] a particular disease, then a type 1 error is also called a false positive, and a type 2 error is also called a false negative.

If φ(x) is always 0, then a type 1 error can never occur, but a type 2 error will always occur. Symmetrically, if φ(x) is always 1, then there will always be a type 1 error, but never an error of type 2. Thus by ignoring the data altogether we can reduce one of the error probabilities to zero. To get both error probabilities to be small, in practice we must increase the sample size.

We say that H0 [resp. H1] is simple if A0 [resp. A1] contains only one element, composite if A0 [resp. A1] contains more than one element. So in the case of simple hypothesis vs. simple alternative, we are testing θ = θ₀ vs. θ = θ₁. The standard example is to test the hypothesis that X has density f₀ vs. the alternative that X has density f₁.
12.2 Likelihood Ratio Tests
In the case of simple hypothesis vs. simple alternative, if we require that the probability of a type 1 error be at most α and try to minimize the probability of a type 2 error, the optimal test turns out to be a likelihood ratio test (LRT), defined as follows. Let L(x), the likelihood ratio, be f₁(x)/f₀(x), and let λ be a nonnegative constant. If L(x) > λ, reject H0; if L(x) < λ, accept H0; if L(x) = λ, do anything.

Intuitively, if what we have observed seems significantly more likely under H1, we will tend to reject H0. If H0 or H1 is composite, there is no general optimality result as there is in the simple vs. simple case. In this situation, we resort to basic statistical philosophy: if, assuming that H0 is true, we witness a rare event, we tend to reject H0.

The statement that LRTs are optimal is the Neyman-Pearson lemma, to be proved at the end of the lecture. In many common examples (normal, Poisson, binomial, exponential), L(x1, . . . , xn) can be expressed as a function of the sum of the observations, or equivalently as a function of the sample mean. This motivates consideration of tests based on Σ_{i=1}^n Xi or on X̄.
12.3 Example
Let X1, . . . , Xn be iid, each normal (μ, σ²). We will test H0: μ ≤ μ₀ vs. H1: μ > μ₀. Under H1, X̄ will tend to be larger, so let's reject H0 when X̄ > c. The power function of the test is defined by

K(μ) = P_μ{reject H0},

the probability of rejecting the null hypothesis when the true parameter is μ. In this case,

P_μ{X̄ > c} = P{(X̄ − μ)/(σ/√n) > (c − μ)/(σ/√n)} = 1 − Φ((c − μ)/(σ/√n))

(see Figure 12.1). Suppose that we specify the probability α of a type 1 error when μ = μ₁, and the probability β of a type 2 error when μ = μ₂. Then

K(μ₁) = 1 − Φ((c − μ₁)/(σ/√n)) = α

and

K(μ₂) = 1 − Φ((c − μ₂)/(σ/√n)) = 1 − β.

If α, β, σ, μ₁ and μ₂ are known, we have two equations that can be solved for c and n.

[Figure 12.1: the power function K(μ), increasing from 0 to 1 as μ increases, with μ₀, μ₁ and μ₂ marked on the μ-axis.]

The critical region is the set of observations that lead to rejection. In this case, it is {(x1, . . . , xn) : n⁻¹ Σ_{i=1}^n xi > c}.

The significance level α is the largest type 1 error probability. Here it is K(μ₀), since K(μ) increases with μ.
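Solving the two equations for c and n is mechanical once Φ⁻¹ is available: subtracting them gives (μ₂ − μ₁)√n/σ = Φ⁻¹(1 − α) + Φ⁻¹(1 − β). A sketch with assumed numbers (σ = 1, μ₁ = 0, μ₂ = .5, α = .05, β = .10, all hypothetical):

```python
from math import ceil, sqrt
from statistics import NormalDist

def design_test(mu1, mu2, sigma, alpha, beta):
    """Solve K(mu1) = alpha and K(mu2) = 1 - beta for the sample size n
    and the cutoff c of the test 'reject H0 when Xbar > c'."""
    za = NormalDist().inv_cdf(1 - alpha)     # (c - mu1)*sqrt(n)/sigma = za
    zb = NormalDist().inv_cdf(1 - beta)      # (c - mu2)*sqrt(n)/sigma = -zb
    n = ceil((sigma * (za + zb) / (mu2 - mu1)) ** 2)
    c = mu1 + za * sigma / sqrt(n)
    return n, c

n, c = design_test(0.0, 0.5, 1.0, 0.05, 0.10)
print(n, round(c, 3))
```

Rounding n up to an integer makes both error probabilities at most their specified values.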
12.4 Example
Let H0: X is uniformly distributed on (0,1), so f₀(x) = 1, 0 < x < 1, and 0 elsewhere. Let H1: f₁(x) = 3x², 0 < x < 1, and 0 elsewhere. We take only one observation, and reject H0 if x > c, where 0 < c < 1. Then

K(0) = P₀{X > c} = 1 − c, K(1) = P₁{X > c} = ∫_c^1 3x² dx = 1 − c³.
If we specify the probability α of a type 1 error, then α = 1 − c, which determines c. If β is the probability of a type 2 error, then 1 − β = 1 − c³, so β = c³. Thus (see Figure 12.2)

β = (1 − α)³.

If α = .05 then β = (.95)³ ≈ .86, which indicates that you usually can't do too well with only one observation.

[Figure 12.2: the curve β = (1 − α)³, decreasing from (0, 1) to (1, 0) in the unit square.]
12.5 Tests Derived From Confidence Intervals
Let X1, . . . , Xn be iid, each normal (μ, σ²). In Lecture 10, we found a confidence interval for μ, with σ² unknown, via

P{−b < (X̄ − μ)/(S/√(n − 1)) < b} = 2F_T(b) − 1, where T = (X̄ − μ)/(S/√(n − 1))

has the T distribution with n − 1 degrees of freedom.

Say 2F_T(b) − 1 = .95, so that

P{|X̄ − μ₀|/(S/√(n − 1)) ≥ b} = .05 when μ = μ₀.

If μ actually equals μ₀, we are witnessing an event of low probability. So it is natural to test μ = μ₀ vs. μ ≠ μ₀ by rejecting if

|X̄ − μ₀|/(S/√(n − 1)) ≥ b,

in other words, if μ₀ does not belong to the confidence interval. As the true mean μ moves away from μ₀ in either direction, the probability of this event will increase, since X̄ − μ₀ = (X̄ − μ) + (μ − μ₀).

Tests of μ = μ₀ vs. μ ≠ μ₀ are called two-sided, as opposed to μ = μ₀ vs. μ > μ₀ (or μ = μ₀ vs. μ < μ₀), which are one-sided. In the present case, if we test μ = μ₀ vs. μ > μ₀, we reject if

(X̄ − μ₀)/(S/√(n − 1)) ≥ b.

The power function K(μ) is difficult to compute for μ ≠ μ₀, because (X̄ − μ₀)/(σ/√n) no longer has mean zero. The noncentral T distribution becomes involved.
12.6 The Neyman-Pearson Lemma
Assume that we are testing the simple hypothesis that X has density f₀ vs. the simple alternative that X has density f₁. Let φ be an LRT with parameter λ (a nonnegative constant), in other words, φ(x) is the probability of rejecting H0 when x is observed, and

φ(x) = 1 if L(x) > λ; φ(x) = 0 if L(x) < λ; φ(x) = anything if L(x) = λ.

Suppose that the probability of a type 1 error using φ is α, and the probability of a type 2 error is β. Let φ* be an arbitrary test with error probabilities α* and β*. If α* ≤ α then β* ≥ β. In other words, the LRT has maximum power among all tests at significance level at most α.

Proof. We are going to assume that f₀ and f₁ are one-dimensional, but the argument works equally well when X = (X1, . . . , Xn) and the fi are n-dimensional joint densities. We recall from basic probability theory the theorem of total probability, which says that if X has density f, then for any event A,

P(A) = ∫ P(A | X = x) f(x) dx.

A companion theorem which we will also use later is the theorem of total expectation, which says that if X has density f, then for any random variable Y,

E(Y) = ∫ E(Y | X = x) f(x) dx.

By the theorem of total probability,

α = ∫ φ(x) f₀(x) dx, 1 − β = ∫ φ(x) f₁(x) dx

and similarly

α* = ∫ φ*(x) f₀(x) dx, 1 − β* = ∫ φ*(x) f₁(x) dx.

We claim that for all x,

[φ(x) − φ*(x)][f₁(x) − λf₀(x)] ≥ 0.

For if f₁(x) > λf₀(x) then L(x) > λ, so φ(x) = 1 ≥ φ*(x), and if f₁(x) < λf₀(x) then L(x) < λ, so φ(x) = 0 ≤ φ*(x), proving the assertion. Now if a function is always nonnegative, its integral must be nonnegative, so

∫ [φ(x) − φ*(x)][f₁(x) − λf₀(x)] dx ≥ 0.
The terms involving f₀ translate to statements about type 1 errors, and the terms involving f₁ translate to statements about type 2 errors. Thus

(1 − β) − (1 − β*) − λ(α − α*) ≥ 0,

which says that β* − β ≥ λ(α − α*) ≥ 0, completing the proof.
12.7 Randomization
If L(x) = λ, then "do anything" means that randomization is possible, e.g., we can flip a possibly biased coin to decide whether or not to accept H0. (This may be significant in the discrete case, where L(x) = λ may have positive probability.) Statisticians tend to frown on this practice because two statisticians can look at exactly the same data and come to different conclusions. It is possible to adjust the significance level (by replacing "do anything" by a definite choice of either H0 or H1) to avoid randomization.
Problems

1. Consider the problem of testing μ = μ0 vs. μ > μ0, where μ is the mean of a normal population with known variance. Assume that the sample size n is fixed. Show that the test given in Example 12.3 (reject H0 if X̄ > c) is uniformly most powerful. In other words, if we test μ = μ0 vs. μ = μ1 for any given μ1 > μ0, and we specify the probability of a type 1 error, then the probability of a type 2 error is minimized.

2. It is desired to test the null hypothesis that a die is unbiased vs. the alternative that the die is loaded, with faces 1 and 2 having probability 1/4 and faces 3, 4, 5 and 6 having probability 1/8. The die is to be tossed once. Find a most powerful test at level α = .1, and find the type 2 error probability β.

3. We wish to test a binomial random variable X with n = 400 and H0: p = 1/2 vs. H1: p > 1/2. The random variable Y = (X − np)/√(np(1 − p)) = (X − 200)/10 is approximately normal (0, 1), and we will reject H0 if Y > c. If we specify α = .05, then c = 1.645. Thus the critical region is X > 216.45. Suppose the actual result is X = 220, so that H0 is rejected. Find the minimum value of α (sometimes called the p-value) for which the given data lead to the opposite conclusion (acceptance of H0).
Lecture 13. Chi-Square Tests
13.1 Introduction
Let X1, . . . , Xk be multinomial, i.e., Xi is the number of occurrences of the event Ai in n generalized Bernoulli trials (Lecture 6). Then

P{X1 = n1, . . . , Xk = nk} = [n!/(n1! · · · nk!)] p1^n1 · · · pk^nk
where the ni are nonnegative integers whose sum is n. Consider k = 2. Then X1 is binomial (n, p1) and (X1 − np1)/√(np1(1 − p1)) is approximately normal (0, 1). Consequently, the random variable (X1 − np1)²/np1(1 − p1) is approximately χ²(1). But

(X1 − np1)²/np1(1 − p1) = [(X1 − np1)²/n][1/p1 + 1/(1 − p1)] = (X1 − np1)²/np1 + (X2 − np2)²/np2.

(Note that since k = 2 we have p2 = 1 − p1 and X1 − np1 = n − X2 − np1 = np2 − X2 = −(X2 − np2), and the outer minus sign disappears when squaring.) Therefore [(X1 − np1)²/np1] + [(X2 − np2)²/np2] is approximately χ²(1). More generally, it can be shown that
Q = Σ_{i=1}^k (Xi − npi)²/npi is approximately χ²(k − 1),

where

(Xi − npi)²/npi = (observed frequency − expected frequency)²/(expected frequency).
We will consider three types of chi-square tests.
13.2 Goodness of Fit
We ask whether X has a specified distribution (normal, Poisson, etc.). The null hypothesis is that the multinomial probabilities are p = (p1, . . . , pk), and the alternative is that p ≠ (p1, . . . , pk).

Suppose that P{χ²(k − 1) > c} is the desired level of significance (for example, .05). If Q > c we will reject H0. The idea is that if H0 is in fact true, we have witnessed a rare event, so rejection is reasonable. If H0 is false, it is reasonable to expect that some of the Xi will be far from npi, so Q will be large.

Some practical considerations: Take n large enough so that each npi ≥ 5. Each time a parameter is estimated from the sample, reduce the number of degrees of freedom by 1. (A typical case: the null hypothesis is that X is Poisson (λ), but the mean λ is unknown, and is estimated by the sample mean.)
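As a sketch of the mechanics (the data below are made up for illustration), Q is just the sum of (observed − expected)²/expected over the k cells, compared with a χ²(k − 1) critical value:

```python
def chi_square_stat(observed, expected):
    """Q = sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 100 observations, H0 says p = (.5, .3, .2).
n = 100
p = [0.5, 0.3, 0.2]
observed = [52, 28, 20]
expected = [n * pi for pi in p]   # each n*pi >= 5, as recommended

Q = chi_square_stat(observed, expected)
# k - 1 = 2 degrees of freedom; P{chi2(2) > 5.991} = .05,
# so we reject H0 at the .05 level exactly when Q > 5.991.
reject = Q > 5.991
```

Here Q = (52 − 50)²/50 + (28 − 30)²/30 + (20 − 20)²/20 ≈ .21, far below the critical value, so these hypothetical data would not lead to rejection.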
13.3 Equality of Distributions
We ask whether two or more samples come from the same underlying distribution. The observed results are displayed in a contingency table. This is an h × k matrix whose rows are the samples and whose columns are the attributes to be observed. For example, row i might be (7, 11, 15, 13, 4), with the interpretation that in a class of 50 students taught by method of instruction i, there were 7 grades of A, 11 of B, 15 of C, 13 of D and 4 of F. The null hypothesis H0 is that there is no difference between the various methods of instruction, i.e., P(A) is the same for each group, and similarly for the probabilities of the other grades. We estimate P(A) from the sample by adding all entries in column A and dividing by the total number of observations in the entire experiment. We estimate P(B), P(C), P(D) and P(F) in a similar fashion. The expected frequencies in row i are found by multiplying the grade probabilities by the number of entries in row i.
If there are h groups (samples), each with k attributes, then each group generates a chi-square (k − 1), and k − 1 probabilities are estimated from the sample (the last probability is determined). The number of degrees of freedom is h(k − 1) − (k − 1) = (h − 1)(k − 1), call it r. If P{χ²(r) > c} is the desired significance level, we reject H0 if the chi-square statistic is greater than c.
13.4 Testing For Independence
Again we have a contingency table with h rows corresponding to the possible values xi of a random variable X, and k columns corresponding to the possible values yj of a random variable Y. We are testing the null hypothesis that X and Y are independent.
Let Ri be the sum of the entries in row i, and let Cj be the sum of the entries in column j. Then the sum of all observations is T = Σi Ri = Σj Cj. We estimate P{X = xi} by Ri/T, and P{Y = yj} by Cj/T. Under the independence hypothesis H0, P{X = xi, Y = yj} = P{X = xi} P{Y = yj} = RiCj/T². Thus the expected frequency of (xi, yj) is RiCj/T. (This gives another way to calculate the expected frequencies in (13.3). In that case, we estimated the j-th column probability by Cj/T, and multiplied by the sum of the entries in row i, namely Ri.)
In an h × k contingency table, the number of degrees of freedom is hk − 1 minus the number of estimated parameters:

hk − 1 − (h − 1 + k − 1) = hk − h − k + 1 = (h − 1)(k − 1).

The chi-square statistic is calculated as in (13.3). Similarly, if there are 3 attributes to be tested for independence and we form an h × k × m contingency table, the number of degrees of freedom is

hkm − 1 − [(h − 1) + (k − 1) + (m − 1)] = hkm − h − k − m + 2.
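The whole computation for an h × k table can be sketched as follows, using the expected frequencies RiCj/T and (h − 1)(k − 1) degrees of freedom derived above (the function name and the sample table are mine, chosen for illustration):

```python
def contingency_chi_square(table):
    """Chi-square statistic and degrees of freedom for an h x k
    contingency table, with expected frequencies R_i * C_j / T."""
    h, k = len(table), len(table[0])
    R = [sum(row) for row in table]          # row totals
    C = [sum(col) for col in zip(*table)]    # column totals
    T = sum(R)                               # grand total
    Q = sum((table[i][j] - R[i] * C[j] / T) ** 2 / (R[i] * C[j] / T)
            for i in range(h) for j in range(k))
    return Q, (h - 1) * (k - 1)

# Example with a hypothetical 2 x 2 table.
Q, df = contingency_chi_square([[10, 20], [30, 40]])
```

The same function serves for the equality-of-distributions test of (13.3), since the expected frequencies agree.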
Problems
1. Use a chi-square procedure to test the null hypothesis that a random variable X has the following distribution:

P{X = 1} = .5,  P{X = 2} = .3,  P{X = 3} = .2
We take 100 independent observations of X, and it is observed that 1 occurs 40 times, 2 occurs 33 times, and 3 occurs 27 times. Determine whether or not we will reject the null hypothesis at significance level .05.
2. Use a chi-square test to decide (at significance level .05) whether the two samples corresponding to the rows of the contingency table below came from the same underlying distribution.

              A    B    C
   Sample 1  33  147  114
   Sample 2  67  153   86
3. Suppose we are testing for independence in a 2 × 2 contingency table

   a  b
   c  d

Show that the chi-square statistic is

(ad − bc)²(a + b + c + d) / [(a + b)(c + d)(a + c)(b + d)]

(The number of degrees of freedom is 1 × 1 = 1.)
Lecture 14. Sufficient Statistics
14.1 Definitions and Comments
Let X1, . . . , Xn be iid with P{Xi = 1} = θ and P{Xi = 0} = 1 − θ, so P{Xi = x} = θ^x (1 − θ)^(1−x), x = 0, 1. Let Y be a statistic for θ, i.e., a function of the observables X1, . . . , Xn. In this case we take Y = X1 + · · · + Xn, the total number of successes in n Bernoulli trials with probability of success θ on a given trial.

We claim that the conditional distribution of X1, . . . , Xn given Y is free of θ, in other words, does not depend on θ. We say that Y is sufficient for θ.

To prove this, note that

P{X1 = x1, . . . , Xn = xn | Y = y} = P{X1 = x1, . . . , Xn = xn, Y = y} / P{Y = y}.

This is 0 unless y = x1 + · · · + xn, in which case we get (writing C(n, y) for the binomial coefficient "n choose y")

θ^y (1 − θ)^(n−y) / [C(n, y) θ^y (1 − θ)^(n−y)] = 1/C(n, y).

For example, if we know that there were 3 heads in 5 tosses, the probability that the actual tosses were HTHHT is 1/C(5, 3) = 1/10.
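The cancellation above can be confirmed numerically (the helper below is illustrative): for a fixed sequence with y successes, the conditional probability works out to 1/C(n, y) no matter what θ is.

```python
from math import comb

def cond_prob(seq, theta):
    """P{X1=x1,...,Xn=xn | Y=y} for iid Bernoulli(theta), where y = sum(seq)."""
    n, y = len(seq), sum(seq)
    p_seq = theta**y * (1 - theta)**(n - y)               # P{X = seq} (implies Y = y)
    p_y = comb(n, y) * theta**y * (1 - theta)**(n - y)    # P{Y = y}
    return p_seq / p_y

seq = (1, 0, 1, 1, 0)   # HTHHT: 3 heads in 5 tosses
for theta in (0.2, 0.5, 0.9):
    # Free of theta: always 1/C(5,3) = 1/10.
    assert abs(cond_prob(seq, theta) - 1 / comb(5, 3)) < 1e-12
```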
14.2 The Key Idea
For the purpose of making a statistical decision, we can ignore the individual random variables Xi and base the decision entirely on X1 + · · · + Xn.
Suppose that statistician A observes X1, . . . , Xn and makes a decision. Statistician B observes X1 + · · · + Xn only, and constructs X′1, . . . , X′n according to the conditional distribution of X1, . . . , Xn given Y, i.e.,

P{X′1 = x1, . . . , X′n = xn | Y = y} = 1/C(n, y).

This construction is possible because the conditional distribution does not depend on the unknown parameter θ. We will show that under θ, (X′1, . . . , X′n) and (X1, . . . , Xn) have exactly the same distribution, so anything A can do, B can do at least as well, even though B has less information.

Given x1, . . . , xn, let y = x1 + · · · + xn. The only way we can have X′1 = x1, . . . , X′n = xn is if Y = y and then B's experiment produces X′1 = x1, . . . , X′n = xn given y. Thus

P{X′1 = x1, . . . , X′n = xn} = P{Y = y} P{X′1 = x1, . . . , X′n = xn | Y = y}

= C(n, y) θ^y (1 − θ)^(n−y) [1/C(n, y)] = θ^y (1 − θ)^(n−y) = P{X1 = x1, . . . , Xn = xn}.
14.3 The Factorization Theorem
Let Y = u(X) be a statistic for θ (X can be (X1, . . . , Xn), and usually is). Then Y is sufficient for θ if and only if the density fθ(x) of X under θ can be factored as fθ(x) = g(θ, u(x)) h(x).

[In the Bernoulli case, fθ(x1, . . . , xn) = θ^y (1 − θ)^(n−y) where y = u(x) = Σ_{i=1}^n xi and h(x) = 1.]

Proof. (Discrete case). If Y is sufficient, then

Pθ{X = x} = Pθ{X = x, Y = u(x)} = Pθ{Y = u(x)} Pθ{X = x | Y = u(x)} = g(θ, u(x)) h(x).
Conversely, assume fθ(x) = g(θ, u(x)) h(x). Then

Pθ{X = x | Y = y} = Pθ{X = x, Y = y} / Pθ{Y = y}.

This is 0 unless y = u(x), in which case it becomes

Pθ{X = x} / Pθ{Y = y} = g(θ, u(x)) h(x) / Σ_{z: u(z)=y} g(θ, u(z)) h(z).

The g terms in both numerator and denominator are g(θ, y), which can be cancelled to obtain

Pθ{X = x | Y = y} = h(x) / Σ_{z: u(z)=y} h(z),

which is free of θ.
14.4 Example
Let X1, . . . , Xn be iid, each normal (μ, σ²), so that

f(x1, . . . , xn) = [1/(σ√(2π))]^n exp[−(1/2σ²) Σ_{i=1}^n (xi − μ)²].

Take θ = (μ, σ²) and let x̄ = n⁻¹ Σ_{i=1}^n xi, s² = n⁻¹ Σ_{i=1}^n (xi − x̄)². Then

xi − x̄ = (xi − μ) − (x̄ − μ)

and

s² = (1/n)[Σ_{i=1}^n (xi − μ)² − 2(x̄ − μ) Σ_{i=1}^n (xi − μ) + n(x̄ − μ)²].
Thus

s² = (1/n) Σ_{i=1}^n (xi − μ)² − (x̄ − μ)².

The joint density is given by

f(x1, . . . , xn) = (2πσ²)^(−n/2) e^(−ns²/2σ²) e^(−n(x̄−μ)²/2σ²).

If μ and σ² are both unknown then (X̄, S²) is sufficient (take h(x) = 1). If σ² is known then we can take h(x) = (2πσ²)^(−n/2) e^(−ns²/2σ²), θ = μ, and X̄ is sufficient. If μ is known then (with h(x) = 1) θ = σ² and Σ_{i=1}^n (Xi − μ)² is sufficient.
Problems

In Problems 1-6, show that the given statistic u(X) = u(X1, . . . , Xn) is sufficient for θ and find appropriate functions g and h for the factorization theorem to apply.

1. The Xi are Poisson (θ) and u(X) = X1 + · · · + Xn.

2. The Xi have density A(θ)B(xi), 0 < xi < θ (and 0 elsewhere), where θ is a positive real number; u(X) = max Xi. As a special case, the Xi are uniformly distributed between 0 and θ, and A(θ) = 1/θ, B(xi) = 1 on (0, θ).

3. The Xi are geometric with parameter θ, i.e., if θ is the probability of success on a given Bernoulli trial, then P{Xi = x} = (1 − θ)^x θ is the probability that there will be x failures followed by the first success; u(X) = Σ_{i=1}^n Xi.

4. The Xi have the exponential density (1/θ)e^(−x/θ), x > 0, and u(X) = Σ_{i=1}^n Xi.

5. The Xi have the beta density with parameters a = θ and b = 2, and u(X) = Π_{i=1}^n Xi.

6. The Xi have the gamma density with parameters α = θ and β, an arbitrary positive number, and u(X) = Π_{i=1}^n Xi.
7. Show that the result in (14.2), that statistician B can do at least as well as statistician A, holds in the general case of arbitrary iid random variables Xi.
Lecture 15. Rao-Blackwell Theorem
15.1 Background From Basic Probability
To better understand the steps leading to the Rao-Blackwell theorem, consider a typical two stage experiment:

Step 1. Observe a random variable X with density (1/2)x²e^(−x), x > 0.

Step 2. If X = x, let Y be uniformly distributed on (0, x).

Find E(Y).
Method 1, via the joint density:

f(x, y) = fX(x) fY(y|x) = (1/2)x²e^(−x) (1/x) = (1/2)xe^(−x), 0 < y < x.

In general, E[g(X, Y)] = ∫∫ g(x, y) f(x, y) dx dy. In this case, g(x, y) = y and

E(Y) = ∫_{x=0}^∞ ∫_{y=0}^x y (1/2)xe^(−x) dy dx = ∫_0^∞ (x³/4)e^(−x) dx = 3!/4 = 3/2.
Method 2, via the theorem of total expectation:

E(Y) = ∫ fX(x) E(Y|X = x) dx.

Method 2 works well when the conditional expectation is easy to compute. In this case it is x/2 by inspection. Thus

E(Y) = ∫_0^∞ (1/2)x²e^(−x) (x/2) dx = 3/2 as before.
15.2 Comment On Notation
If, for example, it turns out that E(Y|X = x) = x² + 3x + 4, we can write E(Y|X) = X² + 3X + 4. Thus E(Y|X) is a function g(X) of the random variable X. When X = x we have g(x) = E(Y|X = x).
We now proceed to the Rao-Blackwell theorem via several preliminary lemmas.
15.3 Lemma
E[E(X2|X1)] = E(X2).
Proof. Let g(X1) = E(X2|X1). Then

E[g(X1)] = ∫ g(x)f1(x) dx = ∫ E(X2|X1 = x) f1(x) dx = E(X2)
by the theorem of total expectation.
15.4 Lemma
If μi = E(Xi), i = 1, 2, then E[{X2 − E(X2|X1)}{E(X2|X1) − μ2}] = 0.
Proof. The expectation is

∫∫ [x2 − E(X2|X1 = x1)][E(X2|X1 = x1) − μ2] f1(x1) f2(x2|x1) dx1 dx2

= ∫ f1(x1)[E(X2|X1 = x1) − μ2] { ∫ [x2 − E(X2|X1 = x1)] f2(x2|x1) dx2 } dx1.

The inner integral (with respect to x2) is E(X2|X1 = x1) − E(X2|X1 = x1) = 0, and the result follows.
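Both lemmas can be verified mechanically on a small discrete joint distribution (the pmf below is made up for illustration); in the discrete case the integrals become sums over the support.

```python
# Hypothetical joint pmf for (X1, X2); probabilities sum to 1.
pmf = {(0, 1): 0.1, (0, 2): 0.2, (1, 1): 0.3, (1, 3): 0.4}

def marginal1(x1):
    return sum(p for (a, _), p in pmf.items() if a == x1)

def cond_exp_X2(x1):
    """g(x1) = E(X2 | X1 = x1)."""
    return sum(b * p for (a, b), p in pmf.items() if a == x1) / marginal1(x1)

E_X2 = sum(b * p for (_, b), p in pmf.items())
mu2 = E_X2

# Lemma 15.3: E[E(X2|X1)] = E(X2).
lhs = sum(cond_exp_X2(x1) * marginal1(x1) for x1 in {0, 1})
assert abs(lhs - E_X2) < 1e-12

# Lemma 15.4: E[{X2 - E(X2|X1)}{E(X2|X1) - mu2}] = 0.
val = sum((b - cond_exp_X2(a)) * (cond_exp_X2(a) - mu2) * p
          for (a, b), p in pmf.items())
assert abs(val) < 1e-12
```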
15.5 Lemma
Var X2 ≥ Var[E(X2|X1)].

Proof. We have

Var X2 = E[(X2 − μ2)²] = E[({X2 − E(X2|X1)} + {E(X2|X1) − μ2})²]