
Mathematical Statistics

Motoya Machida

April 12, 2006

This material is designed to introduce basic concepts and fundamental theory of mathematical statistics. A review of basic concepts covers likelihood functions, sufficient statistics, and the exponential family of distributions. Point estimation is then discussed, including minimum variance unbiased estimates, the Cramér–Rao inequality, maximum likelihood estimates, and asymptotic theory. Topics in the general theory of statistical tests include the Neyman–Pearson theorem, uniformly most powerful tests, and likelihood ratio tests.

1 Point estimates

A random sample X1, ..., Xn is regarded as independent and identically distributed (iid) random variables governed by an underlying probability density function f(x; θ). The value θ represents a characteristic of this underlying distribution, and is called a parameter. Suppose, for example, that the underlying distribution is the normal distribution with parameter (µ, σ²). Then the values µ and σ² are the parameters. Since X1, ..., Xn are random variables, the sample mean X̄ also becomes a random variable. In general, a random variable u(X) constructed from the random vector X = (X1, ..., Xn) is called a statistic. For example, the sample mean X̄ is a statistic.

A point estimate is a statistic u(X) which serves as a "best guess" for the true value θ. Suppose that the underlying distribution is the normal distribution with parameter (µ, σ²). Then the sample mean X̄ is in some sense a best guess of the parameter µ.

Mean squared error. Let u(X) be a point estimate for θ. Then the functional R(θ, u) = E[(u(X) − θ)²] of u is called the mean squared error risk function. We can immediately observe that

R(θ, u) = Var(u(X)) + [E(u(X)) − θ]² = Var(u(X)) + [b(θ, u)]²,

where b(θ, u) = E(u(X)) − θ is called the bias of u(X).
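To make the decomposition concrete, here is a small simulation sketch (added for illustration; not part of the original notes) using the biased variance estimate u(X) = (1/n) ∑ (Xi − X̄)² under a normal model; the simulated risk should match Var(u(X)) + [b(θ, u)]².

    import numpy as np

    # Model: X_1, ..., X_n iid N(mu, sigma^2); estimate theta = sigma^2
    # with the (biased) estimator u(X) = (1/n) * sum (X_i - Xbar)^2.
    rng = np.random.default_rng(0)
    mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    u = samples.var(axis=1, ddof=0)           # ddof=0 gives the 1/n version

    mse = np.mean((u - sigma2) ** 2)          # R(theta, u) by simulation
    var = np.var(u)                           # Var(u(X))
    bias = np.mean(u) - sigma2                # b(theta, u)

    print(f"MSE          = {mse:.4f}")
    print(f"Var + bias^2 = {var + bias**2:.4f}")   # should agree with MSE
    print(f"bias         = {bias:.4f} (theory: {-sigma2/n:.4f})")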

2 Maximum likelihood estimate

Having observed a random sample (X1, ..., Xn) = (x1, ..., xn) from an underlying pdf f(x; θ), we can construct the likelihood function

L(θ, x) = ∏_{i=1}^n f(xi; θ),


and consider it as a function of θ. Then the maximum likelihood estimate (MLE) θ̂ is the value of θ which "maximizes" the likelihood function L(θ, x). It is usually easier to maximize the log-likelihood

ln L(θ, x) = ∑_{i=1}^n ln f(xi; θ).

2.1 Bernoulli trials

Let f(x; θ) = θ^x (1 − θ)^{1−x} be the Bernoulli frequency function with success probability θ. By solving the equation

∂ ln L(θ, x)/∂θ = (∑_{i=1}^n xi)(1/θ) − (n − ∑_{i=1}^n xi)(1/(1 − θ)) = 0,

we obtain θ* = (∑_{i=1}^n xi)/n, which maximizes ln L(θ, x). Therefore, θ̂ = X̄ is the MLE of θ.
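As a quick sanity check (an added illustration, not in the original notes), the sketch below maximizes the Bernoulli log-likelihood numerically and confirms that the maximizer agrees with the closed form X̄.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    x = rng.binomial(1, 0.3, size=50)        # Bernoulli(0.3) sample

    def neg_log_lik(theta):
        # - ln L(theta, x) for the Bernoulli model
        return -(x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta))

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(f"numerical MLE = {res.x:.6f}")
    print(f"closed form   = {x.mean():.6f}")   # theta-hat = X-bar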

2.2 Normal distribution

Let X1, ..., Xn be a random sample from a normal distribution with parameter (µ, σ²). Then the log-likelihood function of the parameter (µ, σ²) is given by

ln L(µ, σ²) = −(n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^n (xi − µ)² − (n/2) ln 2π.

By solving

∂ ln L(µ, σ²)/∂µ = (1/σ²) ∑_{i=1}^n (xi − µ) = 0;

∂ ln L(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) ∑_{i=1}^n (xi − µ)² = 0,

we can obtain the MLE's µ̂ and σ̂² as follows:

µ̂ = (1/n) ∑_{i=1}^n Xi = X̄;

σ̂² = (1/n) ∑_{i=1}^n (Xi − µ̂)².

Although the MLE σ̂² of σ² is consistent, it should be noted that this point estimate σ̂² is not an unbiased one.
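The bias is easy to see by simulation (an added sketch, not in the original notes): for normal samples, E[σ̂²] = ((n − 1)/n)σ², so the bias vanishes as n grows, consistent with the estimator being consistent.

    import numpy as np

    rng = np.random.default_rng(2)
    sigma2, reps = 4.0, 100_000

    for n in (5, 20, 100):
        samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
        mle = samples.var(axis=1, ddof=0)     # sigma-hat^2, the 1/n version
        print(f"n={n:3d}  E[mle] ~ {mle.mean():.3f}  theory {(n-1)/n*sigma2:.3f}")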

2.3 Poisson distribution

Let X1, ..., Xn be a random sample from a Poisson distribution with parameter λ. Then the log-likelihood function of the parameter λ is given by

ln L(λ) = (ln λ) ∑_{i=1}^n xi − nλ − ∑_{i=1}^n ln xi!.

By solving

d ln L(λ)/dλ = (1/λ) ∑_{i=1}^n xi − n = 0,


we can obtain the MLE

λ̂ = (1/n) ∑_{i=1}^n Xi = X̄.

2.4 Exercise

Let X1, ..., Xn be a random sample from each of the following density functions f(x; θ). Find the MLE θ̂ of θ.

1. f(x; θ) = (1 + θ) x^θ, 0 ≤ x ≤ 1, where θ > −1.

2. f(x; θ) = θ e^{−θx}, x ≥ 0, where θ > 0.

3. f(x; θ) = θ L^θ x^{−θ−1}, x ≥ L, where L > 0 is given and θ > 1 (Pareto distribution).

2.5 Solutions to exercise

1. ln L(θ) = (∑_{i=1}^n ln xi) θ + n ln(1 + θ). By solving d ln L(θ)/dθ = (∑_{i=1}^n ln xi) + n/(1 + θ) = 0, we obtain θ̂ = −n/(∑_{i=1}^n ln xi) − 1.

2. ln L(θ) = −(∑_{i=1}^n xi) θ + n ln θ. By solving d ln L(θ)/dθ = −(∑_{i=1}^n xi) + n/θ = 0, we obtain θ̂ = n/(∑_{i=1}^n xi).

3. ln L(θ) = −(∑_{i=1}^n ln xi)(θ + 1) + n ln θ + nθ ln L. By solving d ln L(θ)/dθ = −(∑_{i=1}^n ln xi) + n/θ + n ln L = 0, we obtain θ̂ = n/(∑_{i=1}^n ln xi − n ln L).

3 Properties of MLE

3.1 Consistency

One of the important attributes of a point estimate is unbiasedness. Since a statistic θ̂ is a random variable, we can consider the expectation E(θ̂) of θ̂. Then the point estimate θ̂ of θ is unbiased if it satisfies E(θ̂) = θ. In the case of the normal distribution, the MLE X̄ for µ is unbiased since E(X̄) = µ. However, the MLE σ̂² for σ² is not unbiased, since

E[(1/n) ∑_{i=1}^n (Xi − X̄)²] = ((n − 1)/n) σ².

Note that the point estimate θ̂ of θ also depends on the sample size n. We say that θ̂ is consistent if θ̂ converges in probability to θ as n → ∞. For example, the above MLE's X̄ and σ̂² are both consistent by the weak law of large numbers. In general, the MLE is consistent under appropriate conditions.

3.2 Invariance

Suppose that h(θ) is a one-to-one function of θ. Then it is clearly seen that θ̂ is the MLE for θ if and only if h(θ̂) is the MLE for h(θ). Even if h is not one-to-one, h(θ̂) can be viewed as the MLE which corresponds to the maximum likelihood.


3.3 Asymptotic normality

Suppose that θ̂ is the MLE and consistent, and that it satisfies ∂ ln L(θ̂, X)/∂θ = 0. By the Taylor expansion we have

∂ ln L(θ, X)/∂θ + (θ̂ − θ) ∂² ln L(θ, X)/∂θ² ≈ ∂ ln L(θ̂, X)/∂θ = 0.

Since θ̂ is close to θ by consistency, the approximation is valid. Furthermore, we can make the following observations:

1. The random variables ∂ ln f(Xi; θ)/∂θ are iid with mean 0 and variance I1(θ) = Var(∂ ln f(X1; θ)/∂θ). By the central limit theorem,

Zn = [∂ ln L(θ, X)/∂θ] / √(n I1(θ)) = [∑_{i=1}^n ∂ ln f(Xi; θ)/∂θ] / √(n I1(θ))

converges to N(0, 1) in distribution as n → ∞.

2. The random variables ∂² ln f(Xi; θ)/∂θ² are iid with mean −I1(θ) and finite variance. By the weak law of large numbers,

Wn = (1/n) ∂² ln L(θ, X)/∂θ² = (1/n) ∑_{i=1}^n ∂² ln f(Xi; θ)/∂θ²

converges to −I1(θ) in probability as n → ∞.

Together with Slutsky's theorem we can find that

√(n I1(θ)) (θ̂ − θ) ≈ Zn / [Wn / (−I1(θ))]

converges to N(0, 1) in distribution as n → ∞. Hence, the MLE θ̂ has "approximately" a normal distribution with mean θ and variance 1/(n I1(θ)) = 1/I(θ) if n is large, where I(θ) is the Fisher information for the random sample X. This suggests that

n × Var(θ̂) → 1/I1(θ) as n → ∞.

Then we call θ̂ asymptotically efficient (cf. Bickel and Doksum, "Mathematical Statistics," Chapter 4).
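As an added illustration (not in the original notes), the following simulation checks the asymptotic normality for the exponential model f(x; θ) = θe^{−θx} from Exercise 2.4, where θ̂ = n/∑Xi and I1(θ) = 1/θ²; the standardized quantity √(n I1(θ))(θ̂ − θ) should look approximately N(0, 1).

    import numpy as np

    rng = np.random.default_rng(3)
    theta, n, reps = 2.0, 200, 50_000

    x = rng.exponential(scale=1/theta, size=(reps, n))
    mle = 1.0 / x.mean(axis=1)                      # theta-hat = n / sum(X_i)
    z = np.sqrt(n * (1/theta**2)) * (mle - theta)   # sqrt(n I_1) * (theta-hat - theta)

    # If the asymptotics hold, z is approximately N(0, 1).
    print(f"mean ~ {z.mean():.3f} (expect ~0), std ~ {z.std():.3f} (expect ~1)")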

4 Confidence interval

Let X = (X1, ..., Xn) be a random sample from f(x; θ). Let u1(X) and u2(X) be statistics satisfying u1(X) ≤ u2(X). If

P(u1(X) < θ < u2(X)) = 1 − α for every θ,

then the random interval (u1(X), u2(X)) is called a confidence interval of level (1 − α).

4.1 Population mean

Let X1, ..., Xn be iid random variables from N(µ, σ²). The sample mean X̄ is an unbiased estimate of the parameter µ. Then the random variable

(X̄ − µ) / (S/√n)

has the t-distribution with (n − 1) degrees of freedom. Thus, by


using the critical point tα/2,n−1 we obtain

P(|X̄ − µ|/(S/√n) < tα/2,n−1) = P(X̄ − tα/2,n−1 S/√n < µ < X̄ + tα/2,n−1 S/√n) = 1 − α.

This implies that the parameter µ is in the interval

(X̄ − tα/2,n−1 S/√n, X̄ + tα/2,n−1 S/√n)

with probability (1 − α). The interval is also known as the t-interval.

Example. A random sample of n milk containers is selected, and their milk contents are weighed. The data

X1, ..., Xn    (1)

can be used to investigate the unknown population mean of the milk container weights. The random selection of the sample should ensure that the above data can be assumed to be iid. Suppose that we have calculated X̄ = 2.073 and S = 0.071 from the actual data with n = 30. Then by choosing α = 0.05, we have the critical point t0.025,29 = 2.045, and therefore obtain the confidence interval

(2.073 − 2.045 × 0.071/√30, 2.073 + 2.045 × 0.071/√30) = (2.046, 2.100)

of level 0.95 (or, of level 95%).

Even if the data (1) are not normally distributed, the central limit theorem says that the estimate X̄ is approximately distributed as N(µ, σ²/n). In either case it is sensible to use critical points from the t-distribution.
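The interval in this example is easy to reproduce with scipy (an added sketch, not part of the original notes):

    import numpy as np
    from scipy import stats

    xbar, s, n, alpha = 2.073, 0.071, 30, 0.05
    tcrit = stats.t.ppf(1 - alpha/2, df=n - 1)      # t_{0.025,29} ~ 2.045
    half = tcrit * s / np.sqrt(n)
    print(f"95% t-interval: ({xbar - half:.3f}, {xbar + half:.3f})")  # ~(2.046, 2.100)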

4.2 Population proportion

Let X1, ..., Xn be iid Bernoulli random variables with success probability p. The sample mean X̄ is the MLE of the parameter p, and is unbiased. By the central limit theorem, the random variable

(X̄ − p) / √(p(1 − p)/n)

has approximately the N(0, 1) distribution as n gets larger (at least np > 5 and n(1 − p) > 5 by rule of thumb). Here we define the critical point zα for the standard normal distribution by P(X > zα) = α with a standard normal random variable X. Thus, we have

P(|X̄ − p|/√(p(1 − p)/n) < zα/2) = P(X̄ − zα/2 √(p(1 − p)/n) < p < X̄ + zα/2 √(p(1 − p)/n)) ≈ 1 − α.

Here we can use √(X̄(1 − X̄)/n) as an estimate for √(p(1 − p)/n). (We will see later that it is the MLE via the invariance property, since X̄ is the MLE of p.) Together we obtain the confidence interval

(X̄ − zα/2 √(X̄(1 − X̄)/n), X̄ + zα/2 √(X̄(1 − X̄)/n))

of level (1 − α).

There is an alternative and possibly more accurate method to derive a confidence interval. Here we observe that

P(|X̄ − p|/√(p(1 − p)/n) < zα/2) = P((n + zα/2²) p² − 2(n X̄ + zα/2²/2) p + n X̄² ≤ 0) ≈ 1 − α.

This implies that the parameter p is in the interval (p−, p+) with probability (1 − α), where

p± = [n X̄ + zα/2²/2 ± zα/2 √(n X̄(1 − X̄) + zα/2²/4)] / (n + zα/2²).
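This second formula is commonly known as the Wilson score interval. The sketch below (added, not part of the original notes; the sample values X̄ = 0.3 and n = 50 are assumptions for illustration) compares the two intervals:

    import numpy as np
    from scipy import stats

    xbar, n, alpha = 0.3, 50, 0.05            # assumed example values
    z = stats.norm.ppf(1 - alpha/2)

    # Simple interval based on the estimated standard error
    half = z * np.sqrt(xbar * (1 - xbar) / n)
    print(f"simple: ({xbar - half:.3f}, {xbar + half:.3f})")

    # Alternative interval (p_-, p_+) from the quadratic above
    center = n * xbar + z**2 / 2
    spread = z * np.sqrt(n * xbar * (1 - xbar) + z**2 / 4)
    print(f"Wilson: ({(center - spread)/(n + z**2):.3f}, "
          f"{(center + spread)/(n + z**2):.3f})")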


5 Exploratory Data Analysis

The recorded data values x1, ..., xn are typically considered as the observed values of random variables X1, ..., Xn having a common probability distribution f(x). To judge the quality of data, it is useful to envisage a population from which the sample should be drawn. A random sample is chosen at random from the population to ensure that the sample is representative of the population. Once a data set has been collected, it is useful to find an informative way of presenting it. Graphical representations of data in various forms can be quite informative.

5.1 Relative frequency histogram

Given the number of observations fi, called the frequency, in the i-th interval, the height hi of the i-th rectangle above the i-th interval is given by

hi = fi / (n × (width of the i-th interval)).

When the widths of the intervals are chosen equal, the common width w is called the bandwidth, and the height hi becomes

hi = fi / (n × w).


5.2 Stem and leaf plot

This is much like a histogram, except that it portrays the data set itself.

Data set:
20.5 20.7 20.8 21.0 21.0 21.4
21.5 22.0 22.1 22.5 22.6 22.6
22.7 22.7 22.9 22.9 23.1 23.3
23.4 23.5 23.6 23.6 23.6 23.9
24.1 24.3 24.5 24.5 24.8 24.8
24.9 24.9 25.1 25.1 25.2 25.6
25.8 25.9 26.1 26.7

Stem-and-leaf:
20 | 578
21 | 0045
22 | 015667799
23 | 13456669
24 | 13558899
25 | 112689
26 | 17
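The plot above can be generated mechanically from the data; here is a small sketch (added, not part of the original notes) that reproduces it:

    from collections import defaultdict

    data = [20.5, 20.7, 20.8, 21.0, 21.0, 21.4, 21.5, 22.0, 22.1, 22.5,
            22.6, 22.6, 22.7, 22.7, 22.9, 22.9, 23.1, 23.3, 23.4, 23.5,
            23.6, 23.6, 23.6, 23.9, 24.1, 24.3, 24.5, 24.5, 24.8, 24.8,
            24.9, 24.9, 25.1, 25.1, 25.2, 25.6, 25.8, 25.9, 26.1, 26.7]

    stems = defaultdict(list)
    for v in sorted(data):
        stem, leaf = divmod(round(v * 10), 10)   # 22.7 -> stem 22, leaf 7
        stems[stem].append(str(leaf))

    for stem in sorted(stems):
        print(f"{stem} | {''.join(stems[stem])}")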

5.3 Boxplot

The sample median is the value of the "middle" data point. When there is an odd number of observations, the median is simply the middle value. For example, the median of 2, 4, and 7 is 4. When there is an even number of observations, the median is the mean of the two middle values. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7)/2 = 5.5. The 25th sample percentile is the value such that 25% of the observations take values smaller than it. Similarly, we can define the 50th percentile, the 75th percentile, and so on. Note that the 50th percentile is the median. We call the 25th percentile the lower sample quartile and the 75th percentile the upper sample quartile.

A box is drawn stretching from the lower sample quartile (the 25th percentile) to the upper sample quartile (the 75th percentile). The median is shown as a line across the box. Therefore, 1/4 of the distribution lies between this line and the right end of the box, and 1/4 of the distribution lies between this line and the left end of the box. Dotted lines, called "whiskers," stretch out from the ends of the box to the largest and smallest data values.


5.4 Outliers

Graphical presentations can be used to identify an "odd-looking" value which does not fit in with the rest of the data. Such a value is called an outlier. In many cases an outlier turns out to be a misrecorded data value, or represents some special condition that was not in effect when the data were collected.

[Figures: histogram and boxplot of the data set, showing one value far to the right of the rest.]

In the above histogram and boxplot, the value on the far right appears to be quite separate from the rest of the data, and can be considered an outlier.

6 Tests of statistical hypotheses

Suppose that a researcher is interested in whether a new drug works. The process of determining whether the outcome of the experiment points to "yes" or "no" is called hypothesis testing. A widely used formalization of this process is due to Neyman and Pearson. Our hypothesis is then the null hypothesis that the new drug has no effect. The null hypothesis is often the reverse of what we actually believe. Why? Because the researcher hopes to reject the hypothesis and announce that the new drug leads to significant improvements. If the hypothesis is not rejected, the researcher announces nothing and goes on to a new experiment.

6.1 Hypothesis testing of population mean

Hospital workers are subject to radiation exposure emanating from the skin of the patient. A researcher is interested in the plausibility of the statement that the population mean µ of the radiation level is µ0 (the researcher's hypothesis). Then the null hypothesis is

H0 : µ = µ0.

Lecture Note Page 10

Page 11: Mathematical Statistics - Tennessee Techmath.tntech.edu/ISR/Mathematical_Statistics/thispage.pdf · 2006-04-12 · Mathematical Statistics Motoya Machida April 12, 2006 This material

Mathematical Statistics

The "opposite" of the null hypothesis, called an alternative hypothesis, becomes

HA : µ ≠ µ0.

Thus, the hypothesis testing problem "H0 versus HA" is formed. The problem here is whether or not to reject H0 in favor of HA.

To assess this hypothesis, the radiation levels X1, ..., Xn are measured from n patients who had been injected with a radioactive tracer, and are assumed to be independent and normally distributed with mean µ. Under the null hypothesis, the random variable

T = (X̄ − µ0) / (S/√n)

has the t-distribution with (n − 1) degrees of freedom. Thus, we obtain the exact probability

P(|T| ≥ tα/2,n−1) = α.

When α is chosen to be a small value (0.05 or 0.01, for example), it is unlikely that the absolute value |T| is larger than the critical point tα/2,n−1. Then we say that the null hypothesis H0 is rejected with significance level α (or size α) when the observed value t of T satisfies |t| > tα/2,n−1.

Example. We have µ0 = 5.4 for the hypothesis, and decided to run a test with significance level α = 0.05. Suppose that we have obtained X̄ = 5.145 and S = 0.7524 from the actual data with n = 28. Then we can compute

T = (5.145 − 5.4) / (0.7524/√28) ≈ −1.79.

Since |T| = 1.79 ≤ t0.025,27 = 2.052, the null hypothesis cannot be rejected. Thus, the evidence against the null hypothesis is not persuasive.

6.2 p-value

The above random variable T is called the t-statistic. Having observed "T = t," we can calculate the p-value

p* = P(|Y| ≥ |t|) = 2 × P(Y ≥ |t|),

where the random variable Y has the t-distribution with (n − 1) degrees of freedom. Then we have the relation "p* < α ⇔ |t| > tα/2,n−1." Thus, we reject H0 with significance level α when p* < α. In the above example, we can compute the p-value p* = 2 × P(Y ≥ 1.79) ≈ 0.0847 ≥ 0.05; thus, we cannot reject H0.
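These numbers are straightforward to reproduce from the summary statistics (an added sketch, not part of the original notes):

    import numpy as np
    from scipy import stats

    xbar, s, n, mu0 = 5.145, 0.7524, 28, 5.4
    t_obs = (xbar - mu0) / (s / np.sqrt(n))              # ~ -1.79
    p_two_sided = 2 * stats.t.sf(abs(t_obs), df=n - 1)   # ~ 0.085
    print(f"t = {t_obs:.2f}, two-sided p-value = {p_two_sided:.4f}")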

6.3 One-sided hypothesis testing

In the same setting of hospital workers subject to radiation exposure, the researcher is this time interested in the plausibility of the statement that the population mean µ is greater than µ0. Then the hypothesis testing problem is

H0 : µ ≥ µ0 versus HA : µ < µ0.

1. The same t-statistic T = (X̄ − µ0)/(S/√n) is used as a test statistic. We reject H0 with significance level α when we find that t < −tα,n−1 for the observed value t of T.

2. Alternatively, we can construct the p-value

p* = P(Y ≤ t),

where the random variable Y has the t-distribution with (n − 1) degrees of freedom. Because of the relation "p* < α ⇔ t < −tα,n−1," we can reject H0 with significance level α when p* < α.


Example. We use the same µ0 = 5.4 for the hypothesis and the same significance level α = 0.05, but use the one-sided test. Recall that X̄ = 5.145 and S = 0.7524 were obtained from the data with n = 28.

1. Then we compute

T = (5.145 − 5.4) / (0.7524/√28) ≈ −1.79.

Since T = −1.79 < −t0.05,27 = −1.703, the null hypothesis H0 is rejected. Thus, the outcome is statistically significant, suggesting that the population mean µ is smaller than 5.4.

2. Alternatively, we can find the p-value p* = P(Y ≤ −1.79) ≈ 0.0423 < 0.05; thus, the null hypothesis should be rejected.

We can also consider the hypothesis testing problem

H0 : µ ≤ µ0 versus HA : µ > µ0.

1. Using the t-statistic T = (X̄ − µ0)/(S/√n), we can reject H0 with significance level α when the observed value t of T satisfies t > tα,n−1.

2. Alternatively, we can construct the p-value p* = P(Y ≥ t), where the random variable Y has the t-distribution with (n − 1) degrees of freedom. Because of the relation "p* < α ⇔ t > tα,n−1," we can reject H0 when p* < α.

6.4 Summary

When the null hypothesis H0 is rejected, it is reasonable to find the confidence interval for the population mean µ. The following table shows the confidence interval we can construct when the null hypothesis is rejected. Here T = (X̄ − µ0)/(S/√n) is the test statistic, and α is the significance level of your choice.

Hypothesis testing              | When can we reject H0? | (1 − α)-level confidence interval
H0 : µ = µ0 versus HA : µ ≠ µ0  | |T| > tα/2,n−1         | (X̄ − tα/2,n−1 S/√n, X̄ + tα/2,n−1 S/√n)
H0 : µ ≤ µ0 versus HA : µ > µ0  | T > tα,n−1             | (X̄ − tα,n−1 S/√n, ∞)
H0 : µ ≥ µ0 versus HA : µ < µ0  | T < −tα,n−1            | (−∞, X̄ + tα,n−1 S/√n)

6.5 Exercises

1. An experimenter is interested in the hypothesis testing problem

H0 : µ = 3.0 mm versus HA : µ ≠ 3.0 mm,

where µ is the population mean of the thickness of glass sheets. Suppose that a sample of n = 21 glass sheets is obtained and their thicknesses are measured.

(a) For what values of the t-statistic does the experimenter accept the null hypothesis with a size α = 0.10?

(b) For what values of the t-statistic does the experimenter reject the null hypothesis with a size α = 0.01?


Suppose that the sample mean is X̄ = 3.04 mm and the sample standard deviation is S = 0.124 mm. Is the null hypothesis accepted or rejected with α = 0.10? With α = 0.01?

2. A machine is set to cut metal plates to a length of 44.350 mm. The lengths of a random sample of 24 metal plates have a sample mean of X̄ = 44.364 mm and a sample standard deviation of S = 0.019 mm. Is there any evidence that the machine is miscalibrated?

3. An experimenter is interested in the hypothesis testing problem

H0 : µ ≤ 0.065 versus HA : µ > 0.065

where µ is the population mean of the density of a chemical solution. Suppose that a sample of n = 31 bottles of the chemical solution is obtained and their densities are measured.

(a) For what values of the t-statistic does the experimenter accept the null hypothesis with a size α = 0.10?

(b) For what values of the t-statistic does the experimenter reject the null hypothesis with a size α = 0.01?

Suppose that the sample mean is X̄ = 0.0768 and the sample standard deviation is S = 0.0231. Is the null hypothesis accepted or rejected with α = 0.10? With α = 0.01?

4. A chocolate bar manufacturer claims that at the time of purchase by a consumer the average age of its product is no more than 120 days. In an experiment to test this claim, a random sample of 26 chocolate bars is found to have ages at the time of purchase with a sample mean of X̄ = 122.5 days and a sample standard deviation of S = 13.4 days. With this information, how do you feel about the manufacturer's claim?

7 Power of test

We define a function K(θ) of parameter θ by the probability that H0 is rejected given µ = θ.

K(θ) = P (“Reject H0” | µ = θ)

Then K(θ) is called the power function.

7.1 Type I error

What is the probability that we incorrectly reject H0 when it is actually true? Such an error is called a type I error, and the probability of type I error is exactly the significance level α, as explained in the following:

1. The probability of type I error for the two-sided hypothesis test is given by K(µ0). Then we have K(µ0) = P(|T| ≥ tα/2,n−1) = α.

2. In a one-sided hypothesis test, the probability of type I error is the worst (that is, largest possible) probability max_{θ≥µ0} K(θ) of type I error. Given µ = θ, the random variable

(X̄ − θ)/(S/√n) = T − (θ − µ0)/(S/√n) = T − δ

has the t-distribution with (n − 1) degrees of freedom, where δ = (θ − µ0)/(S/√n). By observing that δ ≥ 0 if θ ≥ µ0, we obtain

K(θ) = P(T ≤ −tα,n−1) = P(T − δ ≤ −tα,n−1 − δ) ≤ P(T − δ ≤ −tα,n−1) = α.

Thus, we obtain max_{θ≥µ0} K(θ) = α.


7.2 Power of test

What is the probability that we incorrectly accept H0 when it is actually false? This probability β is called the probability of type II error. Then the value (1 − β) is known as the power of the test, indicating how correctly we can reject H0 when it is actually false. Again, consider the case of hospital workers subject to radiation exposure. Given the current estimate S = s of the standard deviation and the current sample size n = n1, the t-statistic T = (X̄ − µ0)/(S/√n) can be approximated by N(δ, 1) with δ = (µ − µ0)/(s/√n1).

Example. Suppose that the true population mean is µ = 5.1 (versus the value µ0 = 5.4 in our hypotheses). Then we can calculate the power of the test with δ ≈ −2.11 as follows.

1. In the two-sided hypothesis testing, we reject H0 when |T| > t0.025,27 = 2.052. Therefore, the power of the test is K(5.1) = P(|T| > 2.052 | µ = 5.1) ≈ 0.523.

2. In the one-sided hypothesis testing, we reject H0 when T < −t0.05,27 = −1.703. Therefore, the power of the test is K(5.1) = P(T < −1.703 | µ = 5.1) ≈ 0.658.

This explains why we could not reject H0 in the two-sided hypothesis testing. Our chance to detect the falsehood of H0 is only 52%, while we have a 66% chance in the one-sided hypothesis testing.
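Using the normal approximation T ≈ N(δ, 1) from above, these power values can be reproduced as follows (an added sketch, not part of the original notes):

    import numpy as np
    from scipy import stats

    mu, mu0, s, n = 5.1, 5.4, 0.7524, 28
    delta = (mu - mu0) / (s / np.sqrt(n))        # ~ -2.11

    # Two-sided test: reject when |T| > t_{0.025,27}
    c2 = stats.t.ppf(0.975, df=n - 1)            # ~ 2.052
    power_two = stats.norm.cdf(-c2 - delta) + stats.norm.sf(c2 - delta)

    # One-sided test: reject when T < -t_{0.05,27}
    c1 = stats.t.ppf(0.95, df=n - 1)             # ~ 1.703
    power_one = stats.norm.cdf(-c1 - delta)

    print(f"delta = {delta:.2f}, two-sided power ~ {power_two:.3f}, "
          f"one-sided power ~ {power_one:.3f}")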

7.3 Effect of sample size

For a fixed significance level α of your choice, the power of the test increases as the sample size n increases. In the two-sided hypothesis testing discussed above, we could recommend collecting additional data to increase the power of the test. But how much additional data do we need? Here is one possible way to calculate a desirable sample size n: in the two-sided hypothesis testing, the power (1 − β) of the test is approximated by

P(|T| > tα/2,n−1) ≈ P(Y < −tα/2,n−1 − δ) + P(Y > tα/2,n−1 − δ) ≥ P(Y > tα/2,n−1 − |δ|)

with a random variable Y having the t-distribution with (n − 1) degrees of freedom. Given the current estimate S = s of the standard deviation and the current sample size n1, we can achieve power (1 − α/2) by increasing the total sample size n so that |δ| ≥ 2 tα/2,n1−1. In the above example of radiation exposure of hospital workers, such a size n can be calculated as

n ≥ (2 tα/2,n1−1 s / |µ − µ0|)² = (2 t0.025,27 × 0.7524 / |5.1 − 5.4|)² = 105.9.
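This bound is quick to evaluate (an added sketch, not part of the original notes):

    import numpy as np
    from scipy import stats

    mu, mu0, s, n1, alpha = 5.1, 5.4, 0.7524, 28, 0.05
    tcrit = stats.t.ppf(1 - alpha/2, df=n1 - 1)           # t_{0.025,27} ~ 2.052
    n_required = (2 * tcrit * s / abs(mu - mu0)) ** 2
    print(f"required sample size n >= {n_required:.1f}")  # ~ 105.9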

8 Comparison of two populations

We often want to compare two populations on the basis of an experiment. For example, a researcher wants to test the effect of his drug on blood pressure. In any treatment, an improvement could be due to the placebo effect, in which the subject believes that he or she has been given an effective treatment. To protect against such biases, the study should consider (i) the use of a control group in which the subjects are given a placebo, and an experimental group in which the subjects are treated with the new drug, (ii) randomization, by assigning the subjects between the control and the experimental groups randomly, and (iii) a double-blind experiment, by concealing the nature of the treatment from the subjects and the person taking measurements. Then it becomes the hypothesis testing problem

H0 : µ1 = µ2 versus HA : µ1 ≠ µ2,

where µ1 and µ2 are the respective population means of the control and the experimental groups.


As a result of the experiment, we typically obtain the measurements

X1, ..., Xn

of the subjects from the control group, and the measurements

Y1, ..., Ym

of the subjects from the experimental group. Then it is usually assumed that X1, ..., Xn and Y1, ..., Ym are independent and normally distributed with parameters (µ1, σ1²) and (µ2, σ2²), respectively. Even when they are not normally distributed, large sample sizes (n, m ≥ 30) ensure that the tests are appropriate via the central limit theorem.

8.1 Pooled variance procedure

Let Sx and Sy be the sample standard deviations constructed from X1, ..., Xn and Y1, ..., Ym, respectively. When it is reasonable to assume σ1² = σ2², we can construct the pooled sample variance

Sp² = [(n − 1) Sx² + (m − 1) Sy²] / (n + m − 2).

The test statistic

T = (X̄ − Ȳ) / (Sp √(1/n + 1/m))

has the t-distribution with (n + m − 2) degrees of freedom under the null hypothesis H0. Thus, we reject the null hypothesis H0 with significance level α when the observed value t of T satisfies |t| > tα/2,n+m−2. Or, equivalently, we can compute the p-value

p* = 2 × P(Y ≥ |t|)

with Y having the t-distribution with (n + m − 2) degrees of freedom, and reject H0 when p* < α.

Confidence interval. The following table shows the corresponding confidence interval of the population mean difference µ1 − µ2, when the null hypothesis H0 is rejected.

Hypothesis testing                 | (1 − α)-level confidence interval
H0 : µ1 = µ2 vs. HA : µ1 ≠ µ2      | (X̄ − Ȳ − tα/2,n+m−2 Sp √(1/n + 1/m), X̄ − Ȳ + tα/2,n+m−2 Sp √(1/n + 1/m))
H0 : µ1 ≤ µ2 vs. HA : µ1 > µ2      | (X̄ − Ȳ − tα,n+m−2 Sp √(1/n + 1/m), ∞)
H0 : µ1 ≥ µ2 vs. HA : µ1 < µ2      | (−∞, X̄ − Ȳ + tα,n+m−2 Sp √(1/n + 1/m))

Example. Suppose that we consider the significance level α = 0.01, and that we have obtained X̄ = 80.02 and Sx = 0.024 from the control group of size n = 13, and Ȳ = 79.98 and Sy = 0.031 from the experimental group of size m = 8. Here we have assumed that σ1² = σ2². Then we can compute the square root Sp = 0.027 of the pooled sample variance Sp², and the test statistic

T = (80.02 − 79.98) / (0.027 √(1/13 + 1/8)) ≈ 3.33.

Thus, we can obtain p* = 2 × P(Y ≥ 3.33) ≈ 0.0035 < 0.01, and reject H0. We conclude that the two population means are significantly different. And the 99% confidence interval for the mean difference is (0.006, 0.074).
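scipy can run the pooled-variance procedure directly from the summary statistics (an added sketch, not part of the original notes):

    from scipy import stats

    t_stat, p_value = stats.ttest_ind_from_stats(
        mean1=80.02, std1=0.024, nobs1=13,
        mean2=79.98, std2=0.031, nobs2=8,
        equal_var=True)                      # pooled-variance procedure
    print(f"T = {t_stat:.2f}, p-value = {p_value:.4f}")   # ~3.33, ~0.0035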


8.2 General procedure

When σ1² ≠ σ2², under the null hypothesis H0 the test statistic

T = (X̄ − Ȳ) / √(Sx²/n + Sy²/m)

has approximately the t-distribution with ν degrees of freedom, where ν is the nearest integer to

(Sx²/n + Sy²/m)² / [Sx⁴/(n²(n − 1)) + Sy⁴/(m²(m − 1))].

Thus, we reject the null hypothesis H0 with significance level α when the observed value t of T satisfies |t| > tα/2,ν. Or, equivalently, we can compute the p-value

p* = 2 × P(Y ≥ |t|)

with Y having the t-distribution with ν degrees of freedom, and reject H0 when p* < α.

Confidence interval. The following table shows the corresponding confidence interval of the population mean difference µ1 − µ2, when the null hypothesis H0 is rejected.

Hypothesis testing                    | (1 − α)-level confidence interval
H0 : µ1 = µ2 versus HA : µ1 ≠ µ2      | (X̄ − Ȳ − tα/2,ν √(Sx²/n + Sy²/m), X̄ − Ȳ + tα/2,ν √(Sx²/n + Sy²/m))
H0 : µ1 ≤ µ2 versus HA : µ1 > µ2      | (X̄ − Ȳ − tα,ν √(Sx²/n + Sy²/m), ∞)
H0 : µ1 ≥ µ2 versus HA : µ1 < µ2      | (−∞, X̄ − Ȳ + tα,ν √(Sx²/n + Sy²/m))

Example. Suppose that we consider the significance level α = 0.01, and that we have obtained X̄ = 80.02 and Sx = 0.024 from the control group of size n = 13, and Ȳ = 79.98 and Sy = 0.031 from the experimental group of size m = 8 as before. Then the test statistic is T ≈ 3.12 with ν = 12. Thus, we can obtain p* = 2 × P(Y ≥ 3.12) ≈ 0.0089 < 0.01, and still reject H0.
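The same scipy helper handles this case with equal_var=False (an added sketch, not part of the original notes); note that scipy keeps the fractional degrees of freedom rather than rounding ν to 12, so the p-value may differ slightly from the value above.

    from scipy import stats

    t_stat, p_value = stats.ttest_ind_from_stats(
        mean1=80.02, std1=0.024, nobs1=13,
        mean2=79.98, std2=0.031, nobs2=8,
        equal_var=False)                     # unequal-variance procedure
    print(f"T = {t_stat:.2f}, p-value = {p_value:.4f}")   # ~3.12, ~0.009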

9 Inference on proportions

In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from plants with round yellow seeds and plants with wrinkled green seeds. The possible types of progeny were "round yellow," "wrinkled yellow," "round green," and "wrinkled green." When the recorded data values x1, ..., xn take one of several types, or categories, we call them categorical data.

9.1 Point estimate

Let X be the number of observations of a particular type in categorical data of size n, and let p be the population proportion of this type (that is, the probability of occurrence of this type). Then the random variable X has the binomial distribution with parameter (n, p). The point estimate of the population proportion p is

p̂ = X/n.


We can easily see that

E(p̂) = E(X/n) = (1/n) E(X) = p.

Thus, p̂ is an unbiased estimate of p. Furthermore, recall by the central limit theorem that we have approximately

X ∼ N(np, np(1 − p))

when n is large. Then the point estimate p̂ is approximately distributed as the normal distribution with parameter (p, p(1 − p)/n).

9.2 Hypothesis test

Suppose that a vaccine can be approved for widespread use if it can be established that the probability p of serious adverse reaction is less than p0. Then the hypothesis testing problem becomes

H0 : p ≥ p0 versus HA : p < p0.    (2)

Let X be the number of participants who suffer an adverse reaction among n participants. Then the random variable X has the binomial distribution with parameter (n, p) and is approximated by the normal distribution with parameter (np, np(1 − p)) when n is large [that is, satisfying np > 5 and n(1 − p) > 5].

Critical point. The critical point of the standard normal distribution, denoted by zα, is defined as the value satisfying P(Z > zα) = α, where Z is a standard normal random variable. Since the normal distribution is symmetric, this implies that P(Z < −zα) = α.

Testing procedure. When np0 > 5 and n(1 − p0) > 5,

Z = (X − n p0) / √(n p0 (1 − p0))    (3)

is used as the test statistic. Then we can reject H0 in (2) with significance level α if the value z of the test statistic Z satisfies z < −zα. Equivalently, we can compute the p-value p* = Φ(z), and reject H0 when p* < α. Since a continuity correction improves the accuracy, the alternative test statistic

Z = (X − n p0 + 0.5) / √(n p0 (1 − p0))

may also be used.

Confidence interval. When H0 is rejected, we want to further investigate the confidence interval for the population proportion p which corresponds to the result of the hypothesis test. We have the point estimate p̂ = X/n. Then the two different formulas

(0, [X + zα²/2 + zα √(X(n − X)/n + zα²/4)] / (n + zα²))    (4)

(0, p̂ + zα √(p̂(1 − p̂)/n))    (5)

can be used for the confidence interval of level (1 − α). Although Formula (4) is known to be more accurate, Formula (5) may be used in most of our problems since it is easier to calculate.

Example. Suppose that p0 = 0.05 is required, and that the significance level α = 0.05 is chosen. The study shows that X = 4 adverse reactions are found out of n = 155 participants. Note that (0.05)(155) = 7.75 > 5 and (0.95)(155) = 147.25 > 5. Thus, we have

Z = [4 − (155)(0.05) + 0.5] / √((155)(0.05)(0.95)) ≈ −1.20 and p* = Φ(−1.20) ≈ 0.115.


We can also obtain the point estimate p̂ ≈ 0.0258 and the 95% confidence interval (0, 0.0562) by using (4) [we get (0, 0.0467) if we use (5)]. Since p* ≥ 0.05, we cannot reject the null hypothesis. Thus, it is not advisable that the vaccine be approved as a result of this study.
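The test statistic and both interval formulas can be reproduced as follows (an added sketch, not part of the original notes):

    import numpy as np
    from scipy import stats

    x, n, p0, alpha = 4, 155, 0.05, 0.05
    z_alpha = stats.norm.ppf(1 - alpha)              # z_{0.05} ~ 1.645

    # Test statistic with continuity correction
    z = (x - n * p0 + 0.5) / np.sqrt(n * p0 * (1 - p0))
    print(f"Z = {z:.2f}, p-value = {stats.norm.cdf(z):.3f}")   # ~ -1.20, ~0.115

    # Upper confidence bounds from Formula (4) and Formula (5)
    upper4 = (x + z_alpha**2/2
              + z_alpha * np.sqrt(x*(n - x)/n + z_alpha**2/4)) / (n + z_alpha**2)
    phat = x / n
    upper5 = phat + z_alpha * np.sqrt(phat * (1 - phat) / n)
    print(f"Formula (4): (0, {upper4:.4f}); Formula (5): (0, {upper5:.4f})")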

9.3 Sample size calculations

We always control the possibility of incorrectly rejecting H0 when H0 is true (a type I error), say, to be less than a 5% chance. But at the same time we sacrifice the power of detecting the falsehood of H0 when H0 is false, that is, the power of the test. In order for the hypothesis testing problem

H0 : p ≥ p0 versus HA : p < p0

to achieve power (1 − β), we need a sample of size

n ≥ [(zα √(p0(1 − p0)) + zβ √(p(1 − p))) / (p − p0)]².    (6)

In the above vaccine experiment, if the true population proportion p is 0.025, then the power of the test is only 0.18. To increase the power of the test to at least 0.8, we need a sample size of at least n = 464.

9.4 Summary

Possible null hypotheses for the inference on a population proportion are "H0 : p = p0", "H0 : p ≤ p0", and "H0 : p ≥ p0". In each case we can use the test statistic Z in (3) if we do not make a continuity correction. The corresponding testing procedures are summarized in the following table.

Null hypothesis | When we reject it | (1 − α)-level confidence interval
H0 : p = p0     | |Z| > zα/2        | (p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n))
H0 : p ≤ p0     | Z > zα            | (p̂ − zα √(p̂(1 − p̂)/n), 1)
H0 : p ≥ p0     | Z < −zα           | (0, p̂ + zα √(p̂(1 − p̂)/n))

For the sample size calculation, use Formula (6) if the null hypothesis is either "H0 : p ≤ p0" or "H0 : p ≥ p0". When the null hypothesis is "H0 : p = p0", the sample size n can be computed as

n ≥ [(zα/2 √(p0(1 − p0)) + zβ √(p(1 − p))) / (p − p0)]².

9.5 Comparison of two proportions

A researcher is interested in whether there is discrimination against women in a university. In terms of statistics this is the hypothesis testing problem

H0 : pA ≤ pB versus HA : pA > pB,

where pA and pB are the respective population proportions of men and women who are admitted to the university. The researcher decided to collect the data for a graduate program in the university. Let X and Y be the respective


numbers of men and women who are admitted to the graduate school, as summarized in the following table:

      | Men   | Women
Admit | X     | Y
Deny  | n − X | m − Y
Total | n     | m

The test statistic is given by

Z = (p̂A − p̂B) / √(p̂(1 − p̂)(1/n + 1/m)),

where p̂A = X/n and p̂B = Y/m are the point estimates of pA and pB, and

p̂ = (X + Y)/(n + m)

is called the pooled estimate of the common population proportion. Under the null hypothesis, the probability that Z > zα becomes approximately less than α. Thus, we reject H0 when the observed value z of Z satisfies z > zα. Or, equivalently, we can reject H0 when p* = 1 − Φ(z) < α.

Confidence interval. We may want to further investigate the confidence interval for the difference pA − pB. Having constructed the hypothesis testing problems "H0 : pA = pB", "H0 : pA ≤ pB", or "H0 : pA ≥ pB", the following table shows the corresponding testing procedure and confidence interval.

Null hypothesis | When we reject it                          | (1 − α)-level confidence interval for pA − pB
H0 : pA = pB    | |z| > zα/2 [that is, p* = 2 × (1 − Φ(|z|)) < α] | (p̂A − p̂B − zα/2 √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m), p̂A − p̂B + zα/2 √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m))
H0 : pA ≤ pB    | z > zα [that is, p* = 1 − Φ(z) < α]        | (p̂A − p̂B − zα √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m), 1)
H0 : pA ≥ pB    | z < −zα [that is, p* = Φ(z) < α]           | (−1, p̂A − p̂B + zα √(p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m))

Example. The following table classifies the applications to the graduate school according to admission status and sex.

      | Men | Women | Total
Admit | 97  | 40    | 137
Deny  | 263 | 42    | 305
Total | 360 | 82    | 442

Then we have p̂A = 97/360 ≈ 0.269, p̂B = 40/82 ≈ 0.488, and p̂ = 137/442 ≈ 0.310. And we can obtain

Z = (0.269 − 0.488) / √((0.31)(0.69)(1/360 + 1/82)) ≈ −3.87 and p* = 1 − Φ(−3.87) ≈ 0.9999.

Thus, we cannot reject H0, indicating no evidence of discrimination against women in this particular graduate program. In fact, the null hypothesis "H0 : pA ≥ pB" would be rejected in this example.
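A sketch reproducing this computation (added, not part of the original notes):

    import numpy as np
    from scipy import stats

    x, n = 97, 360      # men admitted / applied
    y, m = 40, 82       # women admitted / applied

    pa, pb = x / n, y / m
    pooled = (x + y) / (n + m)
    z = (pa - pb) / np.sqrt(pooled * (1 - pooled) * (1/n + 1/m))
    p_star = 1 - stats.norm.cdf(z)            # for H0: pA <= pB vs HA: pA > pB
    print(f"Z = {z:.2f}, p* = {p_star:.4f}")  # ~ -3.87, ~1.0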


10 Chi-square test

In the experiment on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from plants with round yellow seeds and plants with wrinkled green seeds. The possible types of progeny were "round yellow," "wrinkled yellow," "round green," and "wrinkled green." Mendel's theory predicted the associated probabilities of occurrence as follows.

              | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Probabilities | 9/16         | 3/16            | 3/16        | 1/16

We want to test whether the data from n observations are consistent with his theory (a goodness-of-fit test), in which the statement of the null hypothesis becomes "the model is valid."

10.1 Chi-square test

In general, each observation is classified into one of k categories or "cells," which results in the cell frequencies

X1, ..., Xk.

The goodness of fit to a particular model can be assessed by comparing the observed cell frequencies X1, ..., Xk with the expected cell frequencies

E1, ..., Ek,

which are predicted from the model. The discrepancy between the data and the model can be measured by Pearson's chi-square statistic

χ² = ∑_{i=1}^k (Xi − Ei)²/Ei.

Under the null hypothesis (that is, assuming that the model is correct), the distribution of Pearson's chi-square χ² is approximated by the chi-square distribution with

df = (number of cells) − 1 − (number of parameters in the model)

degrees of freedom. Therefore, if we observe χ² = x and x > χ²α,df, then we can reject the null hypothesis, casting doubt on the validity of the model. Or, by computing the p-value

p* = P(X > x)

with a random variable X having the chi-square distribution with df degrees of freedom, we can equivalently reject the null hypothesis when p* < α.

Example. In the experiment of pea breeding, we have obtained the data in the following table.

            | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Frequencies | 315          | 101             | 108         | 32

With the total number of observations n = 556, the expected cell frequencies from Mendel's theory can be calculated as

                     | Round yellow | Wrinkled yellow | Round green | Wrinkled green
Expected frequencies | 312.75       | 104.25          | 104.25      | 34.75

We can compute Pearson's chi-square χ² = 0.47. Since Mendel's model has no parameter, the chi-square distribution has 3 = (4 − 1) degrees of freedom, and we get the p-value p* = 0.925. Thus, there is little reason to doubt Mendel's theory on the basis of Pearson's chi-square test.
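scipy reproduces this goodness-of-fit test directly (an added sketch, not part of the original notes):

    from scipy import stats

    observed = [315, 101, 108, 32]
    expected = [556 * p for p in (9/16, 3/16, 3/16, 1/16)]  # 312.75, 104.25, 104.25, 34.75

    chi2, p = stats.chisquare(observed, f_exp=expected)
    print(f"chi-square = {chi2:.2f}, p-value = {p:.3f}")    # ~0.47, ~0.925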


10.2 Test of independence

Consider again the study of discrimination against women in university admission. In the study, there are two characteristics: men or women; admitted or denied. The researcher wanted to know whether these characteristics are linked or independent. For such a study, we take a random sample of size n from the population, which is summarized in the contingency table

      | Men  | Women | Total
Admit | X11  | X12   | X1·
Deny  | X21  | X22   | X2·
Total | X·1  | X·2   | n = X··

The statement of the null hypothesis becomes "the two characteristics are independent." Under the null hypothesis, the expected frequencies for the contingency table can be given by

      | Men     | Women   | Total
Admit | n p1 q1 | n p1 q2 | n p1
Deny  | n p2 q1 | n p2 q2 | n p2
Total | n q1    | n q2    | n

The point estimates of p1, p2, q1, and q2 are p̂1 = X1·/n, p̂2 = X2·/n, q̂1 = X·1/n, and q̂2 = X·2/n. With these point estimates, the chi-square statistic is

χ² = ∑_{i=1}^2 ∑_{j=1}^2 (Xij − Xi· X·j/n)² / (Xi· X·j/n),

and the degrees of freedom is (4 − 1 − 2) = 1.

Example. Using the same data as before, we can obtain the chi-square statistic

χ² = [97 − (137)(360)/442]²/[(137)(360)/442] + [40 − (137)(82)/442]²/[(137)(82)/442]
   + [263 − (305)(360)/442]²/[(305)(360)/442] + [42 − (305)(82)/442]²/[(305)(82)/442] ≈ 14.89,

and the p-value p* = 0.0001. Thus, the null hypothesis is rejected at any reasonable level, indicating that the two characteristics are dependent.
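scipy's chi2_contingency reproduces this test (an added sketch, not part of the original notes); correction=False disables the continuity correction so that the statistic matches the formula above.

    from scipy import stats

    table = [[97, 40],    # Admit: men, women
             [263, 42]]   # Deny:  men, women

    chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p-value = {p:.4f}")  # ~14.89, 1, ~0.0001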

11 Minimum variance unbiased estimator

One of the important attributes of a point estimate is unbiasedness. Since a statistic u(X) is a random variable, we can consider the expectation E[u(X)]. Then the point estimate u(X) of θ is called an unbiased estimator if it satisfies E[u(X)] = θ. For example, the sample mean X̄ is an unbiased estimate of the mean µ, since E(X̄) = µ. Furthermore, u(X) is called a minimum variance unbiased estimator if (i) it is unbiased, and (ii) for every other unbiased estimator s(X) for θ we have

R(θ, u) = Var(u(X)) ≤ Var(s(X)) = R(θ, s) for all θ.

11.1 Sufficient statistics

Let f(x; θ) be the joint density function of x = (x1, ..., xn) for a random vector X = (X1, ..., Xn). Note that f(x1, ..., xn; θ) is of the form f(x1; θ) · · · f(xn; θ) if (X1, ..., Xn) is a random sample. Suppose that s(X) is a


statistic having the pdf g(s; θ). Then s(X) is called a sufficient statistic if

f(x; θ) / g(s(x); θ)

is a function of x and does not depend on θ.

Factorization theorem. s(X) is a sufficient statistic if and only if the joint density f(x; θ) can be expressed in the form

f(x; θ) = k(s(x); θ) · h(x).

Furthermore, the pdf g(s; θ) for the statistic s(X) is proportional to k(s; θ), or possibly a multiple c(s) × k(s; θ) with a function c(s) of s.

11.2 Rao-Blackwell’s theorem

Let X = (X1, ..., Xn) be a random sample from f(x; θ). Suppose that s(X) is a sufficient statistic, and that u(X) is an unbiased estimator for θ. Then we can construct a new statistic

ϕ(s(X)) = E(u(X) | s(X)),

which is a function of the sufficient statistic s(X). Furthermore, we obtain E[ϕ(s(X))] = θ by the law of total expectation, and

Var(ϕ(s(X))) ≤ Var(u(X))

via the conditional variance formula.

11.3 Lehmann–Scheffé's theorem

Let s(X) be a statistic having the pdf g(s; θ). Then s(X) is called a complete statistic if, for any function φ,

E[φ(s(X))] = 0 for all θ

suffices to imply that φ(s(X)) ≡ 0.

Lehmann–Scheffé's theorem. Suppose that s(X) is a complete and sufficient statistic. Then the statistic

ϕ*(s(X)) = E(u(X) | s(X))

is well-defined with any choice of unbiased statistic u(X). Moreover, ϕ*(s(X)) is the minimum variance unbiased estimator for θ, and it is unique among functions of s(X).

11.4 Exponential families

Let A be an interval (or a subset) of R, and let

f(x; θ) = exp[c(θ)k(x) + h(x) + d(θ)], x ∈ A; otherwise, f(x; θ) = 0 for all x ∉ A,    (7)

be a probability density function with parameter θ. Here the interval A can be (−∞, ∞), [0, ∞), or [0, 1], or the subset A can be {0, 1, ...} or {0, ..., n}, for example; but A should not depend on the parameter θ. Then we say that the pdf f(x; θ) is of a one-parameter exponential family. For example,

1. an exponential density,


2. a normal density with known σ2,

3. a Poisson frequency function

are of the one-parameter exponential family.

Natural sufficient statistics and completeness. Let X1, ..., Xn be a random sample from (7). Then the joint density

f(x; θ) = exp[c(θ) ∑_{i=1}^n k(xi) + ∑_{i=1}^n h(xi) + n d(θ)], x ∈ Aⁿ,

is of the exponential family. By the factorization theorem, the statistic s(X) = ∑_{i=1}^n k(Xi) is sufficient; we call it a natural sufficient statistic. Furthermore, the statistic s(X) has a pdf of the exponential family

f(s; θ) = exp[c(θ)s + h*(s) + n d(θ)], s ∈ A*,

and is complete.

12 Efficient estimator

We call a statistic u(X) efficient if it achieves the Cramér–Rao lower bound.

12.1 Fisher information

Let f(x; θ) be the joint density function for a random sample X = (X1, ..., Xn). Furthermore, we assume that (i) A = {x : f(x; θ) > 0} does not depend on θ, and (ii) (∂/∂θ) ln f(x; θ) exists; we will call the conditions in (i)–(ii) the regularity assumptions. For example, a pdf of the exponential family satisfies the regularity assumptions. By observing that E[(∂/∂θ) ln f(X; θ)] = 0, we can define the Fisher information I(θ) by

I(θ) = E[((∂/∂θ) ln f(X; θ))²] = Var((∂/∂θ) ln f(X; θ)).

Now suppose that f(x; θ) is of the form f(x1; θ) · · · f(xn; θ). By setting I1(θ) = Var((∂/∂θ) ln f(X1; θ)), we obtain

I(θ) = n × Var((∂/∂θ) ln f(X1; θ)) = n I1(θ).

Exercise. Show that I(θ) = −E[(∂²/∂θ²) ln f(X; θ)].

12.2 Cramér–Rao lower bound

Let X and Y be random variables, and let a be a real number. Then we have Var(aX − Y) = a² Var(X) − 2a Cov(X, Y) + Var(Y) ≥ 0. By substituting a = Cov(X, Y)/Var(X), we can find the Cauchy–Schwarz inequality

[Cov(X, Y)]² ≤ Var(X) · Var(Y).

Let u(X) be a statistic. Then E[u(X)] is a function of θ, say ψ(θ) = E[u(X)]. Here we can show that

ψ′(θ) = Cov(u(X), (∂/∂θ) ln f(X; θ)).

By applying the Cauchy–Schwarz inequality we obtain the Cramér–Rao inequality

Var(u(X)) ≥ [ψ′(θ)]² / I(θ).    (8)

When u(X) is an unbiased statistic, the right-hand side of (8) becomes 1/I(θ), and is called the Cramér–Rao lower bound.

If a statistic u(X) achieves the Cramér–Rao lower bound, we call u(X) an efficient estimator. Clearly an efficient and unbiased statistic is a minimum variance unbiased estimator. Now let f(x; θ) be a joint density which satisfies the regularity assumptions, and let u(X) be an unbiased statistic for θ. If the joint density is of the exponential family

f(x; θ) = exp[c(θ)u(x) + h(x) + d(θ)], x ∈ A*; otherwise, f(x; θ) = 0 for all x ∉ A*,    (9)

then the natural sufficient statistic u(X) is an efficient estimator. Conversely, if u(X) is an efficient estimator, then the joint density f(x; θ) is given in the form (9).

Remark. The notion of an efficient statistic is somewhat stronger than that of minimum variance. Even worse, there exist minimum variance unbiased estimators which do not achieve their respective Cramér–Rao lower bound.

13 Hypothesis testing

Let θ be a parameter of an underlying probability density function f(x; θ) for a certain population. The hypothesis "H0 : θ = θ0" is called a simple hypothesis, since it completely specifies the underlying distribution. In contrast, the hypothesis "H0 : θ ∈ Θ0" with a set Θ0 of parameters is called a composite hypothesis if the set Θ0 contains more than one element. The "opposite" of the null hypothesis is called an alternative hypothesis, and is similarly expressed as "HA : θ ∈ Θ1", where Θ1 is another set of parameters such that Θ0 ∩ Θ1 = ∅. The set Θ1 is typically (but not necessarily) chosen to be the complement of Θ0. Thus, the hypothesis testing problem "H0 versus HA" can be formed as

H0 : θ ∈ Θ0 versus HA : θ ∈ Θ1.    (10)

The problem stated above is whether or not to reject H0 in favor of HA.

13.1 Test statistic

Given a random sample X = (X1, ..., Xn), a function

δ(X) = {1 if H0 is rejected; 0 otherwise}    (11)

is called a test function. Given the test (11), we can define the power function by

K(θ0) = P("Reject H0" | θ = θ0) = E(δ(X) | θ = θ0).

A typical test, however, is presented in the form "H0 is rejected if T(X) ≥ c." Here T(X) is called a test statistic, and c is called a critical value. Then the test function can be expressed as

δ(X) = {1 if T(X) ≥ c; 0 otherwise}.    (12)

Thus, we obtain K(θ0) = P(T(X) ≥ c | θ = θ0). The probability of type I error (i.e., "H0 is incorrectly rejected when H0 is true") is defined by

α = sup_{θ0∈Θ0} K(θ0),

which is also known as the size of the test. Having calculated the size α of the test, (11) or (12) is said to be a level α test, or a test with significance level α.


13.2 Uniformly most powerful test

What is the probability that we incorrectly accept H0 when it is actually false? This probability β is called the probability of type II error. Then the value (1 − β) is known as the power of the test, indicating how correctly we can reject H0 when it is actually false. Suppose that H0 is in fact false, say θ = θ1 for some θ1 ∈ Θ1. Then the power of the test is calculated by K(θ1).

Suppose that the test (11) has size α. This test is said to be uniformly most powerful if it satisfies

K(θ1) ≥ K′(θ1) for all θ1 ∈ Θ1

for the power function K′ of every other level α test. Furthermore, if this test is given in the form (12) with test statistic T(X), then the test statistic T(X) is said to be optimal.

14 Likelihood ratio test

Consider the testing problem with simple (null and alternative) hypotheses:

H0 : θ = θ0 versus HA : θ = θ1.

Let L(θ; x) = ∏_{i=1}^n f(xi; θ) be the likelihood function, and let

L(θ0, θ1; x) = L(θ1, x) / L(θ0, x)

be the likelihood ratio. Then

δ(X) = {1 if L(θ0, θ1; X) ≥ c; 0 otherwise}    (13)

becomes a uniformly most powerful test, and is called the Neyman–Pearson test.

The test function (13) has the following property: for any function ψ(x) satisfying 0 ≤ ψ(x) ≤ 1,

E(ψ(X) | θ = θ1) − E(δ(X) | θ = θ1) ≤ c[E(ψ(X) | θ = θ0) − E(δ(X) | θ = θ0)].

14.1 Monotone likelihood ratio family

Let f(x; θ) be a joint density function with parameter θ, and let L(θ0, θ1; x) be the likelihood ratio. Suppose that T(X) is a statistic that does not depend on the parameter θ. Then f(x; θ) is called a monotone likelihood ratio family in T(X) if

1. f(x; θ0) and f(x; θ1) are distinct for θ0 ≠ θ1;

2. L(θ0, θ1; x) is a non-decreasing function of T(x) whenever θ0 < θ1.

Now consider the following test problem:

H0 : θ ≤ θ0 (or H0 : θ = θ0) versus HA : θ > θ0.    (14)

If f(x; θ) is a monotone likelihood ratio family in T(X), then the test functions (12) and (13) are equivalent whenever θ0 < θ1, and the power function K(θ) for these tests becomes an increasing function. Furthermore, T(X) is an optimal test statistic, and the size of the test is simply given by α = K(θ0).

Suppose that f(x; θ) is of the exponential family f(x; θ) = exp[c(θ)u(x) + h(x) + d(θ)], x ∈ A*, and that c(θ) is a strictly increasing function. Then f(x; θ) is a monotone likelihood ratio family in u(X), and the natural sufficient statistic u(X) becomes an optimal test statistic.

Remark. Essentially, uniformly most powerful tests exist only for the test problem (14).


14.2 Test procedure

The Neyman–Pearson test (13) can be generalized for the composite hypotheses in (10): (i) obtain the maximum likelihood estimate (MLE) θ̂ of θ, (ii) calculate also the MLE θ̂0 restricted to θ ∈ Θ0, and (iii) construct the likelihood ratio

λ(X) = L(θ̂; X) / L(θ̂0; X) = sup_θ L(θ; X) / sup_{θ∈Θ0} L(θ; X) = max{ sup_{θ∈Θ1} L(θ; X) / sup_{θ∈Θ0} L(θ; X), 1 }.

The test statistic λ(X) yields an excellent test procedure in many practical applications, though it is not an optimal test in general.

15 Bayesian theory

Let f(x; θ) be a density function with parameter θ ∈ Ω. In a Bayesian model the parameter space Ω has a distribution π(θ), called a prior distribution. Furthermore, f(x; θ) is viewed as the conditional distribution of X given θ. By Bayes' rule the conditional density π(θ | x) can be derived from

π(θ | x) = π(θ)f(x; θ) / ∑_{θ∈Ω} π(θ)f(x; θ) if Ω is discrete;

π(θ | x) = π(θ)f(x; θ) / ∫_Ω π(θ)f(x; θ) dθ if Ω is continuous.

The distribution π(θ | x) is called the posterior distribution. Whether Ω is discrete or continuous, the posterior distribution π(θ | x) is "proportional" to π(θ)f(x; θ) up to a constant. Thus, we write

π(θ | x) ∝ π(θ)f(x; θ).

15.1 Conjugate family

It is often the case that both the prior density function π(θ) and the posterior density function π(θ | x) belong to the same family of density functions π(θ; η) with parameter η. Then π(θ; η) is called conjugate to f(x; θ).

Let

f(x; θ) = exp[n c0(θ) + ∑_{j=1}^m cj(θ) kj(x) + h(x)];

π(θ; η0, η1, ..., ηm) = exp[c0(θ) η0 + ∑_{j=1}^m cj(θ) ηj + w(η0, η1, ..., ηm)].

Suppose that a prior distribution is given by π(θ) = π(θ; η0, η1, ..., ηm). Then we obtain the posterior density

π(θ | x) = π(θ; η0 + n, η1 + k1(x), ..., ηm + km(x)).

Thus, the family of π(θ; η0, η1, ..., ηm) is conjugate to f(x; θ).
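A concrete instance of this update (an added sketch, not part of the original notes) is the Beta prior for a Bernoulli sample: the Bernoulli model is a one-parameter exponential family, and the posterior is again a Beta distribution whose parameters are shifted by the counts of successes and failures.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.binomial(1, 0.3, size=40)       # Bernoulli(0.3) sample

    # Beta(a, b) prior on theta; the Bernoulli likelihood is conjugate to it.
    a, b = 2.0, 2.0
    a_post = a + x.sum()                    # a + number of successes
    b_post = b + len(x) - x.sum()           # b + number of failures

    posterior = stats.beta(a_post, b_post)
    print(f"posterior mean = {posterior.mean():.3f}, 95% interval = "
          f"({posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f})")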


15.2 Decision model

Given a random sample X from f(x; θ), we can introduce a decision function δ(X) and incur a loss l(θ, δ(X)) associated with the state of θ. Together we calculate the risk by

R(θ, δ) = ∫ l(θ, δ(x)) f(x; θ) dx.

If the decision function δ is strictly dominated by no other decision function δ′, that is, if no decision function δ′ satisfies R(θ, δ′) ≤ R(θ, δ) for all θ ∈ Ω with strict inequality for some θ ∈ Ω, then δ is called admissible.

In the Bayesian model where the parameter θ has the prior distribution π(θ), we can define the Bayes risk by

r(δ) = ∫_Ω R(θ, δ) π(θ) dθ.

Then the decision function δ* is called a Bayes solution if δ* minimizes the Bayes risk r(δ). When the parameter space Ω is an interval on R, π(θ) > 0, and R(θ, δ) is continuous at every point θ for every decision function δ, the Bayes solution δ* is admissible.

Having observed X = x, we can compute the posterior density π(θ | x) and construct the posterior risk by

r(δ(x) | x) = ∫_Ω l(θ, δ(x)) π(θ | x) dθ.

If there is a decision function δ0 such that δ0(x) minimizes the posterior risk r(δ(x) | x) for every x, then δ0 is a Bayes solution.
