
Introducing Probability and Statistics:

A concise course on the fundamentals of statistics

by Dr Robert G Aykroyd, Department of Statistics, University of Leeds

© RG Aykroyd and University of Leeds, 2014. Section 4 produced in collaboration with S Barber.


Introducing Probability and Statistics

“Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the natural and social sciences to the humanities, government and business. Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modelled in a way that accounts for randomness and uncertainty in the observations, and are then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics comprise applied statistics. There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject.”

(Source: http://en.wikipedia.org/wiki/Statistics)

This short course aims to give a quick reminder of many basic ideas in probability and statistics. The material is selected from an undergraduate module on mathematical statistics, and hence emphasises the “theoretical basis” which underpins applied statistics. The topics covered are a mix of practical methods and mathematical foundations. If you are familiar with most of these ideas, then you are well prepared for your studies. If, on the other hand, you find some of the topics new then please take some extra time to understand the ideas and complete the exercises.


Outline of the course:

1. BASIC PROBABILITY. Events, sample space and the axioms. Random variables. Expectation and variance.

2. CONDITIONAL PROBABILITY. Conditional probability and independence. Expectation and variance. Total probability and Bayes theorem.

3. STANDARD DISTRIBUTIONS. Binomial, Poisson, exponential and normal. Moment generating functions. Sampling distributions.

4. LINEAR REGRESSION. The linear regression model. Vector form of regression.

5. CLASSICAL ESTIMATION. Method of moments. Maximum likelihood. Properties of estimators. Hypothesis testing. Likelihood ratio test. Exercises.

6. THE NORMAL DISTRIBUTION. Transformations to normality. Approximations and the central limit theorem.

7. DERIVED DISTRIBUTIONS. Functions of random variables. Sums of independent variables. Student-t, chi-squared and F distributions. Exercises.

8. BAYESIAN ESTIMATION. Subjective probability and expert opinion. Definitions of prior, likelihood and posterior. Posterior estimation. Exercises.

Practical Exercises. Solutions to Practical Exercises. Solutions to Theoretical Exercises. Standard Distributions and Tables.

Useful references:
Rice JA, Mathematical Statistics and Data Analysis, 2nd Ed, Duxbury Press, 1995.
Stirzaker DR, Elementary Probability, CUP, 2003 (online at University library).


1 Basic Probability

1.1 Introduction

Probability is a branch of mathematics which rigorously describes uncertain (or random) systems and processes. It has its roots in the 16th/17th century with the work of Cardano, Fermat and Pascal, but it is also an area of modern development and research. Put simply, probability measures the likelihood or chance of some event occurring: probability zero means the event is impossible whereas a probability of 1 means that the event is certain. The larger the probability, the more likely the event. Applications include: modelling hereditary disease in genetics, pension calculations in actuarial science, stock pricing in finance, epidemic modelling in public health, and many more!

1.2 Events and axioms

The set of all possible outcomes is the sample space Ω (the Greek letter, capital “omega”), and we may be interested in the chance of some particular outcome, or event, occurring.

An event, often denoted A, B, C, ..., is a set of outcomes of an experiment. The set can be empty, A = ∅, giving an impossible event, or it can be equal to the sample space, A = Ω, giving a certain event. These extremes are not very interesting and so the event will usually be a non-empty, proper subset of the sample space.

Probabilities must satisfy the following simple rules:

The (Kolmogorov) axioms:

K1 Pr(A) ≥ 0 for any event A,

K2 Pr(Ω) = 1 for any sample space Ω,

K3 Pr(A ∪ B) = Pr(A) + Pr(B) for any mutually exclusive events A and B (that is, when A ∩ B = ∅).

Clearly, these are very basic properties but they are sufficient to allow many complex rules to be derived, such as:

The general addition rule:

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

FURTHER READING: Sections 1.2-1.4 of Stirzaker.
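
As a quick illustration (not from the original notes), the following Python sketch estimates Pr(A), Pr(B), Pr(A ∩ B) and Pr(A ∪ B) for two illustrative events on a fair die by simulation, and checks the general addition rule numerically; the events A and B are chosen purely for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # simulate a fair six-sided die

A = rolls % 2 == 0      # event A: the roll is even, {2, 4, 6}
B = rolls >= 4          # event B: the roll is at least 4, {4, 5, 6}

p_A, p_B = A.mean(), B.mean()
p_and = (A & B).mean()  # Pr(A ∩ B)
p_or = (A | B).mean()   # Pr(A ∪ B)

# general addition rule: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
print(p_or, p_A + p_B - p_and)   # both close to 4/6 ≈ 0.667
```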


1.3 Random variables

Whenever the outcome of a random experiment is a number, then the experiment can be described by a random variable. It is conventional to use capital letters to denote random variables, e.g. X, Y, Z.

The range space of a random variable X is the set, S_X, of all possible values of the random variable, e.g. S_X = {a_1, a_2, ..., a_r, ...} or S_X = [0, ∞). A discrete random variable is a random variable with a finite (or countably infinite) range space. A continuous random variable is a random variable with an uncountable range space.

For a discrete random variable, X say, the probability of the random variable taking a particular element of the range space is Pr(X = a_r) (or p_X(x)) – this is called the probability mass function. When the random variable, Y say, is continuous we have a function, f_Y(y), to describe the density of probability over the range space – this is called the probability density function.

Alternatively, the probabilities may be summarised by a distribution function defined by

FZ(z) = Pr(Z ≤ z).

For discrete random variables this is obtained by summing the probability mass function, and for continuous random variables by integrating the probability density function,

F_X(x) = \sum_{r \le x} p_X(r) \qquad \text{and} \qquad F_Y(y) = \int_{-\infty}^{y} f_Y(t)\, dt.

As a consequence of this last result, a probability density function can be obtained from the corresponding distribution function by differentiating:

f_Y(y) = \frac{d}{dy} F_Y(y).

FURTHER READING: Sections 2.1, 2.2 and 15.3.2 of Rice and Section 4.1 of Stirzaker. An interesting discussion of randomness can be found at http://en.wikipedia.org/wiki/Randomness
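
As an illustration of how a distribution function follows from a density, here is a hedged Python sketch for the exponential distribution: the pdf is integrated numerically and compared with the closed-form CDF from scipy; the rate λ = 2 and the point y = 1.5 are arbitrary choices.

```python
import numpy as np
from scipy import integrate, stats

lam = 2.0                                    # illustrative rate parameter
f = lambda t: lam * np.exp(-lam * t)         # exponential pdf f_Y(t)

y = 1.5
F_numeric, _ = integrate.quad(f, 0, y)       # F_Y(y) = ∫_0^y f_Y(t) dt
F_exact = stats.expon(scale=1/lam).cdf(y)    # closed form 1 − e^{−λy}

print(F_numeric, F_exact)                    # both ≈ 0.9502
```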


1.4 Expectations and variance

The expectation (or mean) of a random variable is defined as:

E[X] = \begin{cases} \sum_x x\, p(x) & \text{for discrete } X \\ \int_x x\, f(x)\, dx & \text{for continuous } X \end{cases}

and the moments (about zero) are defined by:

E[X^r] = \begin{cases} \sum_x x^r p(x) & \text{for discrete } X \\ \int_x x^r f(x)\, dx & \text{for continuous } X \end{cases}

The expectation of a function of a random variable is given by:

E[g(X)] = \begin{cases} \sum_x g(x)\, p(x) & \text{for discrete } X \\ \int_x g(x)\, f(x)\, dx & \text{for continuous } X \end{cases}

The variance is

Var(X) = E[(X - \mu)^2] = \begin{cases} \sum_x (x - \mu)^2 p(x) & \text{(discrete)} \\ \int_x (x - \mu)^2 f(x)\, dx & \text{(continuous)} \end{cases}

where µ = E[X]. It is usually easier, however, to calculate the variance using

Var(X) = E[X²] − (E[X])²

where E[X²] is the expectation of X squared, i.e. the second moment about zero.

FURTHER READING: Section 4.3 of Stirzaker, and http://en.wikipedia.org/wiki/Expected_value
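
To make the variance shortcut concrete, here is a small Python sketch using a made-up pmf (the values and probabilities are purely illustrative); it computes E[X], E[X²] and then Var(X) = E[X²] − (E[X])².

```python
import numpy as np

# illustrative pmf: values and probabilities invented for this sketch
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.3, 0.4, 0.2])
assert np.isclose(p.sum(), 1.0)    # a valid pmf sums to one

mean = np.sum(x * p)               # E[X] = Σ x p(x)
EX2 = np.sum(x**2 * p)             # E[X²], the second moment about zero
var = EX2 - mean**2                # Var(X) = E[X²] − (E[X])²

print(mean, var)                   # 1.7 and 0.81
```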


2 Conditional Probability

2.1 Definitions

For two discrete random variables, X and Y , we have:

JOINT PROBABILITY MASS FUNCTION:

p(x, y) = Pr(X = x, Y = y),

where (i) 0 ≤ p(x, y) ≤ 1, for all x, y, and

(ii) \sum_x \sum_y p(x, y) = 1.

MARGINAL PROBABILITY MASS FUNCTIONS:

p_X(x) = \sum_y p(x, y) \qquad \text{and} \qquad p_Y(y) = \sum_x p(x, y)

CONDITIONAL PROBABILITY MASS FUNCTIONS:

p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)} \quad \text{where } p_Y(y) > 0, \qquad p_{Y|X}(y|x) = \frac{p(x, y)}{p_X(x)} \quad \text{where } p_X(x) > 0.

FURTHER READING: Chapter 3 and Section 4.4 of Rice, and Sections 2.1 and 2.2 of Stirzaker.
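
The following sketch (not from the original notes) shows the same definitions numerically for a small, made-up joint pmf stored as a table: the marginals are row and column sums, and a conditional pmf is a row divided by the corresponding marginal probability.

```python
import numpy as np

# hypothetical joint pmf p(x, y): rows index x ∈ {0, 1}, columns index y ∈ {0, 1, 2}
p = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.25, 0.20]])
assert np.isclose(p.sum(), 1.0)

p_X = p.sum(axis=1)            # marginal p_X(x) = Σ_y p(x, y)
p_Y = p.sum(axis=0)            # marginal p_Y(y) = Σ_x p(x, y)

p_Y_given_X1 = p[1] / p_X[1]   # conditional pmf p_{Y|X}(y|x=1) = p(1, y)/p_X(1)

print(p_X, p_Y, p_Y_given_X1)
```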


Continuous case

For two continuous random variables, X and Y , we have:

JOINT PROBABILITY DENSITY FUNCTION:

f(x, y), \quad -\infty < x < \infty, \ -\infty < y < \infty,

where (i) f(x, y) ≥ 0, for all x, y, and

(ii) \int_{y=-\infty}^{\infty} \int_{x=-\infty}^{\infty} f(x, y)\, dx\, dy = 1.

MARGINAL PROBABILITY DENSITY FUNCTIONS:

f_X(x) = \int_{y=-\infty}^{\infty} f(x, y)\, dy \qquad \text{and} \qquad f_Y(y) = \int_{x=-\infty}^{\infty} f(x, y)\, dx

CONDITIONAL PROBABILITY DENSITY FUNCTIONS:

f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)} \quad \text{where } f_Y(y) > 0, \qquad f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)} \quad \text{where } f_X(x) > 0.

2.2 Independent random variables

Two random variables X and Y are independent if and only if

p(x, y) = pX(x)pY (y) for all discrete x, y

f(x, y) = fX(x)fY (y) for all continuous x, y.

FURTHER READING: Sections 5.1-5.3 and 4.4 of Stirzaker.


2.3 Expectations and correlation

Consider random variables X and Y, with joint probability density function f(x, y) or joint probability mass function p(x, y); then for any function h(x, y):

E[h(X, Y)] = \begin{cases} \sum_x \sum_y h(x, y)\, p(x, y) & \text{for discrete } X, Y \\ \int_x \int_y h(x, y)\, f(x, y)\, dy\, dx & \text{for continuous } X, Y. \end{cases}

For example, the (r, s)th moment about zero, E[X^r Y^s], has h(x, y) = x^r y^s, so that E[XY] uses h(x, y) = xy. Further, the (r, s)th moment about the mean, given by E[(X − µ_X)^r (Y − µ_Y)^s], has h(x, y) = (x − µ_X)^r (y − µ_Y)^s, where µ_X = E[X] and µ_Y = E[Y].

Then the correlation of X and Y can be found as:

Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\, Var(Y)}}

where Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] is the covariance of X and Y. If X and Y are independent, then the covariance is zero and hence the correlation is zero – whenever the correlation is zero, the variables are said to be uncorrelated. Note, however, that in general uncorrelated does not mean the variables are independent.
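
A simulation makes the final point vivid: in the sketch below (an illustration, not part of the notes) Y = X² is a deterministic function of X ~ N(0, 1), so the two are clearly dependent, yet their sample correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)
y = x**2                     # Y is a function of X, so X and Y are dependent

cov = np.mean((x - x.mean()) * (y - y.mean()))   # sample covariance
corr = cov / (x.std() * y.std())                 # sample correlation

print(corr)   # ≈ 0: uncorrelated, even though far from independent
```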

Given random variables X and Y , the conditional expectation of Y given that X = x is:

E[Y | X = x] = \begin{cases} \sum_y y\, p_{Y|X}(y|x) & \text{(discrete)} \\ \int_y y\, f_{Y|X}(y|x)\, dy & \text{(continuous)}. \end{cases}

Clearly, in either of these definitions the conditional distribution could be replaced by the ratio of the joint distribution to the appropriate marginal distribution.

For any function h(Y ), the conditional expectation of h(Y ) given X = x is given by

E[h(Y) | X = x] = \begin{cases} \sum_y h(y)\, p_{Y|X}(y|x) & \text{(discrete)} \\ \int_y h(y)\, f_{Y|X}(y|x)\, dy & \text{(continuous)}. \end{cases}


2.4 Total probability and Bayes Theorem

Suppose that we are interested in the probability of some event A, but that it is not easy to evaluate Pr(A) directly. Firstly, let the events B_1, B_2, . . . , B_k partition the sample space. For B_1, B_2, . . . , B_k to be a partition of the sample space Ω, they must be (i) mutually exclusive, that is B_i ∩ B_j = ∅ (for i ≠ j), and (ii) exhaustive, that is B_1 ∪ B_2 ∪ · · · ∪ B_k = Ω. Further suppose that we can easily find Pr(A|B_j) (for j = 1, . . . , k); then

Total probability rule:

Pr(A) = \sum_{j=1}^{k} Pr(A|B_j)\, Pr(B_j).

Further, suppose that we have a conditional probability, Pr(A|B) for example, but we are interested in the probability of the events conditioned the other way, that is Pr(B|A); then

Bayes theorem (1):

Pr(B|A) = \frac{Pr(A|B)\, Pr(B)}{Pr(A)} \quad \text{when } Pr(A) > 0.

In general, if B_1, B_2, . . . , B_k is a partition, as above, and we use the total probability rule, then we can write

Bayes theorem (2):

Pr(B_i|A) = \frac{Pr(A|B_i)\, Pr(B_i)}{\sum_{j=1}^{k} Pr(A|B_j)\, Pr(B_j)}, \qquad i = 1, \ldots, k.

FURTHER READING: Section 2.1 of Stirzaker.
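
The sketch below applies both rules to a two-event partition with made-up numbers (a hypothetical screening test); it is only an illustration of the arithmetic, not a result from the notes.

```python
# Illustration of the total probability rule and Bayes theorem with hypothetical values.
# Partition: B1 = "has the condition", B2 = "does not"; event A = "test is positive".
p_B = [0.01, 0.99]            # prior probabilities Pr(B1), Pr(B2)
p_A_given_B = [0.95, 0.02]    # Pr(A|B1), Pr(A|B2)

# total probability rule: Pr(A) = Σ_j Pr(A|Bj) Pr(Bj)
p_A = sum(a * b for a, b in zip(p_A_given_B, p_B))

# Bayes theorem: Pr(B1|A) = Pr(A|B1) Pr(B1) / Pr(A)
p_B1_given_A = p_A_given_B[0] * p_B[0] / p_A

print(p_A, p_B1_given_A)      # ≈ 0.0293 and ≈ 0.324
```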


3 Standard Distributions

3.1 Example distributions

Binomial distribution, B(n, π)

The binomial distribution describes the number of successes in n independent Bernoulli trials, each with two possible outcomes (success and failure) occurring with probabilities π and 1 − π.

p(x) = \binom{n}{x} \pi^x (1 - \pi)^{n-x}, \quad x = 0, 1, ..., n \quad (0 < \pi < 1).

E[X] = n\pi \qquad Var(X) = n\pi(1 - \pi)

Poisson distribution, Po(λ)

The Poisson distribution is often used as a model for the number of occurrences of rare events in time or space, such as radioactive decays.

p(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, ... \quad (\lambda > 0).

E[X] = \lambda \qquad Var(X) = \lambda

Exponential distribution, exp(λ)

The exponential distribution is often used to describe the time between events which occur at random, or to model “lifetimes”. It possesses the so-called “memoryless” property.

f(x) = λe−λx, x ≥ 0 (λ > 0).

E[X] = \frac{1}{\lambda} \qquad Var(X) = \frac{1}{\lambda^2}

Normal distribution, N(µ, σ2)

The normal (or Gaussian) distribution is the most widely used. It is convenient to use, often fits data well and can be theoretically justified (via the central limit theorem).

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2} \right\}, \quad -\infty < x < \infty.

E[X] = \mu \qquad Var(X) = \sigma^2

FURTHER READING: Sections 4.2, 4.3 and 7.1 of Stirzaker.
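
If scipy is available, the quoted means and variances can be checked directly; the parameter values below are arbitrary and the sketch is only a convenience, not part of the original notes.

```python
from scipy import stats

n, pi_, lam, mu, sigma = 10, 0.3, 2.0, 5.0, 1.5   # illustrative parameter values

for name, dist in [("Binomial",    stats.binom(n, pi_)),
                   ("Poisson",     stats.poisson(lam)),
                   ("Exponential", stats.expon(scale=1/lam)),
                   ("Normal",      stats.norm(mu, sigma))]:
    # mean() and var() should reproduce the formulas quoted above
    print(f"{name:12s} mean = {dist.mean():.3f}  variance = {dist.var():.3f}")
```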


3.2 Moment generating functions

The moment generating function (mgf) of a random variable X is defined as

M_X(t) = E[e^{tX}] = \begin{cases} \sum_x e^{tx} p_X(x) & \text{if discrete} \\ \int_x e^{tx} f_X(x)\, dx & \text{if continuous,} \end{cases}

and it exists provided the sum or integral converges in an interval containing t = 0.

1. The mgf is unique to a probability distribution.

2. By considering the (Taylor) power series expansion

M_X(t) = \sum_{r=0}^{\infty} \frac{t^r}{r!}\, E[X^r],

we see that E[X^r] is the coefficient of t^r/r!.

3. Moments can easily be found by differentiation:

E[X^r] = \left. \frac{d^r}{dt^r} M_X(t) \right|_{t=0}

i.e. E[X^r] is the rth derivative of M_X(t) evaluated at t = 0 (a small symbolic illustration is given after this list).

4. If X has mgf MX(t) and Y = aX + b, where a and b are constants, then the mgf of Y is

M_Y(t) = e^{bt} M_X(at)

5. If X and Y are independent random variables with mgfs M_X(t) and M_Y(t) respectively, then Z = X + Y has mgf given by

MZ(t) = MX(t)MY (t).

Extending this to n independent random variables X_i, i = 1, 2, ..., n, with mgfs M_{X_i}(t), the mgf of Z = \sum X_i is

M_Z(t) = M_{X_1}(t)\, M_{X_2}(t) \cdots M_{X_n}(t).

If the X_i, i = 1, 2, ..., n, are independent and identically distributed (i.i.d.) with common mgf M_X(t), then

M_Z(t) = [M_X(t)]^n.

6. If {X_n} is a sequence of random variables with mgfs M_{X_n}(t), and X is a random variable with mgf M_X(t) such that

\lim_{n \to \infty} M_{X_n}(t) = M_X(t),

then the limiting distribution of X_n is the distribution of X.
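
As a small symbolic illustration of property 3 (assuming the standard exponential mgf M_X(t) = λ/(λ − t) for t < λ), the sketch below differentiates the mgf with sympy and recovers the mean 1/λ and variance 1/λ².

```python
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = lam / (lam - t)                  # mgf of an exponential(λ) random variable, t < λ

EX = sp.diff(M, t, 1).subs(t, 0)     # first derivative at t = 0 gives E[X]
EX2 = sp.diff(M, t, 2).subs(t, 0)    # second derivative at t = 0 gives E[X²]
var = sp.simplify(EX2 - EX**2)

print(EX, var)                       # 1/lambda and lambda**(-2)
```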


3.3 Sampling and sampling distributions

The first task of any research project is the design of the investigation. It is important to gather all information regarding the problem from historical records and from experts. This allows each part of the experimental design, modelling and even analysis to be planned.

The target population is the set of all people, products or things about which we would like to draw conclusions. Typically we will be interested in some particular characteristic of the population, such as weight or the risk associated with a particular financial product. The sample is a, usually small, subset of the population and is selected in such a way as to be representative of the population. We will then use the sample to draw conclusions about the population. The choice of sample size depends on many factors such as the sampling method, the natural variability, measurement error and the required precision of any estimation or the power of any hypothesis tests.

Suppose we have a random sample of n observations or measurements, x_1, . . . , x_n, of a random variable X. It is very common to summarise the sample using a small number of sample statistics, rather than report the whole sample. The most usual summary statistics are the sample mean and the sample variance,

\bar{x} = \frac{1}{n} \sum x_i \qquad \text{and} \qquad s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2.

Other sample summaries are possible, such as median and mode as measures of centre or location of the distribution, or range and inter-quartile range as measures of the spread of the distribution.

As well as numerical statistics, it is common to consider graphical representations. Stem-and-leaf and box plots can be used to display the numerical summaries and are particularly useful for comparing general properties between samples. Also, histograms can help to choose, or confirm, a probability model. Numerical statistics are then used to estimate model parameters, for example using the sample proportion to estimate the probability in the binomial, or the sample mean and variance to estimate the population mean and variance in the normal.

If we were to repeat the sampling process to obtain other datasets, then we would not expect the various summary statistics to be unchanged – this is due to sampling variation. We can imagine performing the sampling many times and looking at the distribution of the summary statistic – this is the sampling distribution.

Suppose we have a random sample, X_1, . . . , X_n, from a normal population with mean µ and variance σ². It can be shown that the sampling distribution of the sample mean also has a normal distribution with mean µ but with variance σ²/n, that is X̄ ∼ N(µ, σ²/n). We can also derive results about other distributions. For example, a good estimator of the probability, π, in the binomial distribution is π̂ = X/n, and for the Poisson λ̂ = X̄ is a good choice. Notice that each of these is a function of the mean and, although the data are not from a normal distribution, we can call on the central limit theorem, if we have a large sample, to justify a normal approximation. That is, in the binomial π̂ is approximately N(π, π(1 − π)/n), and for the Poisson λ̂ approximately follows N(λ, λ/n).
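
The sampling distribution of the mean is easy to see by simulation; in this hedged sketch the population parameters and sample size are invented, and the empirical mean and variance of the sample means should be close to µ and σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 10.0, 2.0, 25          # illustrative population parameters and sample size

# draw many samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)

# theory: X̄ ~ N(µ, σ²/n), so the means should centre on µ with variance σ²/n = 0.16
print(sample_means.mean(), sample_means.var())
```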


Exercises

(1.1) Let X be a random variable with probability mass function pX(x) given by

x        -3    -1    0     1     2     3     5     8
pX(x)    0.1   0.2   0.15  0.2   0.1   0.15  0.05  0.05

Check that pX(x) defines a valid probability distribution, then calculate P(1 ≤ X ≤ 4) and P(X is negative). Evaluate the expected value E[X] and the variance Var(X).

(1.2) Suppose that X has a PDF

fX(x) = cx(2− x) for 0 ≤ x ≤ 2.

Find the constant c so that this is a valid PDF. Obtain the cumulative distribution function FX(x), and then find the probability that X > 1.

(2.1) Let X and Y have the joint probability mass function given in the following table.

                Value of y
x        -2      -1      0       1
1       1/32    3/32    3/32    1/32
2       1/16    3/16    3/16    1/16
3       1/32    3/32    3/32    1/32

Find the cumulative distribution function of Y, and the conditional distribution of Y given X. Are X and Y independent?

(2.2) The joint PDF of X and Y is given by

f(x, y) = \frac{6}{7}\left(x^2 + \frac{xy}{2}\right), \quad 0 \le x \le 1, \ 0 \le y \le 2.

Find the marginal PDFs of X and Y, and then the cumulative distribution function of X. Evaluate the expectation of X, the expectation of X(X − Y), and the conditional expectation of X given that Y = 1.

(2.3) A laboratory blood test is 80% effective in detecting a certain disease when it is in fact present. However, the test also yields a ‘false positive’ result for 5% of healthy persons tested. Suppose that 0.4% of the population actually have the disease. What is the probability that a person found ‘ill’ according to the test does have the disease?

(3.1) An exam paper consists of 20 multiple choice questions with 5 possible answers each (only one is correct). In order to get a pass mark, it is necessary to give correct answers to at least 20% of questions.

(a) A student has decided to answer just by guessing. What is the probability that he would pass the exam?


(b) Suppose now that the student pursues an “educated guess”, in that he knows enough to be able to discard the two most unlikely answers for each of the 20 questions, and will guess at random on the remaining two answers. What are his chances of passing the exam now?

(3.2) Suppose that X has an exponential distribution with p.d.f. given by f(x) = λe^{−λx} for x ≥ 0 and f(x) = 0 otherwise, and suppose that λ = 2.

(a) Evaluate the probability Pr(X > 1/2).

(b) Find the value of x such that FX(x) = 1/2.

(c) Evaluate the probability Pr(X > 1 | X > 1/2).

(3.3) Suppose that X has a Poisson distribution with parameter λ; then find the MGF and hence the mean and variance of X. Using MGFs, show that the sum of two independent Poisson random variables is also a Poisson random variable.


4 Linear regression and least squares estimation

4.1 Introduction

In many sampling situations it is essential to consider related variables simultaneously. Even in situations where there is an exact physical law, measured data will be subject to random fluctuations and hence fitting a functional relationship to data is a common task. There may be some information before an experiment about the type of relationship expected, and this may have been used in the experimental design, but it is always wise to visualize the possible relationship using a scatterplot. The most commonly used model is the straight line. This may be due to a physical law which is linear, or as a local approximation to a nonlinear relationship. In other cases there will be no theoretical justification but it is simply chosen because the data seem to follow a linear pattern. In all cases, it is important to check that this assumption is reasonable both before the analysis, by drawing a scatterplot, and afterwards by performing a residual analysis – these are not covered in this course.

4.2 The linear regression model

Suppose we have a dataset containing n paired values, (x_i, y_i) : i = 1, . . . , n. Consider the simple linear regression model

yi = α + βxi + εi i = 1, . . . , n,

where

yi is the response or dependent variable,

α and β are regression parameters,

xi is the independent or explanatory variable, measured without error, and

εi is the random error term and is independent and identically distributed (iid) N(0, σ2).

Consider a general straight line passing through the data plotted as points on a scatterplot. In general, the points will not lie perfectly on the line, but instead there is an error, or residual, associated with each point. Let the straight line be defined by the equation y = α + βx, and the y-value on the line when x = x_i is denoted ŷ_i = α + βx_i. Now, since we are assuming that the explanatory variable is measured without error, the residuals are measured only in the y direction. So

r_i = y_i − ŷ_i = y_i − (α + βx_i),   i = 1, . . . , n.

Given observations (x_i, y_i), i = 1, . . . , n, we estimate the regression parameters by least squares. To do this, we minimise the sum of squared residuals S = \sum_i r_i^2.


Using partial differentiation of S with respect to α and β separately we obtain

\frac{\partial S}{\partial \alpha} = -2 \sum_i (y_i - \alpha - \beta x_i), \qquad \frac{\partial S}{\partial \beta} = -2 \sum_i x_i (y_i - \alpha - \beta x_i).

The minimum can be found by solving these equations, giving parameter estimates α̂ and β̂, when ∂S/∂α = 0 and ∂S/∂β = 0, that is when

\sum_i y_i = n\hat{\alpha} + \hat{\beta} \sum_i x_i

\sum_i x_i y_i = \hat{\alpha} \sum_i x_i + \hat{\beta} \sum_i x_i^2

— these are called the normal equations. Dividing the first by n gives ȳ = α̂ + β̂x̄; then substituting for α̂ in the fitted equation, ŷ = α̂ + β̂x, gives ŷ − ȳ = β̂(x − x̄). Notice that when x = x̄ then ŷ = ȳ, that is, the line passes through the centroid, (x̄, ȳ), of the data.

Now, dividing the first normal equation by n and re-arranging gives

\hat{\alpha} = \frac{1}{n} \sum_i y_i - \hat{\beta}\, \frac{1}{n} \sum_i x_i

which can be substituted into the second normal equation, and after a few steps leads to

\sum_i x_i y_i = \sum_i x_i \sum_i y_i / n + \hat{\beta} \left( \sum_i x_i^2 - \sum_i x_i \sum_i x_i / n \right).

This gives the result, in two alternative forms,

\hat{\beta} = \frac{\sum_i x_i y_i - \sum_i x_i \sum_i y_i / n}{\sum_i x_i^2 - \sum_i x_i \sum_i x_i / n} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}.

Summarizing this, the least squares estimators are

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \qquad \text{and} \qquad \hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}.

We can also write β̂ = S_{xy}/S_{xx}, where we define

S_{xx} = \sum_i (x_i - \bar{x})^2 \qquad \text{and} \qquad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}).

To make predictions of the response, y, corresponding to values of the explanatory variable, x, we simply substitute into the fitted equation, ŷ = α̂ + β̂x. Similarly, fitted values, ŷ_i, can be calculated corresponding to observed values of the explanatory variable, x_i, by substitution as ŷ_i = α̂ + β̂x_i.
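
A minimal numerical sketch of these formulas, using a tiny made-up dataset (the numbers carry no meaning beyond illustration):

```python
import numpy as np

# tiny illustrative dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])

Sxx = np.sum((x - x.mean())**2)                  # Σ (xi − x̄)²
Sxy = np.sum((x - x.mean()) * (y - y.mean()))    # Σ (xi − x̄)(yi − ȳ)

beta_hat = Sxy / Sxx                             # slope estimate β̂
alpha_hat = y.mean() - beta_hat * x.mean()       # intercept estimate α̂

y_fitted = alpha_hat + beta_hat * x              # fitted values ŷi
print(alpha_hat, beta_hat)                       # ≈ 1.33 and 0.77
```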


4.3 Vector form of linear regression

Consider the centred linear regression model where x̄ has been subtracted from all x-values

y_i = α′ + β(x_i − x̄) + ε_i,   i = 1, . . . , n.

We keep the same notation for the slope parameter but relabel the intercept. Comparing the two versions we see that α′ = α + βx̄, so α̂′ = α̂ + β̂x̄ = (ȳ − β̂x̄) + β̂x̄ = ȳ.

We can write our centred regression model in vector form as y = Xθ + ε where

y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_1 - \bar{x} \\ 1 & x_2 - \bar{x} \\ \vdots & \vdots \\ 1 & x_n - \bar{x} \end{pmatrix}, \quad \theta = \begin{pmatrix} \alpha' \\ \beta \end{pmatrix}, \quad \text{and} \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.

Note that we could write the uncentred model in a similar form — we would just have to change the second column of X. Note also that we could include multiple explanatory variables simply by adding more columns to X and more parameters to θ.

We can estimate θ by least squares. Note that r = y − Xθ, and we minimise

S = \sum_i r_i^2 = r^T r
  = (y - X\theta)^T (y - X\theta)
  = y^T y - (X\theta)^T y - y^T X\theta + (X\theta)^T (X\theta)
  = y^T y - 2\theta^T X^T y + \theta^T X^T X\theta.

Differentiating S with respect to θ gives

\frac{\partial S}{\partial \theta} = -2X^T y + 2X^T X\theta,

and setting this to zero gives the set of equations X^T X\theta = X^T y, which defines the least squares estimators

\begin{pmatrix} \hat{\alpha}' \\ \hat{\beta} \end{pmatrix} = \hat{\theta} = (X^T X)^{-1} X^T y.

You should check for yourself that this gives the same parameter estimates as before.

Example: Consider the following data on pullover sales (number of pullovers sold) and price (in EUR) per item. The aim is to discover if the price (explanatory variable) influences the overall sales (response variable).

Sales   230  181  165  150   97  192  181  189  172  170
Price   125   99   97  115  120  100   80   90   95  125


To investigate the relationship between sales (y) and price (x) we can calculate the following values.

Price, xi    x²i      Sales, yi    xi·yi
 125        15625       230        28750
  99         9801       181        17919
  97         9409       165        16005
 115        13225       150        17250
 120        14400        97        11640
 100        10000       192        19200
  80         6400       181        14480
  90         8100       189        17010
  95         9025       172        16340
 125        15625       170        21250
Total 1046  111610      1727       179844
      (Σxi) (Σx²i)      (Σyi)      (Σxiyi)

[Fig: Scatterplot of pullover sales (y) against price (x), with the fitted equation.]

First, the means are x̄ = 104.6 and ȳ = 172.7, and then Sxx = 2198.4 and Sxy = −800.2, giving β̂ = Sxy/Sxx = −800.2/2198.4 = −0.364 (to 3 d.p.) and α̂ = ȳ − β̂x̄ = 172.7 + 0.364 × 104.6 = 210.774. Hence we have the fitted equation

Sales = 210.774 − 0.364 × Price.

Alternatively, we can fit our regression model by constructing the matrix representation form, defining

y = \begin{pmatrix} 230 \\ 181 \\ 165 \\ 150 \\ 97 \\ 192 \\ 181 \\ 189 \\ 172 \\ 170 \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} 1 & 20.4 \\ 1 & -5.6 \\ 1 & -7.6 \\ 1 & 10.4 \\ 1 & 15.4 \\ 1 & -4.6 \\ 1 & -24.6 \\ 1 & -14.6 \\ 1 & -9.6 \\ 1 & 20.4 \end{pmatrix}.

Then we can find α̂′ and β̂ using

\begin{pmatrix} \hat{\alpha}' \\ \hat{\beta} \end{pmatrix} = (X^T X)^{-1} X^T y = \begin{pmatrix} 10.0 & 0.0 \\ 0.0 & 2198.4 \end{pmatrix}^{-1} \begin{pmatrix} 1727.0 \\ -800.2 \end{pmatrix} = \begin{pmatrix} 172.7 \\ -0.364 \end{pmatrix}.

Hence α̂ = α̂′ − β̂x̄ = 172.7 + 0.364 × 104.6 = 210.774, as before.
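
For completeness, here is a sketch that reproduces the matrix calculation above with numpy, using the pullover data from the example; the numbers should match α̂′ ≈ 172.7, β̂ ≈ −0.364 and α̂ ≈ 210.77.

```python
import numpy as np

# data from the pullover example above
sales = np.array([230, 181, 165, 150, 97, 192, 181, 189, 172, 170], dtype=float)
price = np.array([125, 99, 97, 115, 120, 100, 80, 90, 95, 125], dtype=float)

# centred design matrix: a column of ones and (x_i − x̄)
X = np.column_stack([np.ones_like(price), price - price.mean()])

# least squares: θ̂ = (XᵀX)⁻¹ Xᵀy, solved as a linear system
theta_hat = np.linalg.solve(X.T @ X, X.T @ sales)
alpha_dash_hat, beta_hat = theta_hat

alpha_hat = alpha_dash_hat - beta_hat * price.mean()   # recover the uncentred intercept
print(alpha_dash_hat, beta_hat, alpha_hat)
```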


5 Classical Estimation and Hypothesis Testing

5.1 Introduction

Statistical inference is the process where we attempt to say something about an unknown probability model based on a set of data which were generated by the model. This inference does not have the status of absolute truth, since there will be (infinitely) many probability models which are consistent with a given set of data. All we can do is to establish that some of these models are plausible, while others are implausible.

A common approach is to use a probability model for the data which is completely specified except for the numerical values of a finite number of quantities called parameters. In this chapter we will introduce methods for making inferences about parameters assuming that the given model is correct. The idea is to use the data, x^T = (x_1, x_2, ..., x_n), to make a “good guess” at the numerical value of a parameter, θ.

An ESTIMATE, θ̂ = θ̂(x), is a numeric value which is a function of the data. An ESTIMATOR is a random variable, θ̂(X), which is a function of a random sample X^T = (X_1, X_2, ..., X_n).

5.2 Method of Moments

Assume that the X_i are mutually independent with common p.d.f. f(x; θ_1, θ_2, ..., θ_p). Then the rth population moment (about zero) is

E[X^r] = µ'_r(θ_1, θ_2, ..., θ_p)

and the rth sample moment is

m'_r = \frac{1}{n} \sum_{i=1}^{n} x_i^r.

The method of moments estimates θ̂_1, θ̂_2, ..., θ̂_p are the solution of the p simultaneous (non-linear) equations

µ'_r(θ̂_1, θ̂_2, ..., θ̂_p) = m'_r,    r = 1, 2, ..., p.

This method of estimation has no general optimality properties and sometimes does very badly, but usually provides sensible initial guesses for numerical search procedures.

Example

Let X be an exponential random variable with unknown parameter λ. Now let {x_i : i = 1, ..., n} be a set of independent observations of this variable. The first sample moment is the sample mean x̄, and the first population moment is the expectation of X, i.e. 1/λ. Hence, we find the method of moments estimate of λ by solving 1/λ̂ = x̄, that is λ̂ = 1/x̄.

FURTHER READING: Sections 8.1 to 8.5 of Rice.
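
A quick simulated check of this example (the true rate below is arbitrary): matching the first moment to the sample mean recovers a value close to the rate used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(3)
true_lambda = 2.5                                        # illustrative "unknown" rate
x = rng.exponential(scale=1/true_lambda, size=5_000)     # simulated observations

# method of moments: solve 1/λ̂ = x̄
lambda_mom = 1 / x.mean()
print(lambda_mom)    # close to 2.5
```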


5.3 Maximum likelihood estimation

The joint pdf of a set of data can be written as f(x; θ). Think of this as a function of θ for a particular data set, and define the likelihood function

L(θ) = f(x; θ).

An obvious guess at θ is the value which maximises the likelihood, that is the most plausible value given the data. This value is called the maximum likelihood estimate (mle).

For technical reasons it is usual to work with the log-likelihood, l(θ) = log L(θ) = log f(x; θ). Further note that if the X_i are mutually independent with common pdf f(·) then

l(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta).

Maximum likelihood estimation enjoys strong optimality properties (at least in large samples).

Example

Again let X be an exponential random variable with unknown parameter λ. Now let {x_i : i = 1, ..., n} be a set of independent observations of this variable.

The log-likelihood is given by

l(\lambda) = \sum_{i=1}^{n} \log f(x_i; \lambda) = n \log(\lambda) - \lambda \sum_{i=1}^{n} x_i.

To find the maximum, differentiate with respect to λ, set equal to zero and solve. This produces the estimate λ̂ = 1/x̄. In this case, the m.l.e. is the same as the method of moments estimate.
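
The same estimate can be found numerically by maximising the log-likelihood, which is often how mles are computed when no closed form exists; this sketch (with simulated data and an arbitrary true rate) checks that the numerical maximiser agrees with λ̂ = 1/x̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.exponential(scale=1/2.5, size=2_000)   # simulated data, true λ = 2.5 (illustrative)

def neg_loglik(lam):
    # negative of l(λ) = n log λ − λ Σ xᵢ
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method="bounded")
print(res.x, 1 / x.mean())    # numerical maximiser ≈ closed-form mle
```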

5.4 Properties

1. The most important property is UNBIASEDNESS. An estimator, θ̂, is unbiased for θ if

E[θ̂] = θ.

If, for small n, E[θ̂] ≠ θ, but E[θ̂] → θ as n → ∞, then the estimator is ASYMPTOTICALLY UNBIASED.

2. An (unbiased) estimator is CONSISTENT if

Var(θ̂) → 0 as n → ∞.

3. If we have two (or more) estimators, θ̂ and θ̃, which are unbiased, then we might choose the one with smallest variance. The EFFICIENCY of θ̂ relative to θ̃ is defined to be

eff(θ̂, θ̃) = \frac{Var(θ̃)}{Var(θ̂)}.


5.5 Hypothesis Testing

Let X_1, . . . , X_n be a random sample from a distribution. The general approach to statistical testing is to consider whether the data are consistent with some stated theory or hypothesis.

A hypothesis is a statement about the true probability model, though usually this only concerns the parameter within some specified family, for example N(µ, 1).

A simple hypothesis specifies a single point value for the parameter, for example µ = µ_0, whereas a composite hypothesis specifies a range, or set, of values, for example µ < µ_0 or µ ≠ µ_0.

We usually assume that there are two rival hypotheses:

• Null hypothesis, H0 : µ = µ0 (usually well-defined and simple),

• Alternative hypothesis, H1 : “some statement” which is a competitor to H0.

Note: We do not claim to show that H_0 or H_1 is true, but only to assess if the data provide sufficient evidence to doubt H_0. The null hypothesis usually represents a “benchmark” or a “skeptical stance”, for example “this treatment has no effect on the response”, and we will only reject it if there is overwhelming evidence against it.

We make a decision whether to accept H_0 or to accept H_1 (that is, reject H_0) on the basis of the data. Of course any conclusion is bound to be chancy! There must be a (non-zero) probability of a wrong action, and this is a major characteristic of a statistical test procedure.

The types of error can be summarised as follows:

Decision       H0 True    H0 False
Accept H0      Correct    Wrong
Reject H0      Wrong      Correct

Then we define α, the significance level:

α = Pr(Type I Error) = Pr(Reject H0 when H0 is true).

This is considered the more important of the two types of error, and of course we want α to be small. Next consider the other error, and define its probability as β,

β = Pr(Type II Error) = Pr(Accept H0 when H0 is false)

which should be small. Also we define the power function

φ = 1− Pr(Type II error) = 1− β

and clearly we want this to be large – a powerful test. Note that this will be a function of the unknown true parameter.


5.6 Examples of simple hypothesis tests

Throughout the following examples suppose we have a sample of n observations x_1, x_2, . . . , x_n from a normally distributed population with mean µ and variance σ².

Example 1: Suppose that we know the population variance, but we do not know the population mean. From the sample we can estimate the population mean using the sample mean, µ̂ = x̄. We might now wish to test the hypothesis that the population mean is some specified value µ_0 compared to the hypothesis that it is not equal to the specified value. That is, null hypothesis H_0 : µ = µ_0 against alternative hypothesis H_1 : µ ≠ µ_0.

A suitable test statistic is the (observed) z-value

z_{obs} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}

which is then compared to the standard normal distribution.

Example 2: Suppose now that the population variance is unknown. For the same hypotheses as above, H_0 : µ = µ_0 against H_1 : µ ≠ µ_0, the corresponding test statistic is the (observed) t-value

t_{obs} = \frac{\bar{x} - \mu_0}{s_{n-1}/\sqrt{n}}

where s_{n-1} is the sample standard deviation defined by s_{n-1}^2 = \sum (x_i - \bar{x})^2 / (n - 1). This is compared to the (so-called) t-distribution with n − 1 degrees of freedom.
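
As a hedged illustration of Example 2, the sketch below computes t_obs by hand for a small made-up sample and compares it with scipy's one-sample t-test; the data and µ0 are invented for this example.

```python
import numpy as np
from scipy import stats

# hypothetical sample; test H0: µ = µ0 against H1: µ ≠ µ0 with µ0 = 5
x = np.array([5.3, 4.8, 5.6, 5.1, 4.9, 5.4, 5.2, 4.7])
mu0 = 5.0

t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))   # the t statistic above
t_scipy, p_value = stats.ttest_1samp(x, mu0)                   # library equivalent

print(t_obs, t_scipy, p_value)   # the two t values agree
```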

5.7 The likelihood ratio test

Consider a random sample X_1, . . . , X_n from some distribution with parameter θ, and suppose that we wish to test H_0 : θ = θ_0 against H_1 : θ ≠ θ_0. The likelihood ratio statistic is defined as:

\Lambda = \frac{L(\theta_0)}{L(\hat{\theta})}

where θ̂ is the maximum likelihood estimate of θ. Note that 0 ≤ Λ ≤ 1. If there are other unknown parameters, then these are replaced using the appropriate maximum likelihood estimates. We then reject the null hypothesis if Λ is less than some specified value, Λ_0 say. This is intuitive since values close to zero suggest H_1 is true, whereas values close to 1 suggest H_0 is true. It is usual to work with the log likelihood-ratio, λ = log Λ, and we reject H_0 when λ is large and negative (equivalently, when Λ is close to zero).

Equivalently, subject to some conditions, and when n is large, Wilks' theorem states that

W = −2 log Λ is approximately χ²₁ under H_0.

We now reject H_0 if W is large, and in particular if it is larger than χ²₁(1 − α) for a test at the 100α% significance level.
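
A minimal sketch of the likelihood ratio test for a Poisson mean, using simulated data with arbitrary values; it computes W = −2 log Λ and compares it with the 5% critical value of χ²₁.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.poisson(3.2, size=200)      # simulated counts; test H0: λ = 3 (values illustrative)
lam0 = 3.0
lam_hat = x.mean()                  # mle of the Poisson mean

def loglik(lam):
    # Poisson log-likelihood up to the additive constant −Σ log(xᵢ!), which cancels in W
    return np.sum(x * np.log(lam) - lam)

W = -2 * (loglik(lam0) - loglik(lam_hat))     # Wilks statistic −2 log Λ
crit = stats.chi2.ppf(0.95, df=1)             # 5% critical value of χ²₁
print(W, crit, W > crit)                      # reject H0 when W exceeds the critical value
```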


Exercises

(4.1) A study was made on the amount of converted sugar in a fermentation process at various temperatures. The data were coded (by subtracting 21.5 degrees centigrade from the temperatures) and recorded as follows:

Sugar remaining after fermentation

Temp., x   -0.5  -0.4  -0.3  -0.2  -0.1   0    0.1   0.2   0.3   0.4   0.5
Sugar, y    8.1   7.8   8.5   9.8   9.5   8.9   8.6  10.2   9.3   9.2  10.5

Why do you think that the data were coded by subtracting 21.5 from the temperatures?

Fit a linear regression model and find the regression equation for your model.

(4.2) Consider the multiple linear regression of a response y on two explanatory variables x and w using the model

yi = α + β1xi + β2wi + εi i = 1, . . . , n.

Assume that the predictors x and w are already centred, so \sum_i x_i = \sum_i w_i = 0. Use the method of least squares to find α̂ and show that β̂_1 is given by

\hat{\beta}_1 = \left(1 - \frac{S_{wx}^2}{S_{ww} S_{xx}}\right)^{-1} \left(\frac{S_{xy}}{S_{xx}} - \frac{S_{wx} S_{wy}}{S_{ww} S_{xx}}\right),

where Sxy and Sxx are as defined in the notes and

S_{ww} = \sum_i (w_i - \bar{w})^2, \quad S_{wx} = \sum_i (w_i - \bar{w})(x_i - \bar{x}), \quad \text{and} \quad S_{wy} = \sum_i (w_i - \bar{w})(y_i - \bar{y}).

(5.1) In a survey of 320 families with 5 children the number of girls occurred with the following frequencies:

Number of girls    0    1    2    3    4    5
Frequency          8   40   88  110   56   18

Explain why the binomial distribution might be a suitable model for this data and clearly state any assumptions. Derive the equation for the maximum likelihood estimator of p, the probability of a girl, and then estimate the value using the data.

(5.2) To study a particular currency exchange rate it is possible to model the daily change in log exchange rate by a normal distribution. Suppose the following is a random sample of 10 such values:

0.05 0.29 0.39 -0.18 0.11 0.15 0.35 0.28 -0.17 0.07

Use these data to perform a 5% hypothesis test that the mean change is equal to zero.


6 The Normal Distribution

6.1 Introduction

Many statistical methods of constructing confidence intervals and hypothesis tests are based on an assumption of normality. The assumption of normality often leads to procedures which are simple, mathematically tractable, and powerful compared to corresponding approaches which do not make the normality assumption. When dealing with large samples, results such as the central limit theorem give us confidence that small departures are unlikely to be important. With small samples, or when there are substantial violations of a normality assumption, the chances of misinterpreting the data and drawing incorrect conclusions seriously increase. Because of this we must carefully consider all assumptions throughout data analysis.

Once data have been collected it is important to check modelling assumptions. There are several ways to tell whether a dataset is substantially non-normal, such as calculation and testing of skewness and kurtosis, and examination of histograms and probability plots. Histograms “approximate” the true probability distribution but can be greatly affected by the choice of histogram bins etc. Another approach is to consider the probability plot (or quantile-quantile plot) where the expected or theoretical quantiles are plotted against the sample quantiles. If the model is a good fit to the data, then the points should form a straight line – departures from the line indicate departures from the model.

6.2 Definitions

Suppose that random variable X follows a normal distribution with mean µ and variance σ²; then it has probability density function

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2} \right\}, \quad -\infty < x < \infty,

and we might use the shorthand notation X ∼ N(µ, σ²). The cumulative distribution function (CDF) cannot be written as an explicit equation but must be evaluated numerically.

The normal density is unimodal (with mode at its mean) and symmetric about its mean. Hence its mean, median and mode are all equal. We can say that E[X] = µ and Var(X) = σ², but also that the coefficient of skewness, E[(X − µ)³/σ³] = 0, since it is symmetric, and that the coefficient of kurtosis E[(X − µ)⁴/σ⁴] = 3 (the excess kurtosis is defined as zero).

If a normal random variable Z has mean equal to zero and variance equal to one, then we have the standard normal distribution, Z ∼ N(0, 1). The PDF is sometimes given the notation f(z) = φ(z) and the CDF F(z) = Φ(z). Note that if X ∼ N(µ, σ²) then (X − µ)/σ ∼ N(0, 1).


6.3 Transformations to normality

Many data sets are in fact not approximately normal. However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution. This increases the applicability and usefulness of statistical techniques based on the normality assumption. The Box-Cox transformation is a particularly useful family of transformations defined as:

T(x; \lambda) = \begin{cases} (x^{\lambda} - 1)/\lambda & \text{for } \lambda \neq 0 \\ \log(x) & \text{for } \lambda = 0 \end{cases}

where λ is a transformation parameter. There are several important special cases: (i) SQUARE ROOT TRANSFORMATION with λ = 1/2. If necessary, make all values positive by adding a constant before taking the square root. (ii) LOG TRANSFORMATION with λ = 0. Again, it may be necessary to add a constant to make all values positive before taking logs. (iii) INVERSE TRANSFORMATION with λ = −1. Notice that simply inverting the values would make small numbers large, and large numbers small – this transformation would reverse the order of the values and great care would be needed in the interpretation. This is not a problem with the Box-Cox transform as the ordering of the values will be identical to the original data. Data transformations are valuable tools, offering many benefits, but greater care must be used when interpreting results based on transformed data.
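
In practice the transformation parameter is usually chosen from the data; the hedged sketch below uses scipy's boxcox, which picks λ by maximum likelihood, on simulated positive, right-skewed data (the lognormal parameters are arbitrary).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)   # skewed, positive data (illustrative)

# scipy chooses λ by maximum likelihood and returns the transformed values
x_transformed, lam = stats.boxcox(x)

print(lam)                                        # near 0 here, i.e. close to a log transform
print(stats.skew(x), stats.skew(x_transformed))   # skewness is greatly reduced
```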

6.4 Approximating distributions

Under certain conditions some probability distributions can be approximated by other distributions. Historically, this was important as it gave an easy way to perform probability calculations, but it also helps us to understand relationships between distributions and, later, to understand transformations from one distribution to another.

The POISSON APPROXIMATION TO THE BINOMIAL works well when n is large and p is small. As a rule of thumb we might consider the approximation satisfactory when, say, n ≥ 20 and p ≤ 0.05 (alternatively when n is large and the expected number of “successes” is small, that is np ≤ 10 say). Another way to think of this is that the Poisson will work well when we are modelling rare events in a very large population. The NORMAL APPROXIMATION TO THE BINOMIAL is reasonable when n is large, and p and (1 − p) are not too small; say np and n(1 − p) must be greater than 5. Note that the conditions for the Poisson approximation to the binomial are complementary to the conditions for the normal approximation to the binomial.

Perhaps the most powerful result is the CENTRAL LIMIT THEOREM (CLT). Suppose we have a random sample, X_1, X_2, . . . , X_n, from any distribution with finite mean, E[X], and variance, Var(X); then the CLT says that, as the sample size n tends to infinity, the distribution of the sample mean, X̄, tends to the normal with mean E[X] and variance equal to Var(X)/n.
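
A short simulation (not part of the notes) shows the CLT at work for a markedly non-normal parent distribution; the exponential parent and sample size are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 40                                    # illustrative sample size

# sample means from an exponential parent with mean 1 and variance 1
means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)

# the CLT suggests these means are approximately N(1, 1/n)
print(means.mean(), means.var())          # ≈ 1 and ≈ 1/40 = 0.025
print(stats.skew(means))                  # much closer to 0 than the parent's skewness of 2
```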


7 Derived Distributions

7.1 Introduction

Initially it may seem that each probability distribution is unrelated to any other distribution, but in fact many are related. As simple cases, the binomial, geometric and negative binomial are all generated by repeated Bernoulli trials. There are other examples where one random variable can be derived as a transformation of another, or where one random variable is obtained as a sum of others. Perhaps the most widely used transformations involve the normal distribution, such as linear functions, or the sum of squared normal random variables. Less obviously, we might consider a normal random variable divided by a sum of squared normal random variables, or the ratio of two sums of squared normal random variables. Each of these corresponds to a common example and the answers should be familiar distributions. In the next sections we will see the mathematical techniques needed to derive many of these results.

7.2 Functions of a random variable

For discrete random variables transformations are straightforward. Assuming that the range space and probability mass function of the original random variable are known, the range space for the transformed random variable can easily be deduced, and then the probability can be transferred to the elements of the new range space using an argument of equivalent events.

The corresponding treatment of continuous random variables is not so straightforward. We are not simply reallocating probability masses from elements in one range space to elements of another. In this situation, we are dealing with the more subtle concept of density of probability. The simplest approach is to calculate the (cumulative) distribution function of the transformed variable directly, and then differentiate to obtain the density function.

Example: Consider an exponential random variable X with parameter λ, X ∼ exp(λ), and let Y = X²; then

F_Y(y) = Pr(Y \le y) = Pr(X^2 \le y) = Pr(X \le \sqrt{y}) = F_X(\sqrt{y}) = 1 - e^{-\lambda\sqrt{y}}.

Now differentiate to give the density function

f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left(1 - e^{-\lambda\sqrt{y}}\right) = \frac{\lambda}{2\sqrt{y}}\, e^{-\lambda\sqrt{y}},

and the range space of Y is S_Y = [0, ∞). Note that this density function is unbounded at the origin, unlike the original density function. Although, normally, y = x² is not regarded as a one-to-one function, over the range space S_X it is behaving as one-to-one.

FURTHER READING: Sections 2.3, 3.6 and 4.5 of Rice.
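
The derived density can be checked by simulation; in the sketch below (λ is an arbitrary choice) the empirical probability Pr(Y ≤ 1) from simulated Y = X² agrees with the integral of f_Y and with the exact value 1 − e^{−λ}.

```python
import numpy as np
from scipy import integrate

rng = np.random.default_rng(8)
lam = 1.5                                               # illustrative rate
y = rng.exponential(scale=1/lam, size=200_000) ** 2     # Y = X² with X ~ exp(λ)

f_Y = lambda t: lam / (2 * np.sqrt(t)) * np.exp(-lam * np.sqrt(t))   # derived density

prob_sim = np.mean(y <= 1.0)                 # simulated Pr(Y ≤ 1)
prob_pdf, _ = integrate.quad(f_Y, 0, 1.0)    # ∫_0^1 f_Y(t) dt
print(prob_sim, prob_pdf, 1 - np.exp(-lam))  # all three ≈ 0.777
```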


Result

Let X be a continuous random variable with p.d.f. f_X(·). Suppose that g(·) is a strictly monotonic function; then the random variable Y = g(X) has p.d.f. f_Y given by:

f_Y(y) = \begin{cases} f_X\!\left(g^{-1}(y)\right) \left| \dfrac{d}{dy}\, g^{-1}(y) \right| & y = g(x) \text{ for some } x \\ 0 & y \neq g(x) \text{ for any } x. \end{cases}

If y = g(x) is not monotonic over the range of X, we split the range into parts for which the function is monotonic (a one-to-one relation holds):

f_Y(y) = \sum f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy}\, g^{-1}(y) \right|, \qquad y = g(x),

where the sum is over the separate parts of the range of X for which x and y are in one-to-one correspondence.

Example: Consider a random variable X with parameter λ, which has the following p.d.f.

f_X(x) = \frac{\lambda}{2} \exp(-\lambda |x|), \quad -\infty < x < \infty.

This density function looks like two exponential functions placed back-to-back, and hence is often referred to as the double exponential, or less descriptively as the Laplace distribution.

Consider the transformation Y = X². Clearly, with this range space for X, the transformation is not one-to-one. However, by dividing the range into the two parts −∞ < x < 0 and 0 < x < ∞, y = x² is monotonic over each half separately.

For (−∞, 0): x = -\sqrt{y}, dx/dy = -\tfrac{1}{2} y^{-1/2} and f_X(x) = \tfrac{\lambda}{2} \exp(\lambda x), hence f_Y(y) = \tfrac{\lambda}{4\sqrt{y}} \exp(-\lambda\sqrt{y}).

For (0, ∞): x = \sqrt{y}, dx/dy = \tfrac{1}{2} y^{-1/2} and f_X(x) = \tfrac{\lambda}{2} \exp(-\lambda x), hence f_Y(y) = \tfrac{\lambda}{4\sqrt{y}} \exp(-\lambda\sqrt{y}).

Summing the parts from the two ranges gives

f_Y(y) = \frac{\lambda}{2\sqrt{y}} \exp(-\lambda\sqrt{y}), \quad y \ge 0.

This is the same distribution as in the earlier example involving the exponential distribution and the transformation Y = X².


7.3 Transforming bivariate random variables

Suppose we wish to find the joint probability density function of a pair of random variables, Y_1 and Y_2, which are given functions of two other random variables, X_1 and X_2. Suppose further that Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2), and that the joint probability density function of X_1 and X_2 is f_{X_1,X_2}(x_1, x_2).

We assume the following conditions:

(I) The transformation (x_1, x_2) → (y_1, y_2) is one-to-one. That is, we can solve the simultaneous equations y_1 = g_1(x_1, x_2) and y_2 = g_2(x_1, x_2) for x_1 and x_2 to give x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2) (say). Transformations which are not one-to-one can be handled, but are more complicated except in special cases – such as for sums of independent random variables.

(II) The functions h_1 and h_2 have continuous partial derivatives and the Jacobian determinant is everywhere finite (that is |J| < ∞) where

J = \det \begin{pmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2mm] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{pmatrix}.

Note that there are other ways to write this; all are equivalent.

Then,

f_{Y_1,Y_2}(y_1, y_2) = |J|\, f_{X_1,X_2}(x_1, x_2),

substituting for x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2) where necessary. The range space for (y_1, y_2) is obtained by applying the inverse transformation to the constraints on x_1 and x_2.

Example: If X_1 and X_2 are independent exponential random variables each with parameter λ, then

f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2) = \lambda^2 e^{-\lambda(x_1 + x_2)}, \quad x_1, x_2 \ge 0.

Now, if Y_1 = X_1 + X_2 and Y_2 = e^{X_1}, then x_1 = h_1(y_1, y_2) = \log(y_2) and x_2 = h_2(y_1, y_2) = y_1 - \log(y_2). Now, the Jacobian matrix is

\begin{pmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2mm] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{pmatrix} = \begin{pmatrix} 0 & \dfrac{1}{y_2} \\[2mm] 1 & -\dfrac{1}{y_2} \end{pmatrix}

and so the absolute value of its determinant is |J| = 1/y_2 (this is finite because it can also be shown that y_2 ≥ 1). Then

f_{Y_1,Y_2}(y_1, y_2) = \frac{1}{y_2}\, \lambda^2 e^{-\lambda y_1}, \quad y_1 \ge \log y_2, \ y_2 \ge 1.

7.4 Sums of independent random variables

Some results

1. If X_1, ..., X_n are independent Poisson random variables with parameters λ_1, ..., λ_n, then X_1 + ... + X_n also has a Poisson distribution with parameter (λ_1 + ... + λ_n).


2. If X_1, ..., X_k are independent binomial random variables with parameters (n_1, p), ..., (n_k, p), then X_1 + ... + X_k also has a binomial distribution with parameters (n_1 + ... + n_k, p).

3. If X_1, ..., X_n are independent gamma random variables with parameters (t_1, λ), ..., (t_n, λ), then X_1 + ... + X_n also has a gamma distribution with parameters (t_1 + ... + t_n, λ).

4. If X_1, ..., X_n are independent normal random variables with parameters (µ_1, σ_1²), ..., (µ_n, σ_n²), then X_1 + ... + X_n also has a normal distribution with parameters (µ_1 + ... + µ_n, σ_1² + ... + σ_n²).

Direct method

If X and Y are independent random variables then the probability function for Z = X + Y is

p_Z(z) = \sum_x p_X(x)\, p_Y(z - x) = \sum_y p_X(z - y)\, p_Y(y) \quad \text{if discrete,}

f_Z(z) = \int_x f_X(x)\, f_Y(z - x)\, dx = \int_y f_X(z - y)\, f_Y(y)\, dy \quad \text{if continuous.}

Using generating functions

The above results can be derived most easily using moment generating functions (or probability generating functions for the discrete cases) using the result that if Z = X_1 + ... + X_n and the X_i are independent then M_Z(t) = \prod M_{X_i}(t). Of course we must be able to recognise the mgf of Z to identify the distribution.
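
Result 1 above (the Poisson sum) is easy to check by simulation; this sketch uses arbitrary rates and compares an empirical probability for Z = X₁ + X₂ with the Poisson(λ₁ + λ₂) pmf.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
lam1, lam2 = 1.2, 2.3                       # illustrative parameters

z = rng.poisson(lam1, 100_000) + rng.poisson(lam2, 100_000)   # Z = X1 + X2

# result 1: Z should follow a Poisson distribution with parameter λ1 + λ2
observed = np.mean(z == 3)
expected = stats.poisson(lam1 + lam2).pmf(3)
print(observed, expected)                   # both ≈ 0.216
```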


7.5 Distributions derived from the Normal distribution

The most frequently used techniques in statistics are the t-test and the F-test. These are used to compare means of two or more samples and to make inferences about the population means from which the samples were drawn. The test statistic in each case is not an arbitrary function of the data, but is chosen to have useful properties. In particular the function is chosen so that its distribution is known.

• If X has a standard normal distribution and independently Y has a chi-squared distribution with ν degrees of freedom, then X/\sqrt{Y/\nu} has a t-distribution with ν degrees of freedom.

• If X_1 and X_2 have independent chi-squared distributions with ν_1 and ν_2 degrees of freedom, then (X_1/ν_1)/(X_2/ν_2) has an F-distribution with degrees of freedom ν_1 and ν_2.

Preliminary Results: Distribution of the mean and variance

Consider a random sample X_1, ..., X_n from a normal population with mean µ and variance σ², that is X_i ∼ N(µ, σ²), i = 1, ..., n. If we define X̄ = (1/n) Σ X_i and S² = (1/(n−1)) Σ (X_i − X̄)², then

(a) X̄ ∼ N(µ, σ²/n),

(b) (n − 1)S²/σ² ∼ χ²_{n−1}, and

(c) X̄ and S² are independent.

The t-distribution

Suppose we have a random sample X1, ..., Xn from a normal population, Xi ∼ N(µ, σ²), i = 1, ..., n, with sample mean X̄ and variance S². If σ² is known, then (X̄ − µ)/(σ/√n) ∼ N(0, 1), whereas, if we estimate σ² by S², then (X̄ − µ)/(S/√n) ∼ t_{n−1}, that is a t-distribution with n − 1 degrees of freedom.
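A simulation check of this result is straightforward in R; the sketch below uses the illustrative choices n = 10, µ = 5 and σ = 2 (any values would do) and compares the simulated statistic with the t distribution on 9 degrees of freedom:

set.seed(1)
n <- 10; mu <- 5; sigma <- 2          # illustrative values
tstat <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))  # the statistic (Xbar - mu)/(S/sqrt(n))
})
qqplot(qt(ppoints(500), df = n - 1), tstat)   # points should lie near the line y = x
abline(0, 1)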

The F-distribution

Suppose we have two independent random samples of size n1 and n2 from normal populations N(µ1, σ1²) and N(µ2, σ2²) with sample means and variances X̄1, S1² and X̄2, S2². Imagine we want to test H0 : σ1² = σ2². If H0 is true then S1²/S2² ≈ 1; if H0 is false then either S1²/S2² is large (σ1² > σ2²) or S1²/S2² is close to zero (σ1² < σ2²).

Now X1 = (n1 − 1)S1²/σ1² ∼ χ²_{n1−1} and X2 = (n2 − 1)S2²/σ2² ∼ χ²_{n2−1}, thus

F = [X1/(n1 − 1)] / [X2/(n2 − 1)] = (S1²/σ1²) / (S2²/σ2²) = S1²/S2²   under H0,

and so F ∼ F_{n1−1, n2−1}, an F-distribution with degrees of freedom (n1 − 1) and (n2 − 1).
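The same kind of simulation check works for the F statistic; a minimal R sketch under H0 (equal variances, here both taken to be 1, with illustrative sample sizes n1 = 8 and n2 = 12) is:

set.seed(1)
n1 <- 8; n2 <- 12                                         # illustrative sample sizes
f <- replicate(10000, var(rnorm(n1)) / var(rnorm(n2)))    # S1^2 / S2^2 under H0
quantile(f, c(0.5, 0.95, 0.99))                           # simulated quantiles
qf(c(0.5, 0.95, 0.99), n1 - 1, n2 - 1)                    # F(7, 11) quantiles for comparison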

FURTHER READING: Sections 3.6 and 4.5 of Rice.


Exercises

(6.1) Let Z be a standard normally distributed random variable, then find: (a) Pr(Z < 2), (b) Pr(−1 < Z < 1) and (c) Pr(Z² > 3.8416).

Hint: In (c) find the probability of an equivalent event involving Z and not Z².

(6.2) Suppose that X is a normally distributed random variable with mean µ and standard deviation σ > 0, then it has MGF given by

MX(t) = exp{µt + σ²t²/2}.

Find the MGF of X* = (X − µ)/σ and hence state the distribution of X*.

(6.3) Let X be a normally distributed random variable with mean 10 and variance 25.

(a) Evaluate the probability Pr(X ≤ 8).

(b) Evaluate the probability Pr(15 ≤ X ≤ 20).

(6.4) Let X follow a binomial distribution with parameters n = 50 and p = 0.52.

What are the expectation and variance of X? Hence write down the normal distribution which approximates this binomial distribution. Is this likely to be a good approximation?

Use the normal approximation, with a continuity correction, to evaluate the probability that X is at least 30. Why is the continuity correction needed?

(7.1) If X is a continuous random variable with a uniform distribution on the interval [0, 1], that is with PDF

fX(x) = 1 for 0 < x < 1,

then find the PDF of Y = − log(X)/λ where λ > 0. Name the distribution.

(7.2) Suppose X1, . . . , Xn are independent normal random variables, with corresponding parameters (µ1, σ1²), . . . , (µn, σn²), then, using MGFs, show that Sn = X1 + · · · + Xn also has a normal distribution. What are the parameters of this new distribution?

Suppose now that the random variables are also identically distributed, that is with common mean µ and variance σ². What can be said about the distribution of X̄ = Sn/n?


8 Bayesian Methods

8.1 Introduction

The Bayesian approach to statistics is currently very fashionable and respectable, but this has not always been the case! Until, perhaps, 20 years ago Bayesian statisticians were seen as extremist and fanatical. Leading statisticians of the day considered their work unimportant and even "dangerous". The main reason for this lack of trust is the subjective nature of some of the modelling. The key difference, compared to classical statistics, is the use of subjective knowledge in addition to the usual information from data.

Suppose we are interested in a parameter θ. In the standard setting, we would perform an experiment and use the data to estimate θ. But in practice we might have some knowledge about θ before doing the experiment and want to incorporate this prior degree of belief about θ into the estimation process.

Let π(θ) be our prior density function for θ quantifying our prior degree of belief. From the data we can calculate the likelihood, L(X|θ). These two sources of information can be combined to give π(θ|X), the posterior distribution of θ, reflecting our belief about θ after the experiment.

Recall Bayes Theorem defined in terms of probabilities of events (A and B say),

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B).

The appropriate form of this for our situation is

π(θ|x) = L(x|θ)π(θ)/p(x).

Note however that the divisor is unimportant when making inference about θ and so we can simply say

π(θ|x) ∝ L(x|θ)π(θ),

that is "Posterior pdf is proportional to Likelihood times prior pdf".

The Bayesian method gives a way to include extra information into the problem and can make logical interpretation easier.

Although the approach is straightforward, there can be serious (algebraic) difficulties in deriving the posterior distribution. Also, there are many possible choices for the prior distribution, with the chance that the final conclusion might depend on this subjective choice of prior. One approach to the choice of prior is to use a non-informative prior (such as the uniform in the following example) which does not have an influence on the modelling, or a vague prior where the influence is mild. To make deriving the posterior distribution easier, and to give a standard approach to the choice of prior, it is common to use a conjugate prior. That is, given the likelihood, the prior is chosen so that the prior and posterior distributions are in the same family.


8.2 Conjugate prior distributions

To be able to progress much further we must first consider two new examples of continuous distributions, the beta distribution and the gamma distribution. These are particularly important as they are conjugate prior distributions for several widely used likelihood models. For example the beta is the conjugate prior for all the distributions based on the Bernoulli, that is the geometric and binomial. The gamma is the conjugate prior distribution for the Poisson and the exponential. However, for the most widely encountered data model, the normal distribution, it is the normal distribution itself which is the conjugate prior (for the mean).

Beta distribution, β(p, q)

f(x) = x^{p−1}(1 − x)^{q−1} / B(p, q),   0 ≤ x ≤ 1;  p, q > 0,

where B(p, q) = ∫₀¹ x^{p−1}(1 − x)^{q−1} dx.

Note that B(p, q) = Γ(p)Γ(q)/Γ(p + q), and that E[X] = p/(p + q) and Var[X] = pq/[(p + q)²(p + q + 1)].

As a special case, when p = q = 1, this reduces to the continuous uniform distribution on the interval (0, 1).

Gamma distribution, γ(α, λ)

f(x) = λ^α x^{α−1} e^{−λx} / Γ(α),   x ≥ 0;  α, λ > 0,

where Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx.

Note that Γ(α + 1) = αΓ(α) for all α > 0, hence Γ(α + 1) = α! for integers α ≥ 1, and that Γ(1/2) = √π. Also, E[X] = α/λ and Var[X] = α/λ².

As important special cases we have (a) when α = 1 this reduces to the exponential distribution with parameter λ, and (b) when α = ν/2 and λ = 1/2 it becomes the chi-squared distribution with ν degrees of freedom, χ²_ν.


Example: Coin tossing: Let θ be the probability of getting a head with a biased coin. In n tosses of the coin we observe X = x heads, then

p(x|θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n.

Now suppose we only know that θ is on the probability scale, and so we have a uniform prior,

π(θ) = 1,   0 < θ < 1.

Now the posterior is proportional to likelihood times prior,

π(θ|x) ∝ p(x|θ)π(θ) = (n choose x) θ^x (1 − θ)^{n−x} × 1 ∝ θ^x (1 − θ)^{n−x}.

Notice that this is the form of a Beta distribution, that is it depends on the variable, θ, in the correct way. Hence the posterior distribution is Beta and we can identify the parameters as p = x + 1 and q = n − x + 1, that is θ|x ∼ β(x + 1, n − x + 1). We can now write down the pdf

π(θ|x) = θ^x (1 − θ)^{n−x} / B(x + 1, n − x + 1).
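For illustrative data, say n = 12 tosses giving x = 9 heads (values chosen here only for the plot, not from the notes), the flat prior and the Beta(x + 1, n − x + 1) posterior can be drawn in R:

n <- 12; x <- 9                                        # illustrative data
curve(dbeta(p, x + 1, n - x + 1), from = 0, to = 1,
      xname = "p", ylab = "density")                   # posterior Beta(10, 4)
abline(h = 1, lty = 2)                                 # uniform prior density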

8.3 Point and interval estimation

In classical statistics we have been interested in estimating a parameter θ. This can also be done in Bayesian statistics. Recall that the posterior distribution contains all the information about θ, hence we base all our estimation on the posterior pdf.

Natural estimators of θ are: the Posterior Mean or Bayes Estimator, that is E[θ|X = x], and the Posterior Mode or Maximum a Posteriori (MAP) Estimator. The MAP estimator is the most likely value of θ and is the analogue of the maximum likelihood estimator.

To reflect the precision in this estimation we can construct a credibility interval (the equivalent of the classical confidence interval). A 100(1 − α)% credibility interval for θ can be found using the probability statement

Pr(θL ≤ θ ≤ θU) = 1 − α.

This can be interpreted as: the probability of θ being inside the interval is 1 − α (this is much more intuitive than the interpretation of the classical confidence interval).

On its own this does not give a unique definition of the interval and so we can introduce the extra condition that

Pr(θ ≤ θL) = Pr(θ ≥ θU) = α/2;

this is called the equal-tailed interval.

Example: Coin tossing (Continued): Since the posterior pdf is a Beta distribution we already know equations for the two point estimators: the mean is (x + 1)/(n + 2) and the mode is x/n (which is the same as the MLE).
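Continuing with the same illustrative data as in the earlier sketch (n = 12, x = 9), the point estimates and a 95% equal-tailed credibility interval come straight from the Beta(x + 1, n − x + 1) posterior in R:

n <- 12; x <- 9                                   # illustrative data, as above
(x + 1) / (n + 2)                                 # posterior mean
x / n                                             # MAP estimate (= MLE)
qbeta(c(0.025, 0.975), x + 1, n - x + 1)          # 95% equal-tailed credibility interval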


Exercises

(8.1) Suppose that we have a single observation x from a Poisson distribution with parameter θ. Derive the posterior distribution, π(θ|x), when the prior distribution of θ is a Gamma(a, b) distribution. Show that this is a conjugate prior.

Write down the posterior mean E[θ|X = x] and the maximum a posteriori (MAP) estimator argmax_θ π(θ|X = x).

For observation x = 3, and prior parameters a = 2 and b = 0.7, what is the corresponding posterior distribution? Draw a graph of the prior and posterior distributions and comment. Find the posterior mean and the MAP estimate.

(8.2) The number of defective items, X, in a random sample of n has a Binomial distribution where the probability that a single item is defective is θ (0 < θ < 1). If the prior distribution of θ is the Beta distribution with parameters α and β, obtain the posterior distribution of θ given X = x. Determine the posterior mean E[θ|X = x].

In a particular case it is found that: n = 25, x = 8 and the prior belief about θ can be summarised by a distribution with prior mean 1/2 and prior standard deviation 1/4.

Determine the posterior mean µ = E[θ|X = x] and obtain the posterior standard deviation (which gives an indication of the precision of the posterior mean).

(8.3) Suppose that we have a single observation x from an exponential distribution with parameter θ. Consider a Gamma(a, b) distribution as a prior for θ. Derive the posterior distribution, π(θ|x), and show that this is a conjugate prior.

Write down the posterior mean E[θ|X = x] and the maximum a posteriori (MAP) estimator argmax_θ π(θ|X = x).

With data x = 4.8, and prior parameters a = 10 and b = 1.5, what is the corresponding posterior distribution? Find the posterior mean and the MAP estimate. Also calculate the posterior standard deviation. Comment.


Practical Exercises to be Completed Using R

Tomorrow, you will meet the R statistical programming environment. Once you are familiar with R, you might like to try out the exercises below.

The following simple exercises will allow you to check some of your early answers, but will also require use of a range of R functions. As well as performing more complicated statistical analyses, R is very useful for performing calculations and for plotting graphs. Over the page is a more complicated example where, although the individual calculations are simple, it would be too time consuming to perform by hand.

1. In Exercise (1.1), evaluate the expected value and variance of X .

Hint: Define vectors for x and the probabilities, take element-wise product then sum.

2. In Exercises (1.2), plot a graph of the probability density function.

Hint: Use the curve command.

3. In Exercises (3.1), evaluate the two probabilities that the student passes the exam.

Hint: Use the pbinom command.

4. In Exercises (5.1), evaluate the fitted frequencies using the estimated value of p.

Hint: Use the dbinom command.

5. In Exercises (5.2), calculate the test statistic and the corresponding p-value.

Hint: Use the pt command, or the t.test command.

6. In Exercises (6.1), calculate the three probability values for the standard normal.

Hint: Use the pnorm command.

7. In Exercises (6.3), calculate the two probability values for the normal random variable with mean 10 and variance 25.

Hint: Use the pnorm command giving the mean and standard deviation.

8. In Exercises (6.4), calculate the exact binomial probability and compare to the previously found approximation.

Hint: Use the pbinom command.

9. In Exercises (7.1), simulate some data from the continuous uniform distribution, transform it as in the exercise, and draw a histogram. Does this look consistent with the exponential?

Hint: Use the runif and hist commands.

10. In Exercises (7.2), simulate two equal-sized samples, each from a different normal distribution, and calculate the element-wise sum. Draw a histogram and evaluate the mean and variance. Are these consistent with the theoretical result?

Hint: Use the rnorm, hist, mean and var commands.


Extended Practical Exercise

Suppose that 100 people are subject to a blood test. However, rather than testing each individual separately (which would require 100 tests), the people are divided into groups of 5 and the blood samples of the people in each group are combined and analysed together. If the combined test is negative, one test will suffice for the whole group; if the test is positive, each of the 5 people in the group will have to be tested separately, so that overall 6 tests will be made. Assume that the probability that an individual tests positive is 0.02 for all people, independently of each other.

In general, let N be the total number of individuals, n be the number in each group (with k = N/n the number of groups), and p the probability that an individual tests positive.

Consider one group of size n, and let Ti represent the number of tests required for the ith group (i = 1, . . . , k). The combined test is negative, and hence one test will be sufficient, with probability

Pr(Ti = 1) = Pr(combined test is negative) = (1 − p)^n,

otherwise it is positive, and n + 1 tests are required, with probability

Pr(Ti = n + 1) = Pr(combined test is positive) = 1 − (1 − p)^n.

Now the expected number of tests for the ith group is

E[Ti] = 1 × Pr(Ti = 1) + (n + 1) × Pr(Ti = n + 1)
      = (1 − p)^n + (n + 1)(1 − (1 − p)^n) = (n + 1) − n(1 − p)^n.

The expected total number of tests, E[T], is then given by the sum of the expected numbers for each group

E[T] = E[T1] + · · · + E[Tk] = k × ((n + 1) − n(1 − p)^n).

For the given values, N = 100, n = 5, p = 0.02, the expectations are E[Ti] = 6 − 5(0.98)^5 ≈ 1.4804, and therefore E[T] = 20 × 1.4804 ≈ 29.6079 ≈ 30. So on average only 30 tests will be required, instead of 100.

But, for the given total number of people, is this the best choice of n, and what happens as p is varied? Use R to repeat the above calculations, then try other values of n (which lead to integer k) to see if n = 5 gives the smallest expected total number of tests. Repeat this process for p = 0.01 and p = 0.5, and comment on the best choice of group size.

If possible, produce a line graph of the expected total number of tests, E[T], against group size, n, with separate lines for different values of p. Also produce a graph of the optimal choice of n against p. Comment on these graphs.
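One possible way to organise the calculation in R is as a function of the group size and probability (the function name eTotal is chosen here purely for illustration):

# expected total number of tests when N people are tested in groups of size n
# and each person independently tests positive with probability p
eTotal <- function(n, p, N = 100) (N / n) * ((n + 1) - n * (1 - p)^n)

ns <- c(2, 4, 5, 10, 20, 25, 50)           # group sizes that divide N = 100
eTotal(ns, p = 0.02)                       # expected total tests for each group size
plot(ns, eTotal(ns, p = 0.02), type = "b",
     xlab = "group size n", ylab = "expected total tests")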


Solutions to Practical Exercises

1. > x=c(-3,-1,0,1,2,3,5,8)
> p=c(0.1,0.2,0.15,0.2,0.1,0.15,0.05,0.05)
> xp = x*p
> x2p = x**2*p
> ex = sum(xp)
> ex2 = sum(x2p)
> ex
> ex2 - ex**2

2. > curve((3/4)*x*(2-x),0,2)

3. > 1-pbinom(3,20,1/5)
> 1-pbinom(3,20,1/3)

4. > x=c(0,1,2,3,4,5)
> f=c(8,40,88,110,56,18)
> p=sum(x*f/320)/5
> dbinom(0:5,5,p)*320

5. > x=c(0.05,0.29,0.39,-0.18,0.11,0.15,0.35,0.28,-0.17,0.07)
> xm = mean(x)
> xsd = sd(x)
> tobs = (xm-0)/(xsd/sqrt(10))
> 2*(1-pt(2.11,9))
>
> t.test(x)

6. > pnorm(2)
> pnorm(1)-pnorm(-1)
> 2*(1-pnorm(1.96))

7. > pnorm(20,10,5)-pnorm(15,10,5)
> pnorm(8,10,5)

8. > 1-pbinom(29,50,.52)

9. > x=runif(100)
> y=-log(x)
> hist(y)
> mean(y)
>
> x=runif(1000)
> y=-log(x)/5
> hist(y)
> mean(y)

10. > x=rnorm(1000)
> hist(x)
> mean(x)
> var(x)
> y=rnorm(1000,10,5)
> mean(y)
> var(y)
>
> z=x+y
> hist(z)
> mean(z)
> var(z)
> sd(z)

Extended Exercise

> N=100
> n=5
> p=0.02
> k=N/n
> eT = k*((n+1)-n*(1-p)**n)
> eT
>
> ns = 1:20
> eT = rep(0,length(ns))
> p=0.02
> for (i in 1:length(ns)) {
+   k=N/ns[i]
+   eT[i] = k*((ns[i]+1)-ns[i]*(1-p)**ns[i])
+ }
> plot(ns,eT)


Solutions to Exercises

(1.1) Clearly all probabilities are between 0 and 1, and they sum to 1. Hence they define a valid probability distribution.

[Note that only checking that the probabilities sum to 1 is not sufficient, as for example both (3/5, 3/5, −1/5) and (1/4, 5/4, −1/2) sum to 1, but violate other conditions.]

For the first two, add the appropriate probabilities to give 0.45 and 0.3 respectively.

To calculate the means and variances we extend the probability table with two extra rows:

x          -3     -1     0      1      2      3      5      8
pX(x)      0.1    0.2    0.15   0.2    0.1    0.15   0.05   0.05
x pX(x)    -0.3   -0.2   0      0.2    0.2    0.45   0.25   0.4
x² pX(x)   0.90   0.20   0      0.2    0.4    1.35   1.25   3.2

Summing the last two rows, we obtain E[X] = Σ x pX(x) = 1 and E[X²] = Σ x² pX(x) = 7.5, and so Var(X) = E[X²] − (E[X])² = 7.5 − 1² = 6.5.

(1.2) First note that for fX(x) = cx(2 − x) to be a valid density we require the p.d.f. to be always non-negative. Clearly, for 0 ≤ x ≤ 2 we require c ≥ 0, and note that for x outside this range fX(x) = 0 by definition. Also using the fact that ∫_{−∞}^{∞} fX(x) dx = 1:

∫_{−∞}^{∞} fX(x) dx = c ∫₀² (2x − x²) dx = c [x² − x³/3]₀² = 4c/3, hence c = 3/4.

Recall the definition of the c.d.f.: FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(y) dy.

[To avoid possible confusion between the variable over which we are integrating and the upper limit of integration, it is usually safest to re-label one of them.]

Note that if x < 0 then FX(x) = 0, and if x > 2 then FX(x) = 1. If 0 ≤ x ≤ 2 then

FX(x) = ∫_{−∞}^{x} fX(y) dy = ∫₀^{x} fX(y) dy = (3/4)(x² − x³/3).

As a result we can write:

FX(x) = 0 for x < 0;  (3/4)(x² − x³/3) for 0 ≤ x ≤ 2;  1 for x > 2.

[Make sure that you define the cumulative distribution function for all real values and include FX(x) = 0 and FX(x) = 1 in the answer.]

Straight from the c.d.f. we have P(X > 1) = 1 − P(X ≤ 1) = 1 − FX(1) = 1 − (3/4)(1 − 1/3) = 1/2.
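These values are easy to confirm numerically in R, for example:

f <- function(x) (3/4) * x * (2 - x)       # the density with c = 3/4
integrate(f, 0, 2)$value                   # equals 1, confirming c = 3/4
integrate(f, 1, 2)$value                   # Pr(X > 1) = 0.5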


(2.1) First find the marginal of Y by summing over the values of x to give:

pY(y) = 4/32 for y = −2;  12/32 for y = −1;  12/32 for y = 0;  4/32 for y = 1.

[Figure: bar plot of the marginal pmf of Y.]

Then the cumulative distribution function, using FY(y) = Pr(Y ≤ y), gives:

FY(y) = 0 for y < −2;  4/32 for −2 ≤ y < −1;  16/32 for −1 ≤ y < 0;  28/32 for 0 ≤ y < 1;  1 for y ≥ 1.

[Figure: step plot of the cdf of Y.]

[Notice that we must define the cdf for all real numbers even though we are only really interested in the central part. Marks would be lost for missing these "extreme" values.]

For the conditional distribution, first find the marginal of X by summing over y to give:

x        1      2      3
pX(x)    8/32   16/32  8/32

Then use pY|X(y|x) = p(x, y)/pX(x) to give:

            y = −2   y = −1   y = 0   y = 1
x = 1       1/8      3/8      3/8     1/8
x = 2       1/8      3/8      3/8     1/8
x = 3       1/8      3/8      3/8     1/8

[Here the p.m.f.s are shown as tables; compare to (a) above, either approach is fine. Also each row of the conditional probabilities table is a probability distribution and so sums to 1.]

Since the conditional distribution of Y given X = x does not depend on x (equivalently, the conditional distribution is equal to the marginal), X and Y are independent.


(2.2) First the marginal of X, by integrating over the variable we do not want, i.e. over y:

fX(x) = ∫₀² (6/7)(x² + xy/2) dy = [(6/7)(x²y + xy²/4)]₀² = (6/7)(2x² + x),   0 ≤ x ≤ 1.

Similarly, fY(y) = (6/7)(1/3 + y/4),   0 ≤ y ≤ 2.

[Figure: plots of the marginal densities fX(x) and fY(y).]

Now the cdfs:

FX(x) = 0 for x < 0;  (6/7)(2x³/3 + x²/2) for 0 ≤ x ≤ 1;  1 for x > 1.

[Similarly FY(y) = (6/7)(y/3 + y²/8), 0 ≤ y ≤ 2.]

[Figure: plots of the cdfs FX(x) and FY(y).]

The expectation,

E[X] = ∫₀¹ x (6/7)(2x² + x) dx = (6/7) ∫₀¹ (2x³ + x²) dx = (6/7)[2x⁴/4 + x³/3]₀¹ = (6/7)(2/4 + 1/3) = 5/7.

E[X(X − Y)] = ∫₀¹ ∫₀² (x² − xy)(6/7)(x² + xy/2) dy dx = (6/7) ∫₀¹ ∫₀² (x⁴ − x³y/2 − x²y²/2) dy dx
            = (6/7) ∫₀¹ [x⁴y − x³y²/4 − x²y³/6]₀² dx = (6/7) ∫₀¹ (2x⁴ − x³ − 4x²/3) dx
            = (6/7)[2x⁵/5 − x⁴/4 − 4x³/9]₀¹ = −53/210.

E[X|Y = 1] = ∫₀¹ x fX|Y(x|y) dx = ∫₀¹ x f(x, y)/fY(y) dx,

so we must first evaluate the marginal density of Y,

fY(y) = ∫₀¹ (6/7)(x² + xy/2) dx = (6/7)[x³/3 + x²y/4]₀¹ = (6/7)(1/3 + y/4),

and fY(1) = 1/2, so

E[X|Y = 1] = ∫₀¹ x (6/7)(x² + x/2)/(1/2) dx = (12/7)[x⁴/4 + x³/6]₀¹ = 5/7.


(2.3) Let D be the event that the tested person has the disease and B the event that his/her test result is positive. Then, according to the information given in the question, we have

Pr(B|D) = 0.8,   Pr(B|Dᶜ) = 0.05,   Pr(D) = 0.004.

Using Bayes formula, we obtain

Pr(D|B) = Pr(D ∩ B)/Pr(B) = Pr(B|D)Pr(D) / [Pr(B|D)Pr(D) + Pr(B|Dᶜ)Pr(Dᶜ)]
        = (0.8 × 0.004) / (0.8 × 0.004 + 0.05 × 0.996) ≈ 0.0604.

Remark. This probability may look surprisingly small. An explanation may be as follows. Since 0.4% of the population actually have the disease, it follows that, on average, 40 persons out of every 10,000 will have it. The test will (on average) successfully reveal the disease in 40 × 0.8 = 32 cases. On the other hand, for the 9,960 healthy persons, the test will state that about 9,960 × 0.05 ≈ 498 of them are 'ill'. Therefore, the test appears to be positive in about 32 + 498 = 530 cases, but the fraction of those who actually have the disease is approximately 32/530 ≈ 0.0604.

(3.1) The exam results can be modelled by Bernoulli trials with probability of success p = 1/5 in part (a) and p = 1/3 in part (b). If X is the number of correct answers, then X has the distribution Bin(n = 20, p) with probabilities

Pr(X = k) = (20 choose k) p^k (1 − p)^{20−k},   k = 0, . . . , 20.

Noting that 20% of 20 is 4, the probability of passing the exam is given by

Pr(X ≥ 4) = 1 − Pr(X < 4) = 1 − Σ_{k=0}^{3} Pr(X = k).

The results are shown in the table:

p     Pr(X = 0)   Pr(X = 1)   Pr(X = 2)   Pr(X = 3)   Pr(X ≥ 4)
1/5   0.0115      0.0576      0.1369      0.2054      0.5886
1/3   0.0003      0.0030      0.0143      0.0429      0.9396

giving the answers: (a) 0.5886, (b) 0.9396.

(3.2) (a) For the exponential the c.d.f. is

Pr(X ≤ x) = FX(x) = 1 − e^{−λx},   x ≥ 0,

and so Pr(X > x) = 1 − Pr(X ≤ x) = 1 − (1 − e^{−λx}) = e^{−λx}.

Hence, with λ = 2,

Pr(X > 1/2) = e^{−2×1/2} = e^{−1} = 0.3679 (4 s.f.)


(b) We require x such that FX(x) = 1/2 (note that this is the median), that is x such that 1 − e^{−2x} = 1/2, hence 1/2 = e^{−2x} and −log 2 = −2x, so x = (1/2) log 2 = 0.3466 (4 s.f.)

(c) From the definition of conditional probability:

Pr(X > 1 | X > 1/2) = Pr(X > 1 ∩ X > 1/2) / Pr(X > 1/2),

but note that (X > 1) ⊂ (X > 1/2) and so (X > 1) ∩ (X > 1/2) = (X > 1). Hence we require

Pr(X > 1) / Pr(X > 1/2) = e^{−2×1} / e^{−2×1/2} = e^{−2}/e^{−1} = e^{−1} = 0.3679 (4 s.f.)

(3.3) The moment generating function of the Poisson distribution is found as follows.

MX(t) = E[e^{tX}] = Σ e^{tx} λ^x e^{−λ}/x! = e^{−λ} Σ (λe^t)^x/x! = e^{λ(e^t−1)} Σ (λe^t)^x e^{−λe^t}/x! = e^{λ(e^t−1)},

where the last sum is over the Po(λe^t) probabilities (check this) and so is equal to 1.

[Note that, as with the derivation of the binomial m.g.f., here we could directly use the series expansion of the exponential.]

Here differentiating and setting t = 0 gives:

dMX(t)/dt = λe^t e^{λ(e^t−1)},   d²MX(t)/dt² = λe^t e^{λ(e^t−1)} + (λe^t)² e^{λ(e^t−1)},

and so E[X] = λ, E[X²] = λ + λ². Hence Var(X) = λ.

The moment generating function of the Poisson random variable Xi is

MXi(t) = e^{λi(e^t−1)}

and so, if Sn = X1 + ... + Xn, then the moment generating function of Sn is

MSn(t) = ∏_{i=1}^{n} MXi(t) = ∏_{i=1}^{n} e^{λi(e^t−1)} = e^{(Σλi)(e^t−1)},

which is the moment generating function of a Poisson random variable with parameter Σλi, that is Sn ∼ Po(Σλi). And hence the mean and variance of Sn are both equal to Σλi.

[In the last two questions, we see the power of the moment generating function. We are producing important results without too much difficulty.]


(4.1) The coding centres the x-values, meaning that Σ xi = 0, and hence x̄ = 0.

The remaining calculations are as follows.

Sugar, y   Coded temp., x   xi²    xi yi
8.1        -0.5             0.25   -4.05
7.8        -0.4             0.16   -3.12
8.5        -0.3             0.09   -2.55
9.8        -0.2             0.04   -1.96
9.5        -0.1             0.01   -0.95
8.9         0.0             0.00    0.00
8.6         0.1             0.01    0.86
10.2        0.2             0.04    2.04
9.3         0.3             0.09    2.79
9.2         0.4             0.16    3.68
10.5        0.5             0.25    5.25
Totals: 100.4       0.0             1.10    1.99

[Figure: scatterplot of sugar remaining against coded temperature, with fitted equation.]

Since ȳ = 9.13 and Sxy/Sxx = Σ xiyi / Σ xi² = 1.81 (to 2 dp) we get the regression equation

sugar = 9.13 + 1.81 × coded temp

where "sugar" is the sugar remaining after fermentation and "coded temp" is the fermentation temperature minus 21.5 degrees centigrade.

Alternatively, we could give the regression equation as

sugar = −29.77 + 1.81 × temp

where "temp" is in degrees centigrade. Either form is correct, but you need to be clear as to which form you have used.
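The same fit is obtained directly with the lm function in R, entering the coded temperatures from the table above:

sugar <- c(8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5)
ctemp <- seq(-0.5, 0.5, by = 0.1)          # coded temperature (temp - 21.5)
coef(lm(sugar ~ ctemp))                    # intercept 9.13, slope 1.81 (to 2 dp)
plot(ctemp, sugar); abline(lm(sugar ~ ctemp))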

(4.2) Re-arranging the model gives us εi = yi − α − β1xi − β2wi (i = 1, . . . , n). Hence the sum of squared errors is

S = Σi εi² = Σi (yi − α − β1xi − β2wi)².

Differentiating S w.r.t. α, we get

∂S/∂α = −2 Σi (yi − α − β1xi − β2wi) = −2(nȳ − nα)

since Σi xi = Σi wi = 0 as these variables are already centred. Setting this differential to zero when α = α̂, we get α̂ = ȳ (as in the one-predictor case).


To find β̂1, we set ∂S/∂β1 to zero when β1 = β̂1:

0 = Σi xi(yi − α̂ − β̂1xi − β̂2wi) = Sxy − α̂ Σi xi − β̂1 Sxx − β̂2 Swx

⇒ Sxy = β̂1 Sxx + β̂2 Swx

⇒ β̂1 = (Sxy − β̂2 Swx)/Sxx.

This leaves us with one equation in two unknowns. To get round this, we substitute β̂2 for β2. Hence we need to find β̂2 by differentiating S w.r.t. β2 to get β̂2 = (Swy − β̂1 Swx)/Sww. With this substitution, we get

Sxx β̂1 = Sxy − (Swx/Sww)(Swy − β̂1 Swx) = Sxy − SwxSwy/Sww + β̂1 Swx²/Sww

⇒ β̂1 = (Sxx − Swx²/Sww)⁻¹ (Sxy − SwxSwy/Sww) = (1 − Swx²/(SwwSxx))⁻¹ (Sxy/Sxx − SwxSwy/(SwwSxx)),

as required.

(5.1) Each child is male or female with some fixed (but unknown) probability, and we are considering families of five children, so a suitable model is Binomial, X ∼ B(m = 5, p). We also need to assume independence of children within a family.

To estimate the probability, p, use p̂ = x̄/m = 0.5375. The corresponding fitted frequencies are: 6.8, 39.4, 91.5, 106.3, 61.8, 14.4; these are pretty close to the observed frequencies.

(5.2) Let X be the daily log exchange rate, and we are told that X ∼ N(µ, σ²) is an acceptable model. To estimate the unknown parameters we use µ̂ = x̄ = 0.134 and σ̂ = s_{n−1} = 0.2002. To test the given hypothesis we use the t-test (as the population variance is unknown) with test statistic tobs = (x̄ − µ0)/(s/√n) = (0.134 − 0)/(0.2002/√10) = 2.1. For a 5% test, the critical value is tcrit such that Pr(T_{n−1} > tcrit) = 0.025 where T follows a t-distribution with n − 1 = 9 degrees of freedom. From the tables, Pr(T9 > 2.262) = 0.025. In our case tobs is not greater than tcrit and hence there is not sufficient evidence to reject the null hypothesis. (From R the p-value is 0.06342, hence the same conclusion.)

(6.1) From the statistical tables: (a) 0.9772, (b) Pr(−1 < Z < 1) = Pr(Z < 1) − Pr(Z < −1) = Pr(Z < 1) − (1 − Pr(Z < 1)) = 2 × Pr(Z < 1) − 1 = 2 × 0.8413 − 1 = 0.6826, and (c) Pr(Z² > 3.8416) ≡ Pr(|Z| > 1.96) = 1 − Pr(−1.96 < Z < 1.96) = 2 × (1 − Pr(Z < 1.96)) ≈ 2 × (1 − Pr(Z < 1.95)) = 2 × (1 − 0.9744) = 0.0512. (Note that retaining 1.96 the answer is 0.05.)


(6.2) With the given MGF, MX(t) = exp{µt + σ²t²/2}, and using Result 4 in Section 3.2 with a = 1/σ and b = −µ/σ, the MGF of X* is

MX*(t) = exp{−µt/σ} MX(t/σ) = exp{−µt/σ} × exp{µt/σ + σ²(t/σ)²/2} = exp{−µt/σ + µt/σ + t²/2} = exp{t²/2},

which is of the same form as the original MGF but with mean zero and unit variance, hence X* ∼ N(0, 1) by the uniqueness of MGFs.

(6.3) Let X ∼ N(µ = 10, σ² = 25) and Z ∼ N(0, 1).

To evaluate the probabilities stated in (a) and (b), we must first standardise X so we can refer to the standard normal table. From the lecture notes, we know that X = σZ + µ. So:

Pr(X ≤ x) = Pr(σZ + µ ≤ x) = Pr(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ),

where Φ(z) = FZ(z) = Pr(Z ≤ z).

(a) To evaluate Pr(X ≤ 8), we first must standardise:

Pr(X ≤ 8) = Pr(Z ≤ (8 − µ)/σ) = Φ((8 − 10)/5) = Φ(−0.4).

As Φ(−z) = 1 − Φ(z),

Φ(−0.4) = 1 − Φ(0.4) = 1 − 0.6554 = 0.3446.

(b) We can rewrite Pr(15 ≤ X ≤ 20) in terms of the following cumulative probabilities:

Pr(15 ≤ X ≤ 20) = Pr(X ≤ 20) − Pr(X ≤ 15).

Note: When considering continuous random variables, Pr(X ≤ x) = Pr(X < x).

The next step is to standardise:

Pr(X ≤ 20) = Pr(Z ≤ (20 − µ)/σ) = Φ((20 − 10)/5) = Φ(2).

Pr(X ≤ 15) = Pr(Z ≤ (15 − µ)/σ) = Φ((15 − 10)/5) = Φ(1).

From the normal tables, we find that Φ(2) = 0.9772 and Φ(1) = 0.8413. Hence:

Pr(15 ≤ X ≤ 20) = 0.9772 − 0.8413 = 0.1359.

(6.4) We are told that X ∼ Bin(n = 50, p = 0.52) and the normal approximation of this distribution is N(µ = np, σ² = np(1 − p)), so X is approximated by Y where Y ∼ N(µ = 26, σ² = 12.48).


As we are approximating a discrete distribution with a continuous distribution, we must apply the continuity correction so that Pr(X ≥ 30) = Pr(Y > 29.5).

As in the previous question, we must standardise so we can use the normal tables. Again let Z ∼ N(0, 1), consider the symmetric property Φ(−z) = 1 − Φ(z) and note that Pr(Y ≤ y) = Pr(Y < y) when we consider continuous distributions. Then

Pr(Y ≤ 29.5) = Pr(Z ≤ (29.5 − µ)/σ) = Φ((29.5 − 26)/√12.48) = Φ(0.9907 . . .).

Using interpolation (as described beside the normal table):

Φ(0.9907 . . .) ≈ 0.8289 + ((0.9907 . . . − 0.95)/(1.00 − 0.95)) × (0.8413 − 0.8289) = 0.8390.

Therefore Pr(Y > 29.5) = 1 − 0.8390 = 0.161.

(7.1) Here X ∼ U(0, 1), with y = − log(x)/λ (a monotonic transformation), so x = e^{−λy} and |dx/dy| = |−λ e^{−λy}| = λ e^{−λy}; hence fY(y) = λ e^{−λy}, y ≥ 0, therefore Y ∼ exp(λ), that is Y has an exponential distribution with parameter λ.

(7.2) Start with the moment generating function of the normal random variable Xi, MXi(t) = exp{µit + σi²t²/2}. The moment generating function of Sn = X1 + ... + Xn is then

MSn(t) = ∏_{i=1}^{n} MXi(t) = ∏_{i=1}^{n} exp{µit + σi²t²/2} = exp{(Σµi)t + (Σσi²)t²/2}.

This is the moment generating function of a normal random variable with mean Σµi and variance Σσi², hence Sn ∼ N(Σµi, Σσi²).

If now the random variables have equal mean and variance then this result becomes Sn ∼ N(nµ, nσ²). Then, again using Result 4 in Section 3.2 with a = 1/n and b = 0, we have

MX̄(t) = exp{0 × t} × MSn(t/n) = exp{nµ(t/n) + nσ²(t/n)²/2} = exp{µt + (σ²/n)t²/2},

which is the MGF of a normal random variable with mean µ and variance σ²/n, hence X̄ ∼ N(µ, σ²/n).

(8.1) With a single observation x from a Poisson distribution, l(θ) = f(x|θ) = θ^x e^{−θ}/x!, and the prior distribution of θ is Gamma(a, b), π(θ) = b^a θ^{a−1} e^{−bθ}/Γ(a), θ > 0.

Therefore, the posterior distribution of θ|x is

π(θ|x) = f(x|θ)π(θ)/f(x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ.

Substituting gives

π(θ|x) = [b^a θ^{a−1} e^{−bθ} θ^x e^{−θ} / (Γ(a) x!)] / ∫ [b^a θ^{a−1} e^{−bθ} θ^x e^{−θ} / (Γ(a) x!)] dθ = θ^{x+a−1} e^{−(b+1)θ} / ∫ θ^{x+a−1} e^{−(b+1)θ} dθ.

Note that the denominator (and numerator) are almost Gamma distributions (only the normalising constants are missing), with parameters a + x and b + 1. Adding the appropriate constants gives

π(θ|x) = [(b + 1)^{x+a} θ^{x+a−1} e^{−(b+1)θ} / Γ(x + a)] / ∫ [(b + 1)^{x+a} θ^{x+a−1} e^{−(b+1)θ} / Γ(x + a)] dθ.

The integral in the denominator is that of a Gamma pdf over its full range and so has value 1. Hence,

π(θ|x) = (b + 1)^{x+a} θ^{x+a−1} e^{−(b+1)θ} / Γ(x + a),

that is, a Gamma(a + x, b + 1) distribution.

As the prior and posterior are both Gamma distributions, we have a conjugate prior here.

For a Gamma(α, λ) distribution the mean is α/λ and the mode is (α − 1)/λ. Substituting α = a + x and λ = b + 1 gives: the posterior mean (a + x)/(b + 1) and MAP estimate (a + x − 1)/(b + 1).

With the values given, the prior is Gamma(2, 0.7) and the posterior is Gamma(5, 1.7). These give estimates 2.9 and 2.4 respectively. The graph shows that the posterior density (dashed line) is more concentrated than the prior (solid line), and that the mean and mode have increased, compared to the prior values, due to the higher data value.

[Figure: prior Gamma(2, 0.7) (solid) and posterior Gamma(5, 1.7) (dashed) densities.]
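A graph of this kind can be produced in R with, for example:

curve(dgamma(x, shape = 2, rate = 0.7), from = 0, to = 10,
      ylab = "density")                                       # prior Gamma(2, 0.7)
curve(dgamma(x, shape = 5, rate = 1.7), add = TRUE, lty = 2)  # posterior Gamma(5, 1.7)
c(mean = 5 / 1.7, MAP = 4 / 1.7)                              # 2.94 and 2.35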

(8.2) For this example the prior is Beta with pdf π(θ) = θ^{α−1}(1 − θ)^{β−1}/B(α, β), 0 < θ < 1, and the data has a Binomial distribution: X|θ ∼ Binomial(n, θ).

Notice that this is almost the same as one of the class examples, and hence here we will take the approach of only looking at the functional form of the posterior, that is ignoring constants.

So the posterior is:

π(θ|x) ∝ f(x|θ)π(θ) ∝ (n choose x) θ^x (1 − θ)^{n−x} θ^{α−1}(1 − θ)^{β−1} ∝ θ^{x+α−1}(1 − θ)^{n−x+β−1}.

Thus, θ|x ∼ Beta(x + α, n − x + β).

The mean of a Beta(α, β) distribution is α/(α + β) and therefore the posterior mean is

E[θ|x] = (x + α)/(n + α + β).

Given n = 25, x = 8, and that the prior has mean 1/2 and standard deviation 1/4:

For Y ∼ Beta(α, β),

E[Y] = α/(α + β)   and   Var[Y] = αβ/[(α + β)²(α + β + 1)].


Therefore,

α/(α + β) = 1/2   and   αβ/[(α + β)²(α + β + 1)] = 1/16.

From the first of these, α = β. Substituting this into the second gives

16α² = 4α²(2α + 1)  ⇒  4 = 2α + 1  (α ≠ 0).

Thus, α = 3/2 = β.

For the above values, the posterior mean is

µ = E[θ|x] = (x + α)/(n + α + β) = (8 + 3/2)/(25 + 3) = 19/56 = 0.3393.

An estimate of the precision of µ is obtained by calculating the standard deviation of θ|x. If the standard deviation is small, then µ is a precise estimate, whereas if the standard deviation is large, then µ is not a precise estimate.

Here,

Var[θ|x] = (x + α)(n − x + β)/[(n + α + β)²(n + α + β + 1)] = (8 + 3/2)(17 + 3/2)/(28 × 28 × 29) = 0.007730.

Thus, the posterior standard deviation is 0.0879.
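As a quick numerical check of these values in R:

a <- 8 + 3/2; b <- (25 - 8) + 3/2             # posterior Beta parameters
a / (a + b)                                   # posterior mean, 0.3393
sqrt(a * b / ((a + b)^2 * (a + b + 1)))       # posterior standard deviation, 0.0879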

(8.3) Here the prior is Gamma, θ ∼ γ(a, b), with pdf

π(θ) = b^a θ^{a−1} e^{−bθ}/Γ(a),   θ > 0; a, b > 0,

and the data has an exponential distribution: X|θ ∼ exp(θ) with pdf

f(x|θ) = θ e^{−θx},   x ≥ 0, θ > 0.

Again we will take the approach of only looking at the functional form of the posterior, that is ignoring constants.

So the posterior is:

π(θ|x) ∝ f(x|θ)π(θ) ∝ θ e^{−θx} b^a θ^{a−1} e^{−bθ}/Γ(a) ∝ θ^{(a+1)−1} e^{−(b+x)θ},

so that

π(θ|x) = (b + x)^{a+1} θ^{(a+1)−1} e^{−(b+x)θ}/Γ(a + 1).

Thus, θ|x ∼ γ(a + 1, b + x). As the prior and posterior are both Gamma distributions, this is a conjugate prior.

Recall that for a γ(α, β) distribution, the mean is α/β, the mode is (α − 1)/β and the variance is α/β².

Therefore, the posterior mean is θ̂ = (a + 1)/(b + x). With the given numbers this is

(10 + 1)/(1.5 + 4.8) = 1.746,

and the MAP estimate is

θ̂ = [(a + 1) − 1]/(b + x) = 10/(1.5 + 4.8) = 1.587.

The posterior standard deviation is √(a + 1)/(b + x) = √11/(1.5 + 4.8) = 0.526, which is large compared to the estimates, so the estimates are not precise.

Standard Distributions

1. A Bernoulli random variable, X, with parameter θ has probability mass function

p(x; θ) = θ^x (1 − θ)^{1−x},   x = 0, 1  (0 < θ < 1),

and mean and variance E[X] = θ and Var[X] = θ(1 − θ).

2. A geometric random variable, X, with parameter θ has probability mass function

p(x; θ) = θ(1 − θ)^{x−1},   x = 1, 2, . . .  (0 < θ < 1),

and mean and variance E[X] = 1/θ and Var[X] = (1 − θ)/θ².

3. A negative binomial random variable, X, with parameters r and θ has probability mass function

p(x; r, θ) = (x−1 choose r−1) θ^r (1 − θ)^{x−r},   x = r, r + 1, . . .  (r > 0 and 0 < θ < 1),

and mean and variance E[X] = r/θ and Var[X] = r(1 − θ)/θ².

4. A binomial random variable, X, with parameters n and θ (where n is a known positive integer) has probability mass function

p(x; n, θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n  (0 < θ < 1),

and mean and variance E[X] = nθ and Var[X] = nθ(1 − θ).

5. A Poisson random variable, X, with parameter θ has probability mass function

p(x; θ) = θ^x e^{−θ}/x!,   x = 0, 1, . . .  (θ > 0),

and mean and variance E[X] = θ and Var[X] = θ.

6. A uniform random variable, X, with parameter θ has probability density function

f(x; θ) = 1/θ,   0 < x < θ  (θ > 0),

and mean and variance E[X] = θ/2 and Var[X] = θ²/12.

7. An exponential random variable, X, with parameter λ has probability density function

f(x; λ) = λ e^{−λx},   x > 0  (λ > 0),

and mean and variance E[X] = 1/λ and Var[X] = 1/λ².

8. A normal random variable, X, with parameters µ and σ² has probability density function

f(x; µ, σ²) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   −∞ < x < ∞  (−∞ < µ < ∞, σ² > 0),

and mean and variance E[X] = µ and Var[X] = σ².


9. A gamma random variable, X, with parameters α and β has probability density function

f(x; α, β) = β^α x^{α−1} e^{−βx}/Γ(α),   x > 0  (α, β > 0),

where Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx, and mean and variance E[X] = α/β and Var[X] = α/β². Note that Γ(α + 1) = αΓ(α) for all α and Γ(α + 1) = α! for integers α ≥ 1. Also Γ(1/2) = √π.

10. A beta random variable, X, with parameters α and β has probability density function

f(x; α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β),   0 < x < 1  (α, β > 0),

where B(α, β) = ∫₀¹ x^{α−1}(1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β), and mean and variance E[X] = α/(α + β) and Var[X] = αβ/[(α + β)²(α + β + 1)].

11. A Pareto random variable, X, with parameters θ and α has probability density function

f(x; θ, α) = αθ^α / x^{α+1},   x > θ  (θ, α > 0),

and mean and variance E[X] = αθ/(α − 1)  (α > 1) and Var[X] = αθ²/[(α − 1)²(α − 2)]  (α > 2).

12. A chi-square random variable, X, with degrees of freedom parameter n (n is a positive integer) has probability density function

f(x; n) = (1/2)^{n/2} x^{n/2−1} e^{−x/2}/Γ(n/2),   x > 0,

and mean and variance E[X] = n and Var[X] = 2n.

13. A Student's t random variable, X, with degrees of freedom parameter n (n is a positive integer) has probability density function

f(x; n) = Γ((n+1)/2) / {√(nπ) Γ(n/2) [1 + x²/n]^{(n+1)/2}},   −∞ < x < ∞,

and mean and variance E[X] = 0 (n > 1) and Var[X] = n/(n − 2) (n > 2).

14. An F random variable, X, with degrees of freedom parameters m and n (m, n are positive integers) has probability density function

f(x; m, n) = (m/n)^{m/2} Γ((m+n)/2) x^{m/2−1} / {Γ(m/2) Γ(n/2) [1 + mx/n]^{(m+n)/2}},   x > 0,

and mean and variance E[X] = n/(n − 2) (n > 2) and Var[X] = 2n²(m + n − 2)/[m(n − 2)²(n − 4)]  (n > 4).


Normal Distribution Function Tables

The first table gives

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy,

which is the area under the standard normal density to the left of x. Φ(x) is the probability that a random variable, normally distributed with zero mean and unit variance, will be less than or equal to x. When x < 0 use Φ(x) = 1 − Φ(−x), as the normal distribution with mean zero is symmetric about zero. For interpolation use the formula

Φ(x) ≈ Φ(x1) + [(x − x1)/(x2 − x1)] (Φ(x2) − Φ(x1)),   x1 < x < x2.

Table 1

x     Φ(x)    x     Φ(x)    x     Φ(x)    x     Φ(x)    x     Φ(x)    x     Φ(x)
0.00  0.5000  0.50  0.6915  1.00  0.8413  1.50  0.9332  2.00  0.9772  2.50  0.9938
0.05  0.5199  0.55  0.7088  1.05  0.8531  1.55  0.9394  2.05  0.9798  2.55  0.9946
0.10  0.5398  0.60  0.7257  1.10  0.8643  1.60  0.9452  2.10  0.9821  2.60  0.9953
0.15  0.5596  0.65  0.7422  1.15  0.8749  1.65  0.9505  2.15  0.9842  2.65  0.9960
0.20  0.5793  0.70  0.7580  1.20  0.8849  1.70  0.9554  2.20  0.9861  2.70  0.9965
0.25  0.5987  0.75  0.7734  1.25  0.8944  1.75  0.9599  2.25  0.9878  2.75  0.9970
0.30  0.6179  0.80  0.7881  1.30  0.9032  1.80  0.9641  2.30  0.9893  2.80  0.9974
0.35  0.6368  0.85  0.8023  1.35  0.9115  1.85  0.9678  2.35  0.9906  2.85  0.9978
0.40  0.6554  0.90  0.8159  1.40  0.9192  1.90  0.9713  2.40  0.9918  2.90  0.9981
0.45  0.6736  0.95  0.8289  1.45  0.9265  1.95  0.9744  2.45  0.9929  2.95  0.9984
0.50  0.6915  1.00  0.8413  1.50  0.9332  2.00  0.9772  2.50  0.9938  3.00  0.9987

The inverse function Φ⁻¹(p) is tabulated below for various values of p.

Table 2

p        0.900    0.950    0.975    0.990    0.995    0.999    0.9995
Φ⁻¹(p)   1.2816   1.6449   1.9600   2.3263   2.5758   3.0902   3.2905
