Lecture Notes
on
Probability and Statistics
Joe O hOgain
E-mail: [email protected]
Main Text: Kreyszig; Advanced Engineering Mathematics
Other Texts: Schaum Series, Robert B. Ash, Hayter
Online Notes: Hamilton.ie EE304, Prof. Friedman Lectures
Probability Function
Definition: An experiment is an operation with well-defined
outcomes.
Definition: The sample space S of an experiment is the set of
all possible outcomes.
Examples: Toss a coin: S = {H, T}.
Throw a die: S = {1, 2, 3, 4, 5, 6}.
Toss a coin twice: S = {HH, HT, TH, TT}.
Toss a coin until H appears and count the number of times it
is tossed: S = {1, 2, 3, ..., ∞}, where ∞ means that H never
appears.
Definition: Any subset of S is called an event.
Let P(S) be the set of all events in S i.e. the collection of all
subsets of S.
Definition: A probability function on S is a function
P : P(S) → [0, 1] such that
(i) P(S) = 1
(ii) P(A1 ∪ A2 ∪ ... ∪ An ∪ ...) = P(A1) + P(A2) + ... + P(An) + ...,
where A1, A2, ..., An, ... are mutually exclusive i.e. Ai ∩ Aj = ∅
for all i ≠ j.
Theorem 1: P(∅) = 0.
Proof: For any A ⊆ S, A ∪ ∅ = A and A ∩ ∅ = ∅, so P(A) =
P(A ∪ ∅) = P(A) + P(∅); hence P(∅) = 0.
Theorem 2: P(Ac) = 1 − P(A).
Proof: A ∪ Ac = S and A ∩ Ac = ∅, so 1 = P(S) = P(A ∪ Ac) =
P(A) + P(Ac).
Theorem 3: P(A) ≤ 1, for all A ⊆ S.
Proof: From Theorem 2 we get P(A) = 1 − P(Ac) ≤ 1.
Theorem 4: A ⊆ B ⟹ P(A) ≤ P(B).
Proof: B = A ∪ (B − A), a disjoint union, so P(B) = P(A) + P(B − A),
and P(B − A) ≥ 0.
Finite Sample Space
An event containing one element is a singleton. If S contains
n elements x1, x2, ..., xn say, and each one has the same probability
p of occurring, then 1 = P(S) = P({x1, x2, ..., xn}) =
P({x1}) + P({x2}) + ... + P({xn}) = p + p + ... + p = np, so p = 1/n.
Then, for any A ⊆ S we have P(A) = |A|/|S|, where |A| means
the number of elements of A.
Conversely, if we define P on S by this formula, then it is easy
to check that P gives a probability function on S.
Example: A card is selected from a pack of 52 cards. Let
A = {hearts}, B = {face cards}. Then P(A) = 13/52, P(B) =
12/52, P(A ∩ B) = 3/52 and P(A ∪ B) = 13/52 + 12/52 − 3/52 = 22/52,
using the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
(or just count them).
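These counting arguments are easy to check by brute force; a minimal Python sketch (with made-up rank and suit labels) enumerates the pack and recomputes P(A), P(B), P(A ∩ B) and P(A ∪ B).

```python
from fractions import Fraction

suits = ["hearts", "diamonds", "clubs", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = {(r, s) for s in suits for r in ranks}        # the 52-card sample space

A = {c for c in deck if c[1] == "hearts"}            # event A: hearts
B = {c for c in deck if c[0] in ("J", "Q", "K")}     # event B: face cards

def prob(event):
    # P(E) = |E| / |S| on a finite equiprobable sample space
    return Fraction(len(event), len(deck))

print(prob(A), prob(B), prob(A & B), prob(A | B))    # 1/4 3/13 3/52 11/26
```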
In general singletons need not be equiprobable. Then let P(xi) =
pi for 1 ≤ i ≤ n. (We write P(xi) for P({xi}) for convenience.)
We have P(A) = Σ_{xi ∈ A} P(xi) and 1 = P(S) = Σ_{i=1}^n P(xi).
We can form a table, called a probability distribution table,

x      x1   x2   ...   xn
P(x)   p1   p2   ...   pn

where pi = P(xi) for all 1 ≤ i ≤ n. Again, going backwards,
the table defines a probability function on S.
Example: Three horses A, B, C race against each other. A is
twice as likely to win as B and B is twice as likely to win as C.
Assuming no dead heats, find P(A), P(B), P(C) and P(B ∪ C).
Let P(C) = p. Then P(B) = 2p, P(A) = 4p. Hence p + 2p +
4p = 1, so 7p = 1 or p = 1/7. Then P(C) = 1/7, P(B) = 2/7 and
P(A) = 4/7. Also P(B ∪ C) = P(B) + P(C) = 3/7.
Countably Infinite Sample Space
In this case S = {x1, x2, ..., xn, ...} with P(xi) = pi for all
i ≥ 1. Then 1 = Σ_{i=1}^∞ pi.
Example: Toss a coin until H appears and count the number
of times it is tossed: S = {1, 2, 3, ..., ∞}, where ∞ means
that H never appears. Set P(1) = 1/2, P(2) = 1/4, ..., P(n) = 1/2^n
for all n. (P(∞) = 0.)
Let A = {1, 2, 3}. Then P(A) = 1/2 + 1/4 + 1/8 = 7/8 = the probability
of H in the first 3 throws.
If B = {2, 4, 6, ...}, then the probability of H on an even throw
is P(B) = 1/2^2 + 1/2^4 + 1/2^6 + ... = (1/2^2)/(1 − 1/2^2) = 1/3.
Note that P(S) = 1/2 + 1/2^2 + 1/2^3 + ... = (1/2)/(1 − 1/2) = 1 and the
probability of H on an odd throw is 2/3.
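These geometric-series sums can be verified numerically; a minimal sketch that truncates the series at a large index:

```python
# P(n) = (1/2)**n : probability that the first head appears on toss n.
p = lambda n: 0.5 ** n

first_three = sum(p(n) for n in (1, 2, 3))       # P(A) = 7/8
even = sum(p(n) for n in range(2, 200, 2))       # head first appears on an even toss
odd = sum(p(n) for n in range(1, 200, 2))        # head first appears on an odd toss
print(first_three, even, odd)                    # 0.875, ~0.3333, ~0.6667
```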
Conditional Probability
Let A, E be events in S with P(E) ≠ 0.
Definition: The conditional probability of A given E is defined
by P(A|E) = P(A ∩ E)/P(E).
Example: If S is a finite equiprobable space, then P(A|E) =
P(A ∩ E)/P(E) = (|A ∩ E|/|S|)/(|E|/|S|) = |A ∩ E|/|E|.
Example: Pair of dice. S = {(1, 1), (1, 2), ..., (6, 6)}. Find
the probability that one die shows 2 given that the sum is
6. Let A be the event that one die is 2 and E be the event
that the sum is 6. Then P(A|E) = 2/5.
Note that, in general, P (A ∩ E) = P (E)P (A|E).
Example: Let A, B be events with P(A) = 0.6, P(B) = 0.3
and P(A ∩ B) = 0.2. Then
P(A|B) = 0.2/0.3, P(B|A) = 0.2/0.6, P(A ∪ B) = 0.6 + 0.3 − 0.2 = 0.7,
P(Ac) = 1 − 0.6 = 0.4, P(Bc) = 1 − 0.3 = 0.7, P(Ac ∩ Bc) =
P((A ∪ B)c) = 1 − 0.7 = 0.3, P(Ac|Bc) = 0.3/0.7, P(Bc|Ac) = 0.3/0.4.
Example: The probability that a certain flight departs on
time is 0.8 and the probability that it arrives on time is 0.9. The
probability that it both departs and arrives on time is 0.78.
Find the probability that
(i) it arrives on time given that it departed on time,
(ii) it does not arrive on time given that it did not depart on
time.
The sample space may be taken as {(D, A), (D, Ac), (Dc, A), (Dc, Ac)}.
Let B = {(D, A), (D, Ac)} and let C = {(D, A), (Dc, A)}.
Then P(B) = 0.8, P(C) = 0.9 and
(i) P(C|B) = 0.78/0.8,
(ii) P(Cc|Bc) = P(Cc ∩ Bc)/P(Bc) = P((C ∪ B)c)/P(Bc) =
(1 − P(C ∪ B))/(1 − P(B)) = (1 − (0.8 + 0.9 − 0.78))/(1 − 0.8).
Suppose that S = A1 ∪ A2 ∪ ... ∪ An, where Ai ∩ Aj = ∅ for all
i ≠ j. We say that the Ai are mutually exclusive and form a
partition of S. Let E ⊆ S. Then E = E ∩ S = E ∩ (A1 ∪ A2 ∪
... ∪ An) = (E ∩ A1) ∪ (E ∩ A2) ∪ ... ∪ (E ∩ An), a disjoint union, so
P(E) = P(E ∩ A1) + P(E ∩ A2) + ... + P(E ∩ An) = Σ_{i=1}^n P(E ∩ Ai).
Now P(E ∩ Ai) = P(Ai)P(E|Ai), for each i, so
P(E) = Σ_{i=1}^n P(Ai)P(E|Ai). This is called the Law of Total
Probability. We also have
P(Aj|E) = P(Aj ∩ E)/P(E) = P(E ∩ Aj)/P(E) = P(Aj)P(E|Aj)/P(E) =
P(Aj)P(E|Aj) / Σ_{i=1}^n P(Ai)P(E|Ai) for all j. This is known as Bayes'
Formula or Theorem. We use it if we know all the P(E|Ai).
Example: Three machines X, Y, Z produce items. X produces
50%, 3% of which are defective. Y produces 30%, 4%
of which are defective and Z produces 20%, 5% of which are
defective. Let D be the event that an item is defective. Let
an item be chosen at random.
(i) Find the probability that it is defective.
(ii) Given that it is defective, find the probability that it
came from machine X.
Let A1 be the event consisting of elements of X, let A2 be
the event consisting of elements of Y and let A3 be the event
consisting of elements of Z. Then P(A1) = 0.5, P(A2) = 0.3
and P(A3) = 0.2. Also P(D|A1) = 0.03, P(D|A2) = 0.04 and
P(D|A3) = 0.05.
(i) P(D) = (0.5)(0.03) + (0.3)(0.04) + (0.2)(0.05) = 0.037 = 3.7%.
(ii) P(A1|D) = P(A1)P(D|A1)/P(D) = (0.5)(0.03)/0.037 = 0.405 = 40.5%.
We often use a "tree diagram" to organise such calculations.
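The same arithmetic can also be scripted; a minimal sketch of the Law of Total Probability and Bayes' formula applied to the machine data above:

```python
# Prior probabilities of the machines and the defect rates P(D | Ai).
prior = {"X": 0.5, "Y": 0.3, "Z": 0.2}
defect_rate = {"X": 0.03, "Y": 0.04, "Z": 0.05}

# Law of Total Probability: P(D) = sum_i P(Ai) P(D | Ai)
p_defective = sum(prior[m] * defect_rate[m] for m in prior)

# Bayes' formula: P(A1 | D) = P(A1) P(D | A1) / P(D)
p_from_X = prior["X"] * defect_rate["X"] / p_defective
print(p_defective, p_from_X)     # 0.037, ~0.405
```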
Example: A hospital has 300 nurses. During the past year
48 of the nurses got a pay rise. At the beginning of the year
the hospital offered a training seminar which was attended by
138 of the nurses. 27 of the nurses who got a pay rise attended
the seminar. What is the probability that a nurse who got a
pay rise attended the seminar?
Let A be the event consisting of nurses who attended the seminar
and let B be the event consisting of nurses who got a
pay rise. Then P(A) = 138/300 and P(Ac) = 162/300. Also P(B|A) =
27/138, P(Bc|A) = 111/138, P(B|Ac) = 21/162 and P(Bc|Ac) = 141/162.
Therefore P(B) = (138/300)(27/138) + (162/300)(21/162) = 48/300, which is
obvious from the beginning. Also P(A|B) = P(A)P(B|A)/P(B) =
(138/300)(27/138)/(48/300) = 27/48.
Exercise: In a certain city 40% vote Conservative, 35%
vote Liberal and 25% vote Independent. During an election
45% of Conservatives, 40% of Liberals and 60% of Independents
voted. A person is selected at random. Find the probability
that the person voted. If the person voted, find the probability
that the voter is (i) Conservative, (ii) Liberal, (iii) Independent.
Independent Events
Definition: Two events A, B ⊆ S are independent if
P(A ∩ B) = P(A)P(B).
If A and B are independent then P(A|B) = P(A ∩ B)/P(B) =
P(A)P(B)/P(B) = P(A) i.e. the conditional probability of A given B
is the same as the probability of A. The converse is obviously true.
Note: A, B are mutually exclusive if and only if A ∩ B = ∅,
in which case P(A ∩ B) = 0; hence mutually exclusive events A, B
are not independent unless either P(A) = 0 or P(B) = 0.
Example: Pick a card. Let A be the event consisting of
hearts and let B be the event consisting of face cards. Then
P(A) = 13/52, P(B) = 12/52 and P(A ∩ B) = 3/52. Hence P(A ∩ B) =
P(A)P(B) and so A and B are independent events.
Example: Toss a fair coin three times.
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
Let A be the event where the first toss is heads, B be the event
where the second toss is heads and let C be the event where
there are exactly two heads in a row. Then P(A) = 4/8, P(B) =
4/8, P(C) = 2/8, P(A ∩ B) = 2/8, P(A ∩ C) = 1/8, P(B ∩ C) = 2/8.
Hence A, B are independent, A, C are independent but B, C
are not independent.
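The three independence checks can be confirmed by enumerating the eight equally likely triples; a minimal sketch:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))     # the 8 equally likely triples

def P(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] == "H"                    # first toss is heads
B = lambda w: w[1] == "H"                    # second toss is heads
# exactly two heads, and they are adjacent (HHT or THH)
C = lambda w: w.count("H") == 2 and "".join(w).count("HH") == 1

for X, Y, name in [(A, B, "A,B"), (A, C, "A,C"), (B, C, "B,C")]:
    print(name, P(lambda w: X(w) and Y(w)) == P(X) * P(Y))   # True, True, False
```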
Example: The probability that A hits a target is 1/4. The
probability that B hits the target is 2/5. Assume that A and B
are independent. What is the probability that either A or B
hits the target?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) −
P(A)P(B) = 1/4 + 2/5 − 1/4 × 2/5 = 11/20.
Exercise: Show that A,B independent ⇐⇒ A,Bc inde-
pendent and hence Ac, Bc independent.
Product Probability and Independent Trials
Let S = {a1, a2, ..., as} and T = {b1, b2, ..., bt} be the sample
spaces for two experiments. Let PS(ai) = pi and PT(bj) = qj
for all 1 ≤ i ≤ s and 1 ≤ j ≤ t, where PS and PT are the
probability functions on S and T respectively.
Let S × T = {(ai, bj) | ai ∈ S, bj ∈ T}. Define a function P on
P(S × T) by P({(ai, bj)}) = pi qj and addition. Then P is a
probability function on S × T:
(i) pi qj ≥ 0 for all i, j.
(ii) P(S × T) = p1q1 + p1q2 + ... + p1qt + ... + psq1 + psq2 + ... +
psqt = p1(q1 + ... + qt) + ... + ps(q1 + ... + qt) = p1 + ... + ps = 1.
(iii) Obvious by addition.
P above is called the product probability on S × T. It is not
the only probability function on S × T. We can extend this
definition to the product of any finite number of sample spaces.
Suppose that S × T has the product probability P.
Let A = {ai} × T and B = S × {bj}.
Then P(A) = P({ai} × T) = PS(ai) × PT(T) = pi × 1 = pi
and P(B) = P(S × {bj}) = PS(S) × PT(bj) = qj. Now
A ∩ B = {(ai, bj)} so P(A ∩ B) = P({(ai, bj)}) = pi qj =
P(A)P(B) and hence A and B are independent. Similarly,
any two events of the form C × T and S × D are independent,
where C ⊆ S and D ⊆ T.
Conversely, suppose that P is a probability function on S × T
such that P({ai} × T) = pi and P(S × {bj}) = qj for all
ai ∈ S and bj ∈ T and all sets of this form are independent.
Then P({(ai, bj)}) = P(({ai} × T) ∩ (S × {bj})) =
P({ai} × T) × P(S × {bj}) = pi qj, so that P must be the
product probability.
We deduce that the product probability is the unique probability
on S × T with these two "independence properties".
Example: When three horses A, B, C race against each
other their respective probabilities of winning are always 1/2, 1/3
and 1/6. Suppose they race twice. Then, assuming independence,
the probability of C winning the first race and A winning the
second race is 1/6 × 1/2 = 1/12 etc.
Now suppose that we perform the same experiment a number
of times. The sample space S × S × ...× S consists of tuples.
If we assume that the experiments are independent, then the
probability function on this sample space is the product prob-
ability. If we do it n times we say that we have n independent
trials.
Example: Toss a coin three times as before. Now e.g.
P(HTH) = 1/2 × 1/2 × 1/2 = 1/8 etc. This is the same probability
function as assuming that all triples are equiprobable, as before.
Hence we can consider the problem in either of the two ways.
Counting Techniques
Suppose we have n objects. How many permutations of size
1 ≤ r < n can be made?
Using the Fundamental Principle of Counting gives n!/(n − r)!, which
is written nPr.
How many combinations of size 1 ≤ r < n can be made?
Let the answer be nCr. Each of these combinations gives r!
permutations, so nCr × r! = nPr. Hence nCr = nPr/r! = n!/((n − r)! r!).
If r = 0 or r = n the answer is 1, so if we agree that 0! = 1,
we can use the formula for all 0 ≤ r ≤ n.
A lot of problems in finite probability can be done from "first
principles" i.e. using "boxes".
Example: The birthday problem: How many people do
we need to ensure that the probability of at least two having
the same birthday is greater than 1/2?
Let the answer be n. Then 1 − 365Pn/(365)^n > 1/2, so n = 23. For
more on perms. and combs. see Schaum Series.
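The birthday calculation is easy to verify numerically; the sketch below searches for the smallest n with a better than even chance of a shared birthday:

```python
def p_shared_birthday(n):
    # 1 - 365Pn / 365**n : probability that at least two of n people share a birthday
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct

n = 1
while p_shared_birthday(n) <= 0.5:
    n += 1
print(n, p_shared_birthday(n))   # 23, ~0.507
```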
Random Variables
Definition: Let S be a sample space with probability function
P. A random variable (R.V.) is a function X : S → R.
The image or range of X is denoted by RX. If RX is finite
we say that X is a finite or finitely discrete R.V., if RX is
countably infinite we say that X is a countably infinite or
infinitely discrete R.V. and if RX is uncountable we say that
X is an uncountable or continuous R.V.
Example: (i) Throw a pair of dice. S = {(1, 1), ..., (6, 6)}.
Let X : S → R be the maximum number of each pair and
let Y : S → R be the sum of the two numbers. Then RX =
{1, 2, 3, 4, 5, 6} and RY = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. X
and Y are finite R.V.s.
(ii) Toss a coin until H appears. S = {H, TH, TTH, TTTH, ...}.
Let X : S → R be the number of times the coin is tossed,
or (the number of Ts) + 1. Then RX = {1, 2, 3, ..., ∞}. X is a
countably infinite R.V.
(iii) A point is chosen on a disc of radius 1.
S = {(x, y) | x² + y² ≤ 1}. Let X : S → R be the distance of
the point from the centre (0, 0). Then RX = [0, 1] and X is a
continuous R.V.
Finite Random Variables
Let X : S → RX be finite with RX = {x1, x2, ..., xn} say.
X induces a function f on RX by f(xk) = P(X = xk) =
P({s ∈ S | X(s) = xk}). f is called a probability distribution
function (p.d.f.). Note that f(xk) ≥ 0 and Σ_{k=1}^n f(xk) = 1.
We can extend f to all of R by defining f(x) = 0 for all
x ≠ x1, x2, ..., xn. We often write f using a table:
Recall the idea of a discrete frequency distribution:
If we let f(xk) = fk / Σ_{i=1}^n fi, the relative frequency, we get a
probability distribution. Its graph is a bar-chart.
Note: We are usually interested in the distribution of a
particular R.V. rather than the underlying sample space e.g.
the heights or ages of a given set of people.
Example: X, Y from example (i) above:

x      1      2      3      4      5      6
f(x)   1/36   3/36   5/36   7/36   9/36   11/36

y      2      3      4      5      6      7      8      9      10     11     12
f(y)   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
Exercise: A fair coin is tossed three times. Let X be the
number of heads. RX = {0, 1, 2, 3}. Draw the distribution
table for X.
Definition: Let X be a R.V. with probability distribution
table as above. The expectation or mean of X is defined by
E(X) = x1 f(x1) + x2 f(x2) + ... + xn f(xn) = Σ_{k=1}^n xk f(xk).
This is the same definition as the mean in the case of a frequency
distribution.
Example: X, Y from (i) above again:
E(X) = 1 × 1/36 + 2 × 3/36 + ... + 6 × 11/36 = 4.47,
E(Y) = 2 × 1/36 + 3 × 2/36 + ... + 12 × 1/36 = 7.
Note that E(X) need not belong to RX.
We write µX, or simply µ if there is no confusion, for E(X).
E(X) is the weighted average where the weights are the probabilities.
We can apply this definition to games of chance:
a game of chance is an experiment with n outcomes a1, a2, ..., an
and corresponding probabilities p1, p2, ..., pn. Suppose the payout
for each ai is wi. We define a R.V. X by X(ai) = wi.
Then the average payout is E(X) = Σ_{i=1}^n wi pi. We would play
the game if E(X) > 0.
Example: A fair die is thrown. If 2, 3 or 5 occur we win
that number of euros. If 1, 4 or 6 occur we lose that number
of euros. Should we play?
We have a distribution table with payouts 2, 3, 5, −1, −4, −6,
each having probability 1/6.
Then E(X) = 2 × 1/6 + 3 × 1/6 + 5 × 1/6 − 1 × 1/6 − 4 × 1/6 − 6 × 1/6 = −1/6.
Don't play!
Definition: Let X be a finite R.V. The variance of X is defined by
var(X) = E((X − µ)²) = (x1 − µ)² f(x1) + (x2 − µ)² f(x2) + ... +
(xn − µ)² f(xn) = Σ_{k=1}^n (xk − µ)² f(xk). The standard deviation
of X is defined to be σX = √var(X). If X is understood, we
just write σ.
Note: var(X) = E((X − µ)²) = σ² = Σ_{k=1}^n (xk − µ)² f(xk)
= Σ_{k=1}^n (xk² − 2xkµ + µ²) f(xk)
= Σ_{k=1}^n xk² f(xk) − 2µ Σ_{k=1}^n xk f(xk) + µ² Σ_{k=1}^n f(xk)
= E(X²) − 2µ² + µ² = E(X²) − µ² = E(X²) − (E(X))².
Example: X, Y from (i) above again:
E(X²) = Σ_{k=1}^6 xk² f(xk) = 1² × 1/36 + 2² × 3/36 + ... + 6² × 11/36 = 21.97.
Hence var(X) = E(X²) − (E(X))² = 21.97 − 19.98 = 1.99.
E(Y²) = Σ_{k=1}^{11} yk² f(yk) = 2² × 1/36 + 3² × 2/36 + ... + 12² × 1/36 = 54.8.
Hence var(Y) = E(Y²) − (E(Y))² = 54.8 − 49 = 5.8.
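These means and variances can be recomputed by listing all 36 dice pairs; a minimal sketch:

```python
from itertools import product

pairs = list(product(range(1, 7), repeat=2))   # the 36 equally likely outcomes

def mean_var(values):
    ev = sum(values) / len(values)                   # E(X)
    ev2 = sum(v * v for v in values) / len(values)   # E(X^2)
    return ev, ev2 - ev * ev                         # var = E(X^2) - (E(X))^2

X = [max(a, b) for a, b in pairs]   # maximum of the two dice
Y = [a + b for a, b in pairs]       # sum of the two dice
print(mean_var(X))   # (4.472..., 1.971...)
print(mean_var(Y))   # (7.0, 5.833...)
```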
Theorem: If X is a finite R.V. and a, b ∈ R, then
(i) E(aX) = aE(X),
(ii) E(X + b) = E(X) + b.
Proof: (i) E(aX) = Σ_{k=1}^n a xk f(xk) = a Σ_{k=1}^n xk f(xk) = aE(X).
(ii) E(X + b) = Σ_{k=1}^n (xk + b) f(xk) = Σ_{k=1}^n xk f(xk) + b Σ_{k=1}^n f(xk) =
E(X) + b.
Hence E(aX + b) = E(aX) + b = aE(X) + b.
Theorem: If X is a finite R.V. and a, b ∈ R, then
(i) var(aX) = a² var(X),
(ii) var(X + b) = var(X).
Proof: (i) var(aX) = E((aX)²) − (E(aX))² = E(a²X²) −
(aE(X))² = a²E(X²) − a²(E(X))² = a²(E(X²) − (E(X))²) =
a² var(X). We could also just use the definition of var(X).
(ii) var(X + b) = E((X + b)²) − (E(X + b))²
= Σ_{k=1}^n (xk + b)² f(xk) − (E(X) + b)²
= Σ_{k=1}^n (xk² + 2bxk + b²) f(xk) − (E(X) + b)² = Σ_{k=1}^n xk² f(xk) +
2b Σ_{k=1}^n xk f(xk) + b² Σ_{k=1}^n f(xk) − (E(X))² − 2bE(X) − b²
= E(X²) + 2bE(X) + b² − (E(X))² − 2bE(X) − b²
= E(X²) − (E(X))² = var(X).
Hence var(aX + b) = var(aX) = a² var(X) and σ_{aX+b} = |a| σX.
Definition: If X is a R.V. with mean µ and standard
deviation σ, then the standardized R.V. associated with X is
defined as Z = (X − µ)/σ.
E(Z) = E((X − µ)/σ) = (E(X) − µ)/σ = (µ − µ)/σ = 0 and
var(Z) = (1/σ²) var(X) = σ²/σ² = 1.
So Z has mean 0 and standard deviation 1.
Definition: Let X be a finite R.V. with p.d.f. f(x). The
function F : R → R defined by F(x) = P(X ≤ x) = Σ_{xk ≤ x} f(xk)
is called the cumulative distribution function (c.d.f.) of X.
Example: Suppose X takes the values −2, 1, 2, 4 with
probabilities 1/4, 1/8, 1/2, 1/8 respectively.
Then F(−2) = 1/4, F(1) = 3/8, F(2) = 7/8, F(4) = 1. F(x) is obvious
for all other x.
F is a "step function". In general F is always an increasing
function since f(xk) ≥ 0 for all xk.
Note: The c.d.f. is more important for continuous distributions
(see later).
The Binomial Distribution
One of the most important finite distributions is the binomial
distribution. Suppose we have an experiment with only two
possible outcomes, called success and failure i.e. S = {s, f}.
Let P(s) = p and P(f) = q. Then p + q = 1. Each such
experiment is called a Bernoulli trial. Suppose we repeat
the experiment n times and assume that the trials are independent.
The sample space for the n Bernoulli trials is
S × S × ... × S and a typical element in the sample space looks
like (a1, a2, ..., an), where each ai = s or f. Define a R.V. X
on this sample space by X(a1, a2, ..., an) = the number of successes
in (a1, a2, ..., an). Then RX = {0, 1, 2, ..., n}. The p.d.f.
of X, f(x), is given by f(0) = q^n, f(1) = nC1 p q^(n−1), f(2) =
nC2 p² q^(n−2) etc. In general f(k) = nCk p^k q^(n−k).
Then f(0) + f(1) + ... + f(n) = nC0 q^n + nC1 p q^(n−1) + nC2 p² q^(n−2) +
... + nCk p^k q^(n−k) + ... + nCn p^n = (p + q)^n = 1^n = 1, by the Binomial
Theorem. X is said to have the binomial distribution,
written as B(n, p).
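The formula f(k) = nCk p^k q^(n−k) translates directly into code (using math.comb); the examples below can all be checked with it:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Sanity check: the probabilities sum to 1 (the Binomial Theorem).
print(sum(binom_pmf(k, 6, 0.5) for k in range(7)))   # ~1.0

# The coin example below: exactly two heads in six tosses.
print(binom_pmf(2, 6, 0.5))                          # 0.234375 = 15/64
```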
Example: A fair coin is tossed six times. Success is defined
to be heads. Here n = 6, p = 1/2, q = 1/2. Find the probability of
getting
(i) exactly two heads,
(ii) at least four heads,
(iii) at least one head.
(i) f(2) = 6C2 (1/2)^2 (1/2)^4 = 15/64.
(ii) f(4) + f(5) + f(6) = 6C4 (1/2)^4 (1/2)^2 + 6C5 (1/2)^5 (1/2)^1 + 6C6 (1/2)^6 =
22/64.
(iii) 1 − f(0) = 1 − (1/2)^6 = 63/64.
Example: The probability of hitting a target at any time
is 1/3. If we take seven shots, what is the probability of
(i) exactly three hits?
(ii) at least one hit?
(i) f(3) = 7C3 (1/3)^3 (2/3)^4 = 0.26.
(ii) 1 − (2/3)^7 = 0.94.
Example: Find the number of dice that must be thrown
such that there is a better than even chance of getting at least
one six.
Here p = 1/6, q = 5/6. Let n be the required number. Then
1 − (5/6)^n > 1/2, so (5/6)^n < 1/2 or n ln(5/6) < ln(1/2) i.e.
n > ln(1/2)/ln(5/6).
Hence n = 4.
Example: If 20% of the bolts produced by a machine are
defective, find the probability that out of 10 bolts chosen at
random,
(i) none
(ii) one
(iii) more than two
bolts will be defective. Here n = 10, p = 0.2, q = 0.8.
(i) f(0) = (0.8)^10,
(ii) f(1) = 10C1 (0.2)^1 (0.8)^9,
(iii) 1 − (f(0) + f(1) + f(2)) = 1 − ((0.8)^10 + 10C1 (0.2)^1 (0.8)^9 +
10C2 (0.2)^2 (0.8)^8).
Continuous Random Variables
Suppose that X is a R.V. on a sample space S whose range RX
is an interval in R or all of R. Then X is called a continuous
R.V. If there exists a piece-wise continuous function
f : R → R such that P(a ≤ X ≤ b) = ∫_a^b f(x) dx for any
a < b in R, then f is called the probability distribution
function (p.d.f.) or density function for X.
f must satisfy f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1. Note
that P(X = a) = P(a ≤ X ≤ a) = ∫_a^a f(x) dx = 0. We
define E(X) = ∫_{−∞}^{∞} x f(x) dx and var(X) = E((X − µ)²) =
∫_{−∞}^{∞} (x − µ)² f(x) dx, where µ = E(X). As for the finite case,
we can easily show that var(X) = E(X²) − (E(X))² =
∫_{−∞}^{∞} x² f(x) dx − µ². Again σ² = var(X).
The cumulative distribution function (c.d.f.) of X is defined
by F(x) = ∫_{−∞}^{x} f(t) dt. Then P(a ≤ X ≤ b) = F(b) − F(a).
Note that the Fundamental Theorem of Calculus implies that
F′(x) = f(x) for all x.
Example: X is a R.V. with p.d.f.
f(x) = x/2 for 0 ≤ x < 2, and f(x) = 0 for x < 0 or x > 2.
P(1 ≤ X ≤ 1.5) = ∫_1^{1.5} (x/2) dx = 5/16. E(X) = ∫_{−∞}^{∞} x f(x) dx =
∫_0^2 (x²/2) dx = 4/3 and var(X) = E(X²) − (E(X))² = ∫_0^2 (x³/2) dx −
(4/3)² = 2/9, so σ = √2 / 3.
If x < 0, then F(x) = ∫_{−∞}^x f(t) dt = 0.
If 0 ≤ x ≤ 2, then F(x) = ∫_{−∞}^x f(t) dt = ∫_0^x (t/2) dt = x²/4.
If 2 < x, then F(x) = ∫_{−∞}^x f(t) dt = ∫_0^2 (t/2) dt = 1, as we would
expect!
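The integrals here are simple to do by hand, but a crude Riemann-sum check (a sketch only) confirms E(X) = 4/3 and var(X) = 2/9 for f(x) = x/2 on [0, 2]:

```python
N = 100_000
dx = 2 / N
xs = [(i + 0.5) * dx for i in range(N)]    # midpoints of [0, 2]

f = lambda x: x / 2                        # the p.d.f. on [0, 2]

total = sum(f(x) * dx for x in xs)         # should be ~1
mean = sum(x * f(x) * dx for x in xs)      # E(X) ~ 4/3
second = sum(x * x * f(x) * dx for x in xs)
print(total, mean, second - mean**2)       # ~1, ~1.3333, ~0.2222
```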
Recall the idea of a grouped frequency distribution:
Example: The heights of 1,000 people.
Here 62 − 63 means 62 ≤ h ≤ 63 etc. The relative frequencies
are 50/1,000 = 0.05 etc. Σ(rel. freqs.) = 1. The graph is now a
histogram.
The area of each rectangle represents the rel. freq. of that
group. The proportion of people of height less than any number
is the sum of the areas up to that number. Joining the
midpoints of the tops of the rectangles gives a "bell-shaped"
curve. For large populations the (relative) frequency function
approximates to such a curve.
The most important continuous random variable is the normal
distribution, whose p.d.f. has a "bell shape".
Definition: A R.V. is said to be normally distributed if its
p.d.f. has the form f(x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²). We say that X is
N(µ, σ²) if this f(x) is its p.d.f. Note that f(x) is symmetric
about x = µ and the bigger σ is, the wider the graph of f(x).
The following theorem shows that this f(x) gives a well-defined
p.d.f.
Theorem: (i) ∫_{−∞}^{∞} e^(−x²/2) dx = √(2π).
(ii) ∫_{−∞}^{∞} e^(−(1/2)((x−µ)/σ)²) dx = √(2π) σ.
(iii) If X is N(µ, σ²), then E(X) = µ and var(X) = σ².
Proof: See tutorial 3.
Suppose X is N(µ, σ²). Its c.d.f. is F(x) = (1/(σ√(2π))) ∫_{−∞}^{x} e^(−(1/2)((v−µ)/σ)²) dv
and P(a ≤ X ≤ b) = F(b) − F(a) = (1/(σ√(2π))) ∫_a^b e^(−(1/2)((v−µ)/σ)²) dv.
These integrals can't be found analytically, so they are tabulated
numerically. This would have to be done for all values of
µ and σ. We do so for the case µ = 0 and σ = 1 and then use
the standardized normal R.V. Z = (X − µ)/σ.
For µ = 0, σ = 1 we write Z for the R.V. and z for the real
variable. We denote the p.d.f. of Z by φ(z) = (1/√(2π)) e^(−z²/2) and its
c.d.f. by Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^(−u²/2) du.
Consider F(x) = (1/(σ√(2π))) ∫_{−∞}^{x} e^(−(1/2)((v−µ)/σ)²) dv, the c.d.f. of X. Let
u = (v − µ)/σ; dv = σ du; as v → −∞, u → −∞ and when
v = x, u = (x − µ)/σ.
Therefore F(x) = (1/√(2π)) ∫_{−∞}^{(x−µ)/σ} e^(−u²/2) du = Φ((x − µ)/σ).
Hence P(a ≤ X ≤ b) = F(b) − F(a) = Φ((b − µ)/σ) − Φ((a − µ)/σ).
From the tables we have:
P(µ − σ ≤ X ≤ µ + σ) = Φ(1) − Φ(−1) = P(−1 ≤ Z ≤ 1) = 0.682,
P(µ − 2σ ≤ X ≤ µ + 2σ) = Φ(2) − Φ(−2) = P(−2 ≤ Z ≤ 2) = 0.954, and
P(µ − 3σ ≤ X ≤ µ + 3σ) = Φ(3) − Φ(−3) = P(−3 ≤ Z ≤ 3) = 0.997, etc.
The tables give values of Φ(z) for 0 ≤ z ≤ 3, in steps of 0.01.
Φ(z) = P(−∞ < Z ≤ z) and Φ(0) = P(−∞ < Z ≤ 0) = 0.5.
Then Φ(z) = 1 − Φ(−z) for z < 0.
Suppose that a < b.
(i) 0 ≤ a < b:
P(a ≤ Z ≤ b) = Φ(b) − Φ(a).
(ii) a < 0 < b:
P(a ≤ Z ≤ b) = Φ(b) − Φ(a)
= Φ(b) − (1 − Φ(−a))
= Φ(b) + Φ(−a) − 1.
(iii) a < b < 0:
P(a ≤ Z ≤ b) = Φ(b) − Φ(a)
= 1 − Φ(−b) − (1 − Φ(−a))
= Φ(−a) − Φ(−b).
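Instead of tables, Φ(z) can be computed from the error function; the sketch below defines Φ and reproduces the 0.682, 0.954 and 0.997 figures quoted above:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal c.d.f.: Phi(z) = P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(Phi(1) - Phi(-1))   # ~0.6827
print(Phi(2) - Phi(-2))   # ~0.9545
print(Phi(3) - Phi(-3))   # ~0.9973
print(Phi(2.44))          # ~0.9927, as in the first example below
```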
Example: For N(0, 1) find
(i) P(Z ≤ 2.44)
(ii) P(Z ≤ −1.16)
(iii) P(Z ≥ 1)
(iv) P(2 ≤ Z ≤ 10)
(i) P(Z ≤ 2.44) = Φ(2.44) = 0.9927.
(ii) P(Z ≤ −1.16) = 1 − P(Z ≤ 1.16) =
1 − Φ(1.16) = 1 − 0.877 = 0.123.
(iii) P(Z ≥ 1) = 1 − P(Z < 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.
(iv) P(2 ≤ Z ≤ 10) = Φ(10) − Φ(2) = 1 − 0.9772 = 0.0228.
Example: For N(0, 1) find c if
(i) P(Z ≥ c) = 10%
(ii) P(Z ≤ c) = 5%
(iii) P(0 ≤ Z ≤ c) = 45%
(iv) P(−c ≤ Z ≤ c) = 99%.
(i) P(Z ≥ c) = 1 − P(Z ≤ c) = 1 − Φ(c), so 1 − Φ(c) = 0.1
or Φ(c) = 0.9, giving c = 1.282.
(ii) P(Z ≤ c) = Φ(c), so Φ(c) = 0.05, giving c = −1.645.
(iii) P(0 ≤ Z ≤ c) = Φ(c) − 0.5, so Φ(c) = 0.95, giving
c = 1.645.
(iv) P(−c ≤ Z ≤ c) = Φ(c) − Φ(−c) = 2Φ(c) − 1, so
2Φ(c) = 1.99 or Φ(c) = 0.995, giving c = 2.576.
Example: For X = N(0.8, 4) find
(i) P(X ≤ 2.44)
(ii) P(X ≤ −1.16)
(iii) P(X ≥ 1)
(iv) P(2 ≤ X ≤ 10)
(i) P(X ≤ 2.44) = F(2.44) = Φ((2.44 − 0.8)/2) = Φ(0.82) = 0.7939.
(ii) P(X ≤ −1.16) = F(−1.16) = Φ((−1.16 − 0.8)/2) =
Φ(−0.98) = 0.1635.
(iii) P(X ≥ 1) = 1 − P(X ≤ 1) = 1 − F(1) = 1 − Φ((1 − 0.8)/2) =
1 − Φ(0.1) = 0.4602.
(iv) P(2 ≤ X ≤ 10) = F(10) − F(2) = Φ((10 − 0.8)/2) − Φ((2 − 0.8)/2) =
Φ(4.6) − Φ(0.6) = 0.2743.
Example: Assume that the distance an athlete throws a
shotputt is a normal R.V. X with mean 17 m and standard
deviation 2 m. Find
(i) the probability that the athlete throws a distance greater
than 18.5 m,
(ii) the distance d that the throw will exceed with 95% probability.
(i) P(X > 18.5) = 1 − P(X ≤ 18.5) = 1 − F(18.5) =
1 − Φ((18.5 − 17)/2) = 1 − Φ(0.75) = 0.2266.
(ii) P(X > d) = 0.95, so 1 − F(d) = 0.95 or F(d) = 0.05.
Hence Φ((d − 17)/2) = 0.05, so (d − 17)/2 = −1.645, giving d = 13.71 m.
Example: The average life of a stove is 15 years with standard
deviation 2.5 years. Assuming that the lifetime X of the
stoves is normally distributed, find
(i) the percentage of stoves that will last only 10 years or
less,
(ii) the percentage of stoves that will last between 16 and
20 years.
(i) P(X ≤ 10) = F(10) = Φ((10 − 15)/2.5) = Φ(−2) = 1 − Φ(2) =
0.0228 = 2.28%.
(ii) P(16 ≤ X ≤ 20) = F(20) − F(16) = Φ((20 − 15)/2.5) −
Φ((16 − 15)/2.5) = Φ(2) − Φ(0.4) = 0.3218.
Jointly Distributed Random Variables
Let X, Y be finite R.V.s on the same sample space S with probability
function P. Let the range of X be RX = {x1, x2, ..., xn}
and the range of Y be RY = {y1, y2, ..., ym} respectively.
Consider the pair (X, Y) defined on S by (X, Y)(s) = (X(s), Y(s)).
Then (X, Y) is a R.V. on S with range ⊆ RX × RY =
{(x1, y1), ..., (x1, ym), (x2, y1), ..., (x2, ym), ..., (xn, y1), ..., (xn, ym)}.
We sometimes call (X, Y) a vector R.V.
Let Ai = {s ∈ S | X(s) = xi} = {X = xi} and
Bj = {s ∈ S | Y(s) = yj} = {Y = yj}. We write
Ai ∩ Bj = {X = xi, Y = yj}. Define a function
h : RX × RY → R by h(xi, yj) = P(Ai ∩ Bj)
= P(X = xi, Y = yj). Then h(xi, yj) ≥ 0 and Σ_{i,j} h(xi, yj) = 1,
since the Ai ∩ Bj form a partition of S. h is called the joint
probability distribution function of (X, Y) associated with the
probability function P and X, Y are said to be jointly distributed.
Suppose that f and g are the p.d.f.s of X and Y
respectively. What is the connection between f, g and h?
S = ∪_{j=1}^m Bj, so Ai = Ai ∩ S = Ai ∩ (∪_{j=1}^m Bj) = ∪_{j=1}^m (Ai ∩ Bj),
a disjoint union. Therefore f(xi) = P(Ai) = P(∪_{j=1}^m (Ai ∩ Bj)) =
Σ_{j=1}^m P(Ai ∩ Bj) = Σ_{j=1}^m h(xi, yj). Similarly g(yj) = Σ_{i=1}^n h(xi, yj).
f and g are sometimes called the marginal distributions of h.
We often write the joint distribution in a table:
Example: Throw a pair of dice. Let X(a, b) = max{a, b}
and Y(a, b) = a + b.
Definition: X and Y are independent if h(xi, yj) = f (xi)g(yj)
for all i, j.
This means that P (X = xi, Y = yj) = P (X = xi)P (Y = yj)
or P (Ai ∩Bj) = P (Ai)P (Bj) for all i, j.
Note that in the above example X and Y are not independent.
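For the dice example the joint table h and the marginals f, g can be generated mechanically, and the failure of independence checked directly; a minimal sketch (Fractions keep the 1/36 arithmetic exact):

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

h = defaultdict(Fraction)                    # joint p.d.f. h(x, y)
for a, b in product(range(1, 7), repeat=2):
    h[(max(a, b), a + b)] += Fraction(1, 36)

f = defaultdict(Fraction)                    # marginal of X = max
g = defaultdict(Fraction)                    # marginal of Y = sum
for (x, y), p in h.items():
    f[x] += p
    g[y] += p

# X and Y are not independent: e.g. h(1, 2) != f(1) g(2).
print(h[(1, 2)], f[1] * g[2])                # 1/36 vs 1/1296
```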
Definition: If G : R² → R, then we define a R.V.
G(X, Y) on S by G(X, Y)(s) = G(X(s), Y(s)) = G(xi, yj),
with p.d.f. h.
We now define the expectation and variance of G(X, Y) as
E(G(X, Y)) = Σ_{i,j} G(xi, yj) h(xi, yj) and
var(G(X, Y)) = E(G(X, Y)²) − (E(G(X, Y)))².
Example: G(x, y) = x + y. Then (X + Y)(s) = X(s) +
Y(s) = xi + yj and E(X + Y) = Σ_{i,j} (xi + yj) h(xi, yj), etc.
Theorem: (i) For any R.V.s X and Y we have E(X + Y) =
E(X) + E(Y).
(ii) If X and Y are independent, then var(X + Y) =
var(X) + var(Y). (This is not true in general.)
Proof: (i) E(X + Y) = Σ_{i,j} (xi + yj) h(xi, yj)
= Σ_i Σ_j xi h(xi, yj) + Σ_j Σ_i yj h(xi, yj)
= Σ_i xi Σ_j h(xi, yj) + Σ_j yj Σ_i h(xi, yj) = Σ_i xi f(xi) + Σ_j yj g(yj)
= E(X) + E(Y).
(ii) First we show that E(XY) = E(X)E(Y).
E(XY) = Σ_{i,j} xi yj h(xi, yj) = Σ_{i,j} xi yj f(xi) g(yj)
= (Σ_i xi f(xi))(Σ_j yj g(yj)) = E(X)E(Y).
Now var(X + Y) = E((X + Y)²) − (E(X + Y))²
= E(X² + 2XY + Y²) − (E(X) + E(Y))² = E(X²) +
2E(X)E(Y) + E(Y²) − (E(X))² − (E(Y))² − 2E(X)E(Y)
= E(X²) − (E(X))² + E(Y²) − (E(Y))² = var(X) + var(Y).
Important Example: Consider the Binomial Distribution
B(n, p). The sample space is S × S × ... × S, n times,
where S = {s, f}. For 1 trial, n = 1, define the R.V. X by
X(s) = 1 and X(f) = 0. Then E(X) = 1 × p + 0 × q = p
and var(X) = E(X²) − (E(X))² = p − p² = p(1 − p) = pq.
For n trials define X1(a1, a2, ..., an) = X(a1), ...,
Xn(a1, a2, ..., an) = X(an), so that Xi is 1 if s is in the ith
place and 0 if f is in the ith place, for all 1 ≤ i ≤ n.
Then E(Xi) = E(X) = p and var(Xi) = var(X) = pq.
Now let Y = X1 + X2 + ... + Xn, so that Y gives the total
number of successes in the n trials. Then E(Y) = E(X1) +
E(X2) + ... + E(Xn) = p + p + ... + p = np and var(Y) =
var(X1) + var(X2) + ... + var(Xn) = npq.
Sampling Theory
Suppose that we have an infinite or very large finite sample
space S. This sample space is often called a population. Get-
ting information about the total population may be difficult,
so we consider much smaller subsets of the population, called
samples. We want to get information about the population
by studying the samples. We consider the samples to be ran-
dom samples i.e. each element of the population has the same
probability of being in a sample.
Example: Consider the population of Ireland. Pick a per-
son at random and consider the age of this person. Do this
n times. This gives a random sample of size n of the ages of
people in Ireland.
Mathematically the situation is described in the following way:
Let X be a random variable on a sample space S with proba-
bility function P and let f (x) be the probability distribution
function of X . Consider the sample space Ω = S×S× ...×S,
(n times) with the product probability function P i.e.
P(A1 × A2 × ... × An) = P(A1)P(A2)...P(An).
For each 1 ≤ i ≤ n define a random variable Xi on Ω by
Xi(s1, s2, ..., sn) = X(si) where (s1, s2, ..., sn) ∈ Ω. Then the
probability distribution function of Xi is also f (x) for each
i. The vector random variable (X1, X2, ..., Xn) defined by
(X1, X2, ..., Xn)(s1, s2, ..., sn) = (X1(s1), X2(s2), ..., Xn(sn)) =
(x1, x2, ..., xn) is a random variable on Ω with joint distribution
P(X1 = x1, X2 = x2, ..., Xn = xn) = f (x1)f (x2)...f (xn).
Choosing a sample is simply applying the vector random vari-
able (X1, X2, ..., Xn) to Ω to get a random sample (x1, x2, ..., xn).
Each Xi has the same mean µ and variance σ2 as X and
they are independent, by definition. They are called indepen-
dent identically distributed random variables (i.i.d.). Func-
tions of the X1, X2, ..., Xn and numbers associated with them
are called statistics, while functions of the original X and as-
sociated numbers are called parameters. Our task is to get
information about the parameters by studying the statistics.
The mean µ and variance σ2 of X are called the population
mean and variance. We define two important statistics, the
sample mean and the sample variance:
Definition: Sample mean X̄ = (X1 + X2 + ... + Xn)/n,
sample variance S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1).
Then X̄(s1, s2, ..., sn) = (x1 + x2 + ... + xn)/n = x̄ = µS and
S²(s1, s2, ..., sn) = Σ_{i=1}^n (xi − x̄)²/(n − 1).
We have
Theorem: (i) The expectation of X̄ is E(X̄) = µ, the population
mean,
(ii) The variance of X̄ is σ²_X̄ = σ²/n, the population variance over
n.
Proof: (i) E(X̄) = E((X1 + X2 + ... + Xn)/n) =
(E(X1) + E(X2) + ... + E(Xn))/n = (µ + µ + ... + µ)/n = nµ/n = µ.
(ii) σ²_X̄ = var(X̄) = var((X1 + X2 + ... + Xn)/n) = var(X1/n) + var(X2/n) +
... + var(Xn/n) = var(X1)/n² + var(X2)/n² + ... + var(Xn)/n² = nσ²/n² = σ²/n.
The reason for the n − 1 instead of n in the definition of S² is
given by the following result.
Theorem: E(S²) = σ², the population variance.
Proof: E(S²) = E(Σ_{i=1}^n (Xi − X̄)²/(n − 1)) = (1/(n−1)) E(Σ_{i=1}^n (Xi − X̄)²) =
(1/(n−1)) E(Σ_{i=1}^n (Xi² − 2XiX̄ + X̄²)) =
(1/(n−1)) [Σ_{i=1}^n E(Xi²) − 2E((Σ_{i=1}^n Xi)X̄) + nE(X̄²)] =
(1/(n−1)) [Σ_{i=1}^n E(Xi²) − 2E((nX̄)(X̄)) + nE(X̄²)] =
(1/(n−1)) [n(σ² + µ²) − 2nE(X̄²) + nE(X̄²)] =
(1/(n−1)) [n(σ² + µ²) − nE(X̄²)] = (1/(n−1)) [n(σ² + µ²) − n(σ²/n + µ²)] =
(1/(n−1)) [(n − 1)σ²] = σ²,
using E(Xi²) = var(Xi) + (E(Xi))² = σ² + µ² and
E(X̄²) = var(X̄) + (E(X̄))² = σ²/n + µ².
Note: If the mean or expectation of a statistic is equal to
the corresponding parameter, the statistic is called an unbiased
estimator of the parameter. Hence X̄ and S² are unbiased
estimators of µ and σ² respectively. An estimate of
a population parameter given by a single number is called a
point estimate e.g. if we take a sample of size n and calculate
µS = (x1 + x2 + ... + xn)/n and S² = Σ_{i=1}^n (xi − x̄)²/(n − 1), then these are
unbiased point estimates of µ and σ² respectively. We shall, however,
concentrate on interval estimates, where the parameter lies
within some interval, called a confidence interval.
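A small simulation (a sketch only, with an arbitrarily chosen normal population) illustrates why the n − 1 divisor matters: dividing by n − 1 averages out close to σ², while dividing by n systematically underestimates it:

```python
import random

random.seed(1)
mu, sigma, n, trials = 10.0, 2.0, 5, 20000

s2_unbiased, s2_biased = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    s2_unbiased += ss / (n - 1)
    s2_biased += ss / n

print(s2_unbiased / trials)   # ~4.0 = sigma^2
print(s2_biased / trials)     # ~3.2 = (n-1)/n * sigma^2
```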
Confidence Intervals for µ
Suppose we have n i.i.d. random variables X1, X2, ..., Xn with
E(Xi) = µ and var(Xi) = σ² for each 1 ≤ i ≤ n. Then if
X̄ = (X1 + X2 + ... + Xn)/n, we get E(X̄) = µ and var(X̄) = σ²/n.
As before, X1, X2, ..., Xn are jointly distributed random variables
defined on the product sample space. We have the very
important result:
Central Limit Theorem: For large n (≥ 30) the probability
distribution of X̄ is approximately normal with mean
µ and variance σ²/n, i.e. N(µ, σ²/n) or, in other words, (X̄ − µ)/(σ/√n) is
approximately N(0, 1). The larger the n the better the approximation.
Note: X1, X2, ..., Xn or X need not be normal. If, however,
they are normal, then (X̄ − µ)/(σ/√n) is exactly N(0, 1) for all
values of n.
Recall N(0, 1).
If we want P(−z1 ≤ Z ≤ z1) = 95%, then Φ(z1) = 97.5%, so
z1 = 1.96. We say that −1.96 ≤ Z ≤ 1.96 is a 95% confidence
interval for N(0, 1).
Hence P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96) = 95% ⇐⇒
P(−1.96 × σ/√n ≤ X̄ − µ ≤ 1.96 × σ/√n) = 95% ⇐⇒
P(−1.96 × σ/√n − X̄ ≤ −µ ≤ 1.96 × σ/√n − X̄) = 95% ⇐⇒
P(1.96 × σ/√n + X̄ ≥ µ ≥ −1.96 × σ/√n + X̄) = 95% ⇐⇒
P(X̄ − 1.96 × σ/√n ≤ µ ≤ X̄ + 1.96 × σ/√n) = 95%.
If we know σ this gives us a 95% confidence interval for µ i.e.
given any random sample there is a 95% probability that µ lies
within the above interval, or we can say with 95% confidence
that µ is between the two limits of the interval. Put another
way, 95% of samples will have µ in the above interval.
Example: A sample of size 100 is taken from a population
with unknown mean µ and variance 9. Determine a 95% confidence
interval for µ if the sample mean is 5.
Here X̄ = 5, σ = 3 and n = 100.
P(X̄ − 1.96 × σ/√n ≤ µ ≤ X̄ + 1.96 × σ/√n) = 95% ⇐⇒
P(5 − 1.96 × 3/10 ≤ µ ≤ 5 + 1.96 × 3/10) = 95% ⇐⇒
P(4.412 ≤ µ ≤ 5.588) = 95%.
Example: A sample of size 80 is taken from the workers
in a very large company. The average wage of the sample of
workers is 25,000 euro. If the standard deviation of the whole
company is 1,000 euro, construct a confidence interval for the
mean wage in the company at the 95% level.
Here X̄ = 25,000, σ = 1,000 and n = 80.
P(X̄ − 1.96 × σ/√n ≤ µ ≤ X̄ + 1.96 × σ/√n) = 95% ⇐⇒
P(25,000 − 1.96 × 1,000/√80 ≤ µ ≤ 25,000 + 1.96 × 1,000/√80) = 95%
⇐⇒ P(24,781 ≤ µ ≤ 25,219) = 95%.
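Both examples follow the same recipe, so it is worth writing it once; a sketch of the z-based interval X̄ ± z σ/√n (z = 1.96 for 95%, 2.575 for 99%):

```python
from math import sqrt

def z_interval(xbar, sigma, n, z=1.96):
    """Confidence interval xbar ± z * sigma / sqrt(n)."""
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

print(z_interval(5, 3, 100))          # (4.412, 5.588)
print(z_interval(25_000, 1_000, 80))  # approx (24781, 25219)
```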
We can have different confidence intervals:
Let α be a small percentage (5% above). Then
P(X̄ − z_{α/2} × σ/√n ≤ µ ≤ X̄ + z_{α/2} × σ/√n) = 1 − α gives a
100(1 − α)% confidence interval, where Φ(z_{α/2}) = 1 − α/2.
Note: For a 95% interval we have Φ(z_{α/2}) = 0.975, so z_{α/2} =
1.96, and for a 99% interval Φ(z_{α/2}) = 0.995, so z_{α/2} = 2.575.
Example: Determine a 99% confidence interval for the
mean of a normal population if the population variance is σ² =
4.84, using the sample 28, 24, 31, 27, 22.
(Note that we need normality here since the sample size < 30.)
X̄ = (28 + 24 + 31 + 27 + 22)/5 = 26.4 and σ = √4.84 = 2.2. Then
P(26.4 − 2.575 × 2.2/√5 ≤ µ ≤ 26.4 + 2.575 × 2.2/√5) = 0.99 ⇐⇒
P(23.867 ≤ µ ≤ 28.933) = 0.99.
Example: If we have a normally distributed population
with σ² = 9, how large must a sample be if the 95% confidence
interval is to have length at most 0.4?
In general the length of the confidence interval is
(X̄ + z_{α/2} × σ/√n) − (X̄ − z_{α/2} × σ/√n) = 2 z_{α/2} × σ/√n. For the 95%
confidence interval z_{α/2} = 1.96, so 2 × 1.96 × 3/√n ≤ 0.4, which
gives √n ≥ 2 × 1.96 × 3/0.4, so n must be at least 865.
In all the previous examples we knew σ², the population variance.
If that is not so and n ≥ 30 we can use S² as a point
estimate for σ² and assume that (X̄ − µ)/(S/√n) is approximately N(0, 1).
Example: A watch-making company wants to investigate
the average life µ of its watches. In a random sample of 121
watches it is found that X̄ = 14.5 years and S = 2 years.
Construct a (i) 95%, (ii) 99% confidence interval for µ.
(i) 14.5 − 1.96 × 2/11 ≤ µ ≤ 14.5 + 1.96 × 2/11 ⇐⇒
14.14 ≤ µ ≤ 14.86.
(ii) 14.5 − 2.575 × 2/11 ≤ µ ≤ 14.5 + 2.575 × 2/11 ⇐⇒
14.03 ≤ µ ≤ 14.97.
Note that the greater the confidence the greater the interval.
If n is small (< 30) this is not very accurate, even if the original
X is normal. In this case we must use the following:
Theorem: If X1, X2, ..., Xn are independent normally distributed
random variables, each with mean µ and variance σ²,
then the random variable (X̄ − µ)/(S/√n) has a t-distribution with n − 1
degrees of freedom.
We denote the number of degrees of freedom n − 1 by ν. For
each ν the t-distribution is a symmetric bell-shaped distribution.
For ν = ∞ we get the standard normal distribution N(0, 1).
The statistical tables usually read P(|T| > k) for each ν.
Example: ν = 5, P(|T| > k) = 0.01. Then k = 4.032, so
P(−4.032 ≤ T ≤ 4.032) = 99%.
Example: A certain population is normal with unknown
mean and variance. A sample of size 20 is taken. The sample
mean is 15.5 and the sample variance is 0.09. Obtain a 99%
confidence interval for µ, the population mean.
Since n = 20 < 30 we must use the t-distribution with ν =
n − 1 = 19. We have X̄ = 15.5 and S² = 0.09, so S = 0.3. For ν = 19 we
have P(|T| > k) = 0.01, giving k = 2.861.
Now (X̄ − µ)/(S/√20) has a t-distribution with 19 degrees of freedom, so
P(−2.861 ≤ (15.5 − µ)/(0.3/√20) ≤ 2.861) = 99% ⇐⇒
P(15.5 − 2.861 × 0.3/√20 ≤ µ ≤ 15.5 + 2.861 × 0.3/√20) = 99% ⇐⇒
P(15.308 ≤ µ ≤ 15.692) = 99%.
Example: Five independent measurements, in degrees F,
of the flashpoint of diesel oil gave the results 144, 147, 146, 142, 144.
Assuming normality, determine a (i) 95%, (ii) 99% confidence
interval for the mean flashpoint.
Since n < 30 we must apply the t-distribution. n = 5, so
ν = 4. We have X̄ = (144 + 147 + 146 + 142 + 144)/5 = 144.6. Also
S² = ((−0.6)² + (2.4)² + (1.4)² + (−2.6)² + (−0.6)²)/4 = 3.8, so S = 1.949.
(i) P(144.6 − 2.776 × 1.949/√5 ≤ µ ≤ 144.6 + 2.776 × 1.949/√5) = 95% ⇐⇒
P(142.18 ≤ µ ≤ 147.02) = 95%.
(ii) P(144.6 − 4.604 × 1.949/√5 ≤ µ ≤ 144.6 + 4.604 × 1.949/√5) = 99%
⇐⇒ P(140.59 ≤ µ ≤ 148.61) = 99%.
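The same recipe with a t critical value handles the small-sample case; the critical values 2.776 and 4.604 for ν = 4 are the table values used above:

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Confidence interval xbar ± t_crit * s / sqrt(n), with t_crit from tables for nu = n - 1."""
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width

data = [144, 147, 146, 142, 144]
n = len(data)
xbar = sum(data) / n
s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
print(t_interval(xbar, s, n, 2.776))   # ~ (142.18, 147.02), the 95% interval
print(t_interval(xbar, s, n, 4.604))   # ~ (140.59, 148.61), the 99% interval
```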
Hypothesis Testing
Suppose that a claim is made about some parameter of a pop-
ulation, in our case always the population mean µ. This claim
is called the null hypothesis and is denoted by H0. Any claim
that differs from this is called an alternative hypothesis, de-
noted by H1. We must test H0 against H1.
Example: H0 : µ = 90.
Possible alternatives are
H1 : µ ≠ 90
H1 : µ > 90
H1 : µ < 90
H1 : µ = 95.
We must decide whether to accept or reject H0. If we reject
H0 when it is in fact true we commit what is called a type I
error and if we accept H0 when it is in fact false we commit a
type II error. The maximum probability with which we would
be willing to risk a type I error is called the level of significance
of the test, usually 10%, 5% or 1%. We perform a hypothesis
test by taking a random sample from the population.
Suppose that we are given H0 : µ = µ0, some fixed value.
(i) We suspect that µ ≠ µ0. This is our H1.
We take a random sample and compute X̄. We might have X̄ > µ0, X̄ < µ0
or X̄ = µ0. Now if the mean is µ0, then X̄ is approximately
N(µ0, σ²/n), or (X̄ − µ0)/(σ/√n) is approximately N(0, 1).
There is a 5% probability that X̄ is in either of the end regions
(tails) of N(µ0, σ²/n) or, equivalently, that (X̄ − µ0)/(σ/√n) is in either of the end
regions of N(0, 1). If our (X̄ − µ0)/(σ/√n) is in this "rejection region"
we reject H0 at the 5% significance level. Otherwise we do not
reject H0. This is called a "two-tailed" test.
(ii) We suspect that µ > µ0. This is our H1.
Our X̄ is now > µ0 (this is why we suspect that µ > µ0). We
only check for probability on the right-hand side.
Again, if the mean is µ0, then if our (X̄ − µ0)/(σ/√n) is in this "rejection
region" we reject H0 at the 5% significance level. Otherwise we
do not reject H0. This is called a "one-tailed" test.
Note that a bigger µ0 may push X̄ into the non-rejection region.
(iii) We suspect that µ < µ0. This is our H1.
This is the same as (ii) but on the left.
Example: A battery company claims that its batteries
have an average life of 1,000 hours. In a sample of 100 batteries
it was found that X̄ = 985 hours and S = 30 hours.
Test the hypothesis H0 : µ = 1,000 hours against the alternative
hypothesis H1 : µ ≠ 1,000 hours at the 5% significance
level, assuming that the lifetime of the batteries is normally
distributed.
n = 100 > 30 so we can take S for σ. If µ = 1,000, then (X̄ − µ)/(S/√n)
is approximately N(0, 1). We are interested in extreme values
of X̄ on both sides of µ = 1,000 so we use a "two-tailed" test.
Values of (X̄ − µ)/(S/√n) will be between −1.96 and 1.96 95% of the time.
For our sample (X̄ − µ)/(S/√n) = (985 − 1,000)/(30/√100) = −5, which is (deep) in the
rejection region. So we reject H0 at the 5% significance level.
There is a 5% probability of a type I error.
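The test statistic and decision take only a few lines; a sketch of this two-tailed z-test:

```python
from math import sqrt

def z_statistic(xbar, mu0, s, n):
    """(xbar - mu0) / (s / sqrt(n)), approximately N(0,1) under H0 for large n."""
    return (xbar - mu0) / (s / sqrt(n))

z = z_statistic(985, 1000, 30, 100)
print(z)               # -5.0
print(abs(z) > 1.96)   # True: reject H0 at the 5% level (two-tailed)
```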
Example: A researcher claims that 10 year old children
watch 6.6 hours of television daily. In a sample of 100 it was
found that X̄ = 6.1 hours and S = 2.5 hours. Test the hypothesis
H0 : µ = 6.6 hours against the alternative H1 : µ ≠ 6.6
hours at the (i) 5%, (ii) 1% significance levels.
n = 100 > 30, so we can take S for σ.
Then (X̄ − µ)/(S/√n) is approximately N(0, 1).
(i) If µ = 6.6, then (X̄ − µ)/(S/√n) is between −1.96 and 1.96 with
probability 95%. But (X̄ − µ)/(S/√n) = (6.1 − 6.6)/(2.5/10) = −2, which is in the
rejection region. We reject H0 at the 5% level.
(ii) If µ = 6.6, then (X̄ − µ)/(S/√n) is between −2.575 and 2.575 with
probability 99%. But, as above, (X̄ − µ)/(S/√n) = (6.1 − 6.6)/(2.5/10) = −2, which
now is in the non-rejection region. We do not reject H0 at the
1% level.
Example: A manufacturer produces bulbs that are supposed
to burn with a mean life of at least 3,000 hours. The
standard deviation is 500 hours. A sample of 100 bulbs is
taken and the sample mean is found to be 2,800 hours. Test
the hypothesis H0 : µ ≥ 3,000 hours against the alternative
H1 : µ < 3,000 hours at the 5% significance level.
In this case if our X̄ value is greater than 3,000 we do not reject
H0, since it agrees with H0, so we are only interested in extreme
values on the left. We use a "one-tailed" test. Again (X̄ − µ)/(σ/√n)
is approximately N(0, 1) and (X̄ − µ)/(σ/√n) ≥ −1.645 with a probability
of 95%. But X̄ = 2,800, n = 100 and σ = 500, so
(X̄ − µ)/(σ/√n) = (2,800 − 3,000)/(500/10) = −4. Hence we reject H0 at the 5%
significance level.
We also need to use the t-distribution.
Example: We need to buy a length of a certain type of wire.
The manufacturer claims that the wire has a mean breaking
limit of 200 kg or more. We suspect that the mean is less. We
have H0 : µ ≥ 200 and H1 : µ < 200. We take a random
sample of 25 rolls of wire and find that X̄ = 197 kg and
S = 6 kg. Test H0 against H1 at the 5% level, assuming the
breaking limit of the wire is normally distributed.
Here n = 25 < 30, so we must use a t-distribution with ν = 24.
If the mean is µ, then (X̄ − µ)/(S/√n) has a t-distribution with 24 degrees
of freedom.
P(|T| > 1.711) = 10%, so P(T < −1.711) = 5%, and
(X̄ − µ)/(S/√n) = (197 − 200)/(6/√25) = −2.5, which is
in the rejection region. We reject H0.