3E1 - Probability Notes

Lecture Notes on Probability and Statistics

Joe Ó hÓgáin
E-mail: [email protected]

Main Text: Kreyszig, Advanced Engineering Mathematics
Other Texts: Schaum Series; Robert B. Ash; Hayter
Online Notes: Hamilton.ie EE304, Prof. Friedman Lectures

Probability Function

Definition: An experiment is an operation with well-defined outcomes.

Definition: The sample space S of an experiment is the set of all possible outcomes.

Examples: Toss a coin: S = {H, T}.
Throw a die: S = {1, 2, 3, 4, 5, 6}.
Toss a coin twice: S = {HH, HT, TH, TT}.
Toss a coin until H appears and count the number of times it is tossed: S = {1, 2, 3, ..., ∞}, where ∞ means that H never appears.

Definition: Any subset of S is called an event. Let P(S) be the set of all events in S, i.e. the collection of all subsets of S.

Definition: A probability function on S is a function P : P(S) −→ [0, 1] such that
(i) P(S) = 1,
(ii) P(A1 ∪ A2 ∪ ... ∪ An ∪ ...) = P(A1) + P(A2) + ... + P(An) + ...,
where A1, A2, ..., An, ... are mutually exclusive, i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Theorem 1: P(∅) = 0.
Proof: For any A ⊆ S, A ∪ ∅ = A and A ∩ ∅ = ∅, so P(A) = P(A ∪ ∅) = P(A) + P(∅); hence P(∅) = 0.

Theorem 2: P(Ac) = 1 − P(A).
Proof: A ∪ Ac = S and A ∩ Ac = ∅, so 1 = P(S) = P(A ∪ Ac) = P(A) + P(Ac).

Theorem 3: P(A) ≤ 1 for all A ⊆ S.
Proof: From Theorem 2 we get P(A) = 1 − P(Ac) ≤ 1.

Theorem 4: A ⊆ B =⇒ P(A) ≤ P(B).
Proof: B = A ∪ (B − A), disjoint, so P(B) = P(A) + P(B − A), and P(B − A) ≥ 0.

Finite Sample Space

An event containing one element is a singleton. If S contains n elements x1, x2, ..., xn say, and each one has the same probability p of occurring, then 1 = P(S) = P({x1, x2, ..., xn}) = P({x1} ∪ {x2} ∪ ... ∪ {xn}) = P(x1) + P(x2) + ... + P(xn) = p + p + ... + p = np, so p = 1/n. Then, for any A ⊆ S we have P(A) = |A|/|S|, where |A| means the number of elements of A. Conversely, if we define P on S by this formula, then it is easy to check that P gives a probability function on S.

Example: A card is selected from a pack of 52 cards. Let A = {hearts}, B = {face cards}. Then P(A) = 13/52, P(B) = 12/52, P(A ∩ B) = 3/52 and P(A ∪ B) = 13/52 + 12/52 − 3/52 = 22/52, using the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (or just count them).
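The counting in this example can be checked with a short Python sketch (the deck representation below is just one convenient choice):

```python
from fractions import Fraction
from itertools import product

# Equiprobable finite sample space: P(A) = |A| / |S|.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [(r, s) for r, s in product(ranks, suits)]   # |S| = 52

def prob(event):
    return Fraction(len(event), len(deck))

A = [c for c in deck if c[1] == 'hearts']            # hearts
B = [c for c in deck if c[0] in ('J', 'Q', 'K')]     # face cards
A_and_B = [c for c in deck if c in A and c in B]
A_or_B = [c for c in deck if c in A or c in B]

print(prob(A), prob(B), prob(A_and_B), prob(A_or_B))
# 1/4 3/13 3/52 11/26, i.e. 13/52, 12/52, 3/52 and 22/52
```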

In general singletons need not be equiprobable. Then let P(xi) = pi for 1 ≤ i ≤ n. (We write P(xi) for P({xi}) for convenience.) We have P(A) = Σ_{xi ∈ A} P(xi) and 1 = P(S) = Σ_{i=1}^{n} P(xi). We can form a table, called a probability distribution table:

x:    x1   x2   ...   xn
P(x): p1   p2   ...   pn

where pi = P(xi) for all 1 ≤ i ≤ n. Again, going backwards, the table defines a probability function on S.

Example: Three horses A, B, C race against each other. A is twice as likely to win as B and B is twice as likely to win as C. Assuming no dead heats, find P(A), P(B), P(C) and P(B ∪ C).
Let P(C) = p. Then P(B) = 2p and P(A) = 4p. Hence p + 2p + 4p = 1, so 7p = 1 or p = 1/7. Then P(C) = 1/7, P(B) = 2/7 and P(A) = 4/7. Also P(B ∪ C) = P(B) + P(C) = 3/7.

Countably Infinite Sample Space

In this case S = {x1, x2, ..., xn, ...} with P(xi) = pi for all i ≥ 1. Then 1 = Σ_{i=1}^{∞} pi.

Example: Toss a coin until H appears and count the number of times it is tossed: S = {1, 2, 3, ..., ∞}, where ∞ means that H never appears. Set P(1) = 1/2, P(2) = 1/4, ..., P(n) = 1/2^n for all n. (P(∞) = 0.)
Let A = {1, 2, 3}. Then P(A) = 1/2 + 1/4 + 1/8 = 7/8 = the probability of H in the first 3 throws.
If B = {2, 4, 6, ...}, then the probability of H on an even throw is P(B) = 1/2² + 1/2⁴ + 1/2⁶ + ... = (1/2²)/(1 − 1/2²) = 1/3.
Note that P(S) = 1/2 + 1/2² + 1/2³ + ... = (1/2)/(1 − 1/2) = 1, and the probability of H on an odd throw is 2/3.
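These geometric series can be checked numerically with a short Python sketch (the infinite sums are truncated at a large N, which is plenty for terms of size 1/2^n):

```python
# P(n) = 1/2**n; truncate the infinite sums at N terms.
N = 200

p_S = sum(1 / 2**n for n in range(1, N + 1))          # P(S), should be ~1
p_A = sum(1 / 2**n for n in (1, 2, 3))                 # P({1, 2, 3}) = 7/8
p_even = sum(1 / 2**n for n in range(2, N + 1, 2))     # P({2, 4, 6, ...}) = 1/3
p_odd = sum(1 / 2**n for n in range(1, N + 1, 2))      # P({1, 3, 5, ...}) = 2/3

print(p_S, p_A, p_even, p_odd)   # ~1.0  0.875  ~0.3333  ~0.6667
```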

Conditional Probability

Let A, E be events in S with P(E) ≠ 0.

Definition: The conditional probability of A given E is defined by P(A|E) = P(A ∩ E)/P(E).

Example: If S is a finite equiprobable space, then P(A|E) = P(A ∩ E)/P(E) = (|A ∩ E|/|S|)/(|E|/|S|) = |A ∩ E|/|E|.

Example: Pair of dice. S = {(1, 1), (1, 2), ..., (6, 6)}. Find the probability that one die shows 2 given that the sum is 6. Let A be the event that one die is 2 and E be the event that the sum is 6. Then P(A|E) = 2/5.

Note that, in general, P(A ∩ E) = P(E)P(A|E).

Example: Let A, B be events with P(A) = 0.6, P(B) = 0.3 and P(A ∩ B) = 0.2. Then
P(A|B) = 0.2/0.3, P(B|A) = 0.2/0.6, P(A ∪ B) = 0.6 + 0.3 − 0.2 = 0.7, P(Ac) = 1 − 0.6 = 0.4, P(Bc) = 1 − 0.3 = 0.7, P(Ac ∩ Bc) = P((A ∪ B)c) = 1 − 0.7 = 0.3, P(Ac|Bc) = 0.3/0.7, P(Bc|Ac) = 0.3/0.4.

Example: The probability that a certain flight departs on time is 0.8 and the probability that it arrives on time is 0.9. The

probability that it both departs and arrives on time is 0.78. Find the probability that
(i) it arrives on time given that it departed on time,
(ii) it does not arrive on time given that it did not depart on time.
The sample space may be taken as {(D, A), (D, Ac), (Dc, A), (Dc, Ac)}. Let B = {(D, A), (D, Ac)} and let C = {(D, A), (Dc, A)}. Then P(B) = 0.8, P(C) = 0.9 and
(i) P(C|B) = 0.78/0.8 = 0.975,
(ii) P(Cc|Bc) = P(Cc ∩ Bc)/P(Bc) = P((C ∪ B)c)/P(Bc) = (1 − P(C ∪ B))/(1 − P(B)) = (1 − (0.8 + 0.9 − 0.78))/(1 − 0.8) = 0.08/0.2 = 0.4.

Suppose that S = A1 ∪ A2 ∪ ... ∪ An, where Ai ∩ Aj = ∅ for all i ≠ j. We say that the Ai are mutually exclusive and form a partition of S. Let E ⊆ S. Then E = E ∩ S = E ∩ (A1 ∪ A2 ∪ ... ∪ An) = (E ∩ A1) ∪ (E ∩ A2) ∪ ... ∪ (E ∩ An), disjoint, so
P(E) = P(E ∩ A1) + P(E ∩ A2) + ... + P(E ∩ An) = Σ_{i=1}^{n} P(E ∩ Ai).
Now P(E ∩ Ai) = P(Ai)P(E|Ai) for each i, so
P(E) = Σ_{i=1}^{n} P(Ai)P(E|Ai). This is called the Law of Total Probability. We also have P(Aj|E) = P(Aj ∩ E)/P(E) = P(E ∩ Aj)/P(E)

= P(Aj)P(E|Aj)/P(E) = P(Aj)P(E|Aj) / Σ_{i=1}^{n} P(Ai)P(E|Ai) for all j. This is known as Bayes' Formula or Theorem. We use it if we know all the P(E|Ai).

Example: Three machines X, Y, Z produce items. X produces 50%, 3% of which are defective. Y produces 30%, 4% of which are defective and Z produces 20%, 5% of which are defective. Let D be the event that an item is defective. Let an item be chosen at random.
(i) Find the probability that it is defective.
(ii) Given that it is defective, find the probability that it came from machine X.
Let A1 be the event consisting of elements of X, let A2 be the event consisting of elements of Y and let A3 be the event consisting of elements of Z. Then P(A1) = 0.5, P(A2) = 0.3 and P(A3) = 0.2. Also P(D|A1) = 0.03, P(D|A2) = 0.04 and P(D|A3) = 0.05.
(i) P(D) = (0.5)(0.03) + (0.3)(0.04) + (0.2)(0.05) = 0.037 = 3.7%.
(ii) P(A1|D) = P(A1)P(D|A1)/P(D) = (0.5)(0.03)/0.037 = 0.405 = 40.5%.
We often use a "tree diagram" to organise such calculations.
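The machines example can also be written as a few lines of Python; the dictionary keys and variable names below are just illustrative:

```python
# Law of Total Probability and Bayes' formula for the three machines.
prior = {'X': 0.5, 'Y': 0.3, 'Z': 0.2}            # P(A1), P(A2), P(A3)
p_defective = {'X': 0.03, 'Y': 0.04, 'Z': 0.05}   # P(D | Ai)

# (i) P(D) = sum over i of P(Ai) * P(D | Ai)
p_D = sum(prior[m] * p_defective[m] for m in prior)

# (ii) P(Ai | D) = P(Ai) * P(D | Ai) / P(D)
posterior = {m: prior[m] * p_defective[m] / p_D for m in prior}

print(p_D)             # 0.037
print(posterior['X'])  # ~0.405
```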

Example: A hospital has 300 nurses. During the past year 48 of the nurses got a pay rise. At the beginning of the year the hospital offered a training seminar which was attended by 138 of the nurses. 27 of the nurses who got a pay rise attended the seminar. What is the probability that a nurse who got a pay rise attended the seminar?
Let A be the event consisting of nurses who attended the seminar and let B be the event consisting of nurses who got a pay rise. Then P(A) = 138/300 and P(Ac) = 162/300. Also P(B|A) = 27/138, P(Bc|A) = 111/138, P(B|Ac) = 21/162 and P(Bc|Ac) = 141/162.
Therefore P(B) = (138/300)(27/138) + (162/300)(21/162) = 48/300, which is obvious from the beginning. Also P(A|B) = P(A)P(B|A)/P(B) = (138/300)(27/138)/(48/300) = 27/48.

Exercise: In a certain city 40% vote Conservative, 35% vote Liberal and 25% vote Independent. During an election 45% of Conservatives, 40% of Liberals and 60% of Independents voted. A person is selected at random. Find the probability that the person voted. If the person voted, find the probability that the voter is (i) Conservative, (ii) Liberal, (iii) Independent.

Independent Events

Definition: Two events A, B ⊆ S are independent if P(A ∩ B) = P(A)P(B).

If A and B are independent then P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A), i.e. the conditional probability of A given B is the same as the probability of A. The converse is obviously true.

Note: A, B are mutually exclusive if and only if A ∩ B = ∅, in which case P(A ∩ B) = 0; hence mutually exclusive events are not independent unless either P(A) = 0 or P(B) = 0.

Example: Pick a card. Let A be the event consisting of hearts and let B be the event consisting of face cards. Then P(A) = 13/52, P(B) = 12/52 and P(A ∩ B) = 3/52. Hence P(A ∩ B) = P(A)P(B) and so A and B are independent events.

Example: Toss a fair coin three times.
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
Let A be the event where the first toss is heads, B be the event where the second toss is heads and let C be the event where there are exactly two heads in a row. Then P(A) = 4/8, P(B) =

4/8, P(C) = 2/8, P(A ∩ B) = 2/8, P(A ∩ C) = 1/8, P(B ∩ C) = 2/8. Hence A, B are independent, A, C are independent but B, C are not independent.

Example: The probability that A hits a target is 1/4. The probability that B hits the target is 2/5. Assume that A and B are independent. What is the probability that either A or B hits the target?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − P(A)P(B) = 1/4 + 2/5 − 1/4 × 2/5 = 11/20.

Exercise: Show that A, B independent ⇐⇒ A, Bc independent, and hence Ac, Bc independent.

Product Probability and Independent Trials

Let S = {a1, a2, ..., as} and T = {b1, b2, ..., bt} be the sample spaces for two experiments. Let PS(ai) = pi and PT(bj) = qj for all 1 ≤ i ≤ s and 1 ≤ j ≤ t, where PS and PT are the probability functions on S and T respectively.
Let S × T = {(ai, bj) | ai ∈ S, bj ∈ T}. Define a function P on P(S × T) by P((ai, bj)) = pi qj and addition. Then P is a probability function on S × T:
(i) pi qj ≥ 0 for all i, j.
(ii) P(S × T) = p1q1 + p1q2 + ... + p1qt + ... + psq1 + psq2 + ... + psqt = p1(q1 + ... + qt) + ... + ps(q1 + ... + qt) = p1 + ... + ps = 1.
(iii) Obvious by addition.
P above is called the product probability on S × T. It is not the only probability function on S × T. We can extend this definition to the product of any finite number of sample spaces.
Suppose that S × T has the product probability P. Let A = {ai} × T and B = S × {bj}. Then P(A) = P({ai} × T) = PS(ai) × PT(T) = pi × 1 = pi and P(B) = P(S × {bj}) = PS(S) × PT(bj) = qj. Now A ∩ B = {(ai, bj)}, so P(A ∩ B) = P({(ai, bj)}) = pi qj = P(A)P(B) and hence A and B are independent. Similarly, any two events of the form C × T and S × D are independent, where C ⊆ S and D ⊆ T.
Conversely, suppose that P is a probability function on S × T such that P({ai} × T) = pi and P(S × {bj}) = qj for all ai ∈ S and bj ∈ T, and all sets of this form are independent. Then P({(ai, bj)}) = P(({ai} × T) ∩ (S × {bj})) = P({ai} × T) × P(S × {bj}) = pi qj, so that P must be the product probability.
We deduce that the product probability is the unique probability on S × T with these two "independence properties".

Example: When three horses A, B, C race against each other their respective probabilities of winning are always 1/2, 1/3 and 1/6. Suppose they race twice. Then, assuming independence, the probability of C winning the first race and A winning the second race is 1/6 × 1/2 = 1/12, etc.

Now suppose that we perform the same experiment a number of times. The sample space S × S × ... × S consists of tuples. If we assume that the experiments are independent, then the probability function on this sample space is the product probability. If we do it n times we say that we have n independent trials.

Example: Toss a coin three times as before. Now e.g. P(HTH) = 1/2 × 1/2 × 1/2 = 1/8, etc. This is the same probability function as assuming that all triples are equiprobable, as before. Hence we can consider the problem in either of the two ways.

Counting Techniques

Suppose we have n objects. How many permutations of size 1 ≤ r < n can be made? Using the Fundamental Principle of Counting gives n!/(n − r)!, which is written nPr.
How many combinations of size 1 ≤ r < n can be made? Let the answer be nCr. Each of these combinations gives r! permutations, so nCr × r! = nPr. Hence nCr = nPr/r! = n!/((n − r)! r!).
If r = 0 or r = n the answer is 1, so if we agree that 0! = 1, we can use the formula for all 0 ≤ r ≤ n.
A lot of problems in finite probability can be done from "first principles", i.e. using "boxes".

Example: The birthday problem: How many people do we need to ensure that the probability of at least two having the same birthday is greater than 1/2?
Let the answer be n. Then 1 − 365Pn/(365)^n > 1/2, so n = 23. For more on perms. and combs. see Schaum Series.
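A short Python sketch of the birthday calculation (math.perm gives the falling factorial 365Pn):

```python
import math

def p_shared_birthday(n):
    """1 - 365Pn / 365^n: probability that at least two of n people share a birthday."""
    return 1 - math.perm(365, n) / 365**n

n = 1
while p_shared_birthday(n) <= 0.5:
    n += 1
print(n, p_shared_birthday(n))   # 23 and ~0.507
```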

Random Variables

Definition: Let S be a sample space with probability function P. A random variable (R.V.) is a function X : S −→ R. The image or range of X is denoted by RX. If RX is finite we say that X is a finite or finitely discrete R.V., if RX is countably infinite we say that X is a countably infinite or infinitely discrete R.V. and if RX is uncountable we say that X is an uncountable or continuous R.V.

Example: (i) Throw a pair of dice. S = {(1, 1), ..., (6, 6)}. Let X : S −→ R be the maximum number of each pair and let Y : S −→ R be the sum of the two numbers. Then RX = {1, 2, 3, 4, 5, 6} and RY = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. X and Y are finite R.V.s.
(ii) Toss a coin until H appears. S = {H, TH, TTH, TTTH, ...}. Let X : S −→ R be the number of times the coin is tossed, or (the number of Ts) + 1. Then RX = {1, 2, 3, ..., ∞}. X is a countably infinite R.V.
(iii) A point is chosen on a disc of radius 1. S = {(x, y) | x² + y² ≤ 1}. Let X : S −→ R be the distance of the point from the centre (0, 0). Then RX = [0, 1] and X is a continuous R.V.

Finite Random Variables

Let X : S −→ RX be finite with RX = {x1, x2, ..., xn} say. X induces a function f on RX by f(xk) = P(X = xk) = P({s ∈ S | X(s) = xk}). f is called a probability distribution function (p.d.f.). Note that f(xk) ≥ 0 and Σ_{k=1}^{n} f(xk) = 1. We can extend f to all of R by defining f(x) = 0 for all x ≠ x1, x2, ..., xn. We often write f using a table, with the values xk in one row and the probabilities f(xk) below them.

Recall the idea of a discrete frequency distribution: if we let f(xk) = fk / Σ_{i=1}^{n} fi, the relative frequency, we get a probability distribution. Its graph is a bar chart.

Note: We are usually interested in the distribution of a particular R.V. rather than the underlying sample space, e.g.

the heights or ages of a given set of people.

Example: X, Y from example (i) above:

x:    1     2     3     4     5     6
f(x): 1/36  3/36  5/36  7/36  9/36  11/36

y:    2     3     4     5     6     7     8     9     10    11    12
g(y): 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Exercise: A fair coin is tossed three times. Let X be the number of heads. RX = {0, 1, 2, 3}. Draw the distribution table for X.

Definition: Let f, given by such a table, be the probability distribution of a R.V. X. The expectation or mean of X is defined by
E(X) = x1 f(x1) + x2 f(x2) + ... + xn f(xn) = Σ_{k=1}^{n} xk f(xk).
This is the same definition as the mean in the case of a frequency distribution.

Example: X, Y from (i) above again:
E(X) = 1 × 1/36 + 2 × 3/36 + ... + 6 × 11/36 = 4.47,
E(Y) = 2 × 1/36 + 3 × 2/36 + ... + 12 × 1/36 = 7.
Note that E(X) need not belong to RX.
We write µX, or simply µ if there is no confusion, for E(X). E(X) is the weighted average where the weights are the probabilities. We can apply this definition to games of chance: a game of chance is an experiment with n outcomes a1, a2, ..., an and corresponding probabilities p1, p2, ..., pn. Suppose the payout for each ai is wi. We define a R.V. X by X(ai) = wi.

Then the average payout is E(X) = Σ_{i=1}^{n} wi pi. We would play the game if E(X) > 0.

Example: A fair die is thrown. If 2, 3 or 5 occurs we win that number of euros. If 1, 4 or 6 occurs we lose that number of euros. Should we play?
We have the distribution table

x:    2    3    5    −1   −4   −6
f(x): 1/6  1/6  1/6  1/6  1/6  1/6

Then E(X) = 2 × 1/6 + 3 × 1/6 + 5 × 1/6 − 1 × 1/6 − 4 × 1/6 − 6 × 1/6 = −1/6. Don't play!

Definition: X a finite R.V. The variance of X is defined by
var(X) = E((X − µ)²) = (x1 − µ)² f(x1) + (x2 − µ)² f(x2) + ... + (xn − µ)² f(xn) = Σ_{k=1}^{n} (xk − µ)² f(xk).
The standard deviation of X is defined to be σX = √var(X). If X is understood, we just write σ.

Note: var(X) = E((X − µ)²) = σ² = Σ_{k=1}^{n} (xk − µ)² f(xk) = Σ_{k=1}^{n} (xk² − 2xk µ + µ²) f(xk)

= Σ_{k=1}^{n} xk² f(xk) − 2µ Σ_{k=1}^{n} xk f(xk) + µ² Σ_{k=1}^{n} f(xk) = E(X²) − 2µ² + µ² = E(X²) − µ² = E(X²) − (E(X))².

Example: X, Y from (i) above again:
E(X²) = Σ_{k=1}^{6} xk² f(xk) = 1² × 1/36 + 2² × 3/36 + ... + 6² × 11/36 = 21.97.
Hence var(X) = E(X²) − (E(X))² = 21.97 − 20.00 = 1.97.
E(Y²) = Σ_{k=1}^{11} yk² g(yk) = 2² × 1/36 + 3² × 2/36 + ... + 12² × 1/36 = 54.8.
Hence var(Y) = E(Y²) − (E(Y))² = 54.8 − 49 = 5.8.
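These expectations and variances can be recomputed by enumerating the 36 outcomes; a small Python sketch:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs
p = Fraction(1, len(outcomes))

def mean_var(values):
    mean = sum(v * p for v in values)
    var = sum(v * v * p for v in values) - mean**2   # E(X^2) - (E(X))^2
    return mean, var

EX, varX = mean_var([max(a, b) for a, b in outcomes])   # X = max
EY, varY = mean_var([a + b for a, b in outcomes])       # Y = sum

print(float(EX), float(varX))   # 4.472... and 1.971...
print(float(EY), float(varY))   # 7.0 and 5.833...
```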

Theorem: If X is a finite R.V. and a, b ∈ R, then
(i) E(aX) = aE(X),
(ii) E(X + b) = E(X) + b.
Proof: (i) E(aX) = Σ_{k=1}^{n} a xk f(xk) = a Σ_{k=1}^{n} xk f(xk) = aE(X).
(ii) E(X + b) = Σ_{k=1}^{n} (xk + b) f(xk) = Σ_{k=1}^{n} xk f(xk) + b Σ_{k=1}^{n} f(xk) = E(X) + b.
Hence E(aX + b) = E(aX) + b = aE(X) + b.

Theorem: If X is a finite R.V. and a, b ∈ R, then
(i) var(aX) = a² var(X),
(ii) var(X + b) = var(X).

Proof: (i) var(aX) = E((aX)²) − (E(aX))² = E(a²X²) − (aE(X))² = a²E(X²) − a²(E(X))² = a²(E(X²) − (E(X))²) = a² var(X). We could also just use the definition of var(X).
(ii) var(X + b) = E((X + b)²) − (E(X + b))²
= Σ_{k=1}^{n} (xk + b)² f(xk) − (E(X) + b)²
= Σ_{k=1}^{n} (xk² + 2bxk + b²) f(xk) − (E(X) + b)² = Σ_{k=1}^{n} xk² f(xk) + 2b Σ_{k=1}^{n} xk f(xk) + b² Σ_{k=1}^{n} f(xk) − (E(X))² − 2bE(X) − b²
= E(X²) + 2bE(X) + b² − (E(X))² − 2bE(X) − b²
= E(X²) − (E(X))² = var(X).
Hence var(aX + b) = var(aX) = a² var(X) and σ_{aX+b} = |a| σX.

Definition: If X is a R.V. with mean µ and standard deviation σ, then the standardized R.V. associated with X is defined as Z = (X − µ)/σ.
E(Z) = E((X − µ)/σ) = (E(X) − µ)/σ = (µ − µ)/σ = 0 and var(Z) = (1/σ²) var(X) = σ²/σ² = 1.
So Z has mean 0 and standard deviation 1.

Definition: Let X be a finite R.V. with p.d.f. f(x). The function F : R −→ R defined by F(x) = P(X ≤ x) = Σ_{xk ≤ x} f(xk) is called the cumulative distribution function (c.d.f.) of X.

Example: Suppose X is defined by

x:    −2   1    2    4
f(x): 1/4  1/8  1/2  1/8

Then F(−2) = 1/4, F(1) = 3/8, F(2) = 7/8, F(4) = 1. F(x) is obvious for all other x.
F is a "step function". In general F is always an increasing function since f(xk) ≥ 0 for all xk.

Note: The c.d.f. is more important for continuous distributions (see later).

The Binomial Distribution

One of the most important finite distributions is the binomial distribution. Suppose we have an experiment with only two possible outcomes, called success and failure, i.e. S = {s, f}. Let P(s) = p and P(f) = q. Then p + q = 1. Each such experiment is called a Bernoulli trial. Suppose we repeat the experiment n times and assume that the trials are independent. The sample space for the n Bernoulli trials is S × S × ... × S and a typical element in the sample space looks like (a1, a2, ..., an), where each ai = s or f. Define a R.V. X on this sample space by X(a1, a2, ..., an) = the number of successes in (a1, a2, ..., an). Then RX = {0, 1, 2, ..., n}. The p.d.f. of X, f(x), is given by f(0) = q^n, f(1) = nC1 p q^(n−1), f(2) = nC2 p² q^(n−2), etc. In general f(k) = nCk p^k q^(n−k).
Then f(0) + f(1) + ... + f(n) = nC0 q^n + nC1 p q^(n−1) + nC2 p² q^(n−2) + ... + nCk p^k q^(n−k) + ... + nCn p^n = (p + q)^n = 1^n = 1, by the Binomial Theorem. X is said to have the binomial distribution, written as B(n, p).

Example: A fair coin is tossed six times. Success is defined to be heads. Here n = 6, p = 1/2, q = 1/2. Find the probability of getting
(i) exactly two heads,
(ii) at least four heads,
(iii) at least one head.
(i) f(2) = 6C2 (1/2)²(1/2)⁴ = 15/64.
(ii) f(4) + f(5) + f(6) = 6C4 (1/2)⁴(1/2)² + 6C5 (1/2)⁵(1/2)¹ + 6C6 (1/2)⁶ = 22/64.
(iii) 1 − f(0) = 1 − (1/2)⁶ = 63/64.

Example: The probability of hitting a target at any time is 1/3. If we take seven shots, what is the probability of
(i) exactly three hits?
(ii) at least one hit?
(i) f(3) = 7C3 (1/3)³(2/3)⁴ = 0.26.
(ii) 1 − (2/3)⁷ = 0.94.

Example: Find the number of dice that must be thrown such that there is a better than even chance of getting at least

one six.
Here p = 1/6, q = 5/6. Let n be the required number. Then 1 − (5/6)^n > 1/2, so (5/6)^n < 1/2, or n ln(5/6) < ln(1/2), i.e. n > ln(1/2)/ln(5/6). Hence n = 4.

Example: If 20% of the bolts produced by a machine are defective, find the probability that out of 10 bolts chosen at random,
(i) none
(ii) one
(iii) greater than two
bolts will be defective. Here n = 10, p = 0.2, q = 0.8.
(i) f(0) = (0.8)¹⁰,
(ii) f(1) = 10C1 (0.2)¹(0.8)⁹,
(iii) 1 − (f(0) + f(1) + f(2)) = 1 − ((0.8)¹⁰ + 10C1 (0.2)¹(0.8)⁹ + 10C2 (0.2)²(0.8)⁸).
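The binomial p.d.f. f(k) = nCk p^k q^(n−k) is easy to evaluate directly; a Python sketch covering the coin and bolt examples:

```python
from math import comb

def binom_pmf(k, n, p):
    """f(k) = nCk * p^k * (1-p)^(n-k) for B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, n = 6: exactly two heads, at least four heads, at least one head.
print(binom_pmf(2, 6, 0.5))                           # 15/64 ~ 0.2344
print(sum(binom_pmf(k, 6, 0.5) for k in (4, 5, 6)))   # 22/64 ~ 0.3438
print(1 - binom_pmf(0, 6, 0.5))                       # 63/64 ~ 0.9844

# Bolts, n = 10, p = 0.2: more than two defective.
print(1 - sum(binom_pmf(k, 10, 0.2) for k in (0, 1, 2)))   # ~0.3222
```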

Continuous Random Variables

Suppose that X is a R.V. on a sample space S whose range RX is an interval in R or all of R. Then X is called a continuous R.V. If there exists a piece-wise continuous function f : R −→ R such that P(a ≤ X ≤ b) = ∫_a^b f(x) dx for any a < b, where a, b ∈ R, then f is called the probability distribution function (p.d.f.) or density function for X.
f must satisfy f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1. Note that P(X = a) = P(a ≤ X ≤ a) = ∫_a^a f(x) dx = 0. We define E(X) = ∫_{−∞}^{∞} x f(x) dx and var(X) = E((X − µ)²) = ∫_{−∞}^{∞} (x − µ)² f(x) dx, where µ = E(X). As for the finite case, we can easily show that var(X) = E(X²) − (E(X))² = ∫_{−∞}^{∞} x² f(x) dx − µ². Again σ² = var(X).
The cumulative distribution function (c.d.f.) of X is defined by F(x) = ∫_{−∞}^{x} f(t) dt. Then P(a ≤ X ≤ b) = F(b) − F(a). Note that the Fundamental Theorem of Calculus implies that F′(x) = f(x) for all x.

Example: X is a R.V. with p.d.f.
f(x) = x/2 for 0 ≤ x < 2, and f(x) = 0 otherwise.
P(1 ≤ X ≤ 1.5) = ∫_1^1.5 (x/2) dx = 5/16. E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^2 (x²/2) dx = 4/3 and var(X) = E(X²) − (E(X))² = ∫_0^2 (x³/2) dx − (4/3)² = 2/9, so σ = √2/3.
If x < 0, then F(x) = ∫_{−∞}^{x} f(t) dt = 0.
If 0 ≤ x ≤ 2, then F(x) = ∫_{−∞}^{x} f(t) dt = ∫_0^x (t/2) dt = x²/4.
If 2 < x, then F(x) = ∫_{−∞}^{x} f(t) dt = ∫_0^2 (t/2) dt = 1, as we would expect!
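A rough numerical check of this example, approximating the integrals with a midpoint rule (a sketch; any numerical-integration routine would do):

```python
# f(x) = x/2 on [0, 2]; approximate the integrals with N midpoints.
N = 100_000
dx = 2 / N
xs = [(i + 0.5) * dx for i in range(N)]
f = lambda x: x / 2

total = sum(f(x) * dx for x in xs)                        # ~1
p_1_to_15 = sum(f(x) * dx for x in xs if 1 <= x <= 1.5)   # ~5/16 = 0.3125
mean = sum(x * f(x) * dx for x in xs)                     # ~4/3
var = sum(x * x * f(x) * dx for x in xs) - mean**2        # ~2/9

print(total, p_1_to_15, mean, var)
```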

Recall the idea of a grouped frequency distribution.

Example: The heights of 1,000 people, grouped into classes such as 62 − 63, where 62 − 63 means 62 ≤ h ≤ 63, etc. The relative frequencies are 50/1,000 = 0.05, etc., and Σ(rel. freqs.) = 1. The graph is now a histogram.
The area of each rectangle represents the relative frequency of that group. The proportion of people of height less than any number is the sum of the areas up to that number. Joining the midpoints of the tops of the rectangles gives a "bell-shaped" curve. For large populations the (relative) frequency function approximates to such a curve.
The most important continuous random variable is the normal distribution, whose p.d.f. has a "bell shape".

Definition: A R.V. is said to be normally distributed if its p.d.f. has the form f(x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²). We say that X is N(µ, σ²) if this f(x) is its p.d.f. Note that f(x) is symmetric

about x = µ and the bigger the σ the wider the graph of f(x) is.
The following theorem shows that this f(x) gives a well-defined p.d.f.

Theorem: (i) ∫_{−∞}^{∞} e^(−x²/2) dx = √(2π).
(ii) ∫_{−∞}^{∞} e^(−(1/2)((x−µ)/σ)²) dx = √(2π) σ.
(iii) If X is N(µ, σ²), then E(X) = µ and var(X) = σ².
Proof: See tutorial 3.

Suppose X is N(µ, σ²). Its c.d.f. is F(x) = (1/(σ√(2π))) ∫_{−∞}^{x} e^(−(1/2)((v−µ)/σ)²) dv and P(a ≤ X ≤ b) = F(b) − F(a) = (1/(σ√(2π))) ∫_a^b e^(−(1/2)((v−µ)/σ)²) dv.
These integrals can't be found analytically, so they are tabulated numerically. This would have to be done for all values of µ and σ. We do so for the case µ = 0 and σ = 1 and then use the standardized normal R.V. Z = (X − µ)/σ.
For µ = 0, σ = 1 we write Z for the R.V. and z for the real variable. We denote the p.d.f. of Z by φ(z) = (1/√(2π)) e^(−z²/2) and its c.d.f. by Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^(−u²/2) du.
Consider F(x) = (1/(σ√(2π))) ∫_{−∞}^{x} e^(−(1/2)((v−µ)/σ)²) dv, the c.d.f. of X. Let u = (v − µ)/σ; then dv = σ du; as v −→ −∞, u −→ −∞, and when v = x, u = (x − µ)/σ.
Therefore F(x) = (1/√(2π)) ∫_{−∞}^{(x−µ)/σ} e^(−u²/2) du = Φ((x − µ)/σ).
Hence P(a ≤ X ≤ b) = F(b) − F(a) = Φ((b − µ)/σ) − Φ((a − µ)/σ).
From the tables we have:
P(µ − σ ≤ X ≤ µ + σ) = Φ(1) − Φ(−1) = P(−1 ≤ Z ≤ 1) = 0.682,
P(µ − 2σ ≤ X ≤ µ + 2σ) = Φ(2) − Φ(−2) = P(−2 ≤ Z ≤ 2) = 0.954, and
P(µ − 3σ ≤ X ≤ µ + 3σ) = Φ(3) − Φ(−3) = P(−3 ≤ Z ≤ 3) = 0.997, etc.

The tables give values of Φ(z) for 0 ≤ z ≤ 3, in steps of 0.01. Φ(z) = P(−∞ < Z ≤ z) and Φ(0) = P(−∞ < Z ≤ 0) = 0.5. Then Φ(z) = 1 − Φ(−z) for z < 0.
Suppose that a < b.
(i) 0 ≤ a < b. P(a ≤ Z ≤ b) = Φ(b) − Φ(a).
(ii) a < 0 < b. P(a ≤ Z ≤ b) = Φ(b) − Φ(a) = Φ(b) − (1 − Φ(−a)) = Φ(b) + Φ(−a) − 1.
(iii) a < b < 0. P(a ≤ Z ≤ b) = Φ(b) − Φ(a) = 1 − Φ(−b) − (1 − Φ(−a)) = Φ(−a) − Φ(−b).

Example: For N(0, 1) find
(i) P(Z ≤ 2.44)
(ii) P(Z ≤ −1.16)
(iii) P(Z ≥ 1)
(iv) P(2 ≤ Z ≤ 10)
(i) P(Z ≤ 2.44) = Φ(2.44) = 0.9927.
(ii) P(Z ≤ −1.16) = 1 − P(Z ≤ 1.16) = 1 − Φ(1.16) = 1 − 0.877 = 0.123.
(iii) P(Z ≥ 1) = 1 − P(Z < 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.
(iv) P(2 ≤ Z ≤ 10) = Φ(10) − Φ(2) = 1 − 0.9772 = 0.0228.

Example: For N(0, 1) find c if
(i) P(Z ≥ c) = 10%
(ii) P(Z ≤ c) = 5%
(iii) P(0 ≤ Z ≤ c) = 45%
(iv) P(−c ≤ Z ≤ c) = 99%.
(i) P(Z ≥ c) = 1 − P(Z ≤ c) = 1 − Φ(c), so 1 − Φ(c) = 0.1 or Φ(c) = 0.9, giving c = 1.282.
(ii) P(Z ≤ c) = Φ(c), so Φ(c) = 0.05, giving c = −1.645.

(iii) P(0 ≤ Z ≤ c) = Φ(c) − 0.5, so Φ(c) = 0.95, giving c = 1.645.
(iv) P(−c ≤ Z ≤ c) = Φ(c) − Φ(−c) = 2Φ(c) − 1, so 2Φ(c) = 1.99 or Φ(c) = 0.995, giving c = 2.576.

Example: For X = N(0.8, 4) find
(i) P(X ≤ 2.44)
(ii) P(X ≤ −1.16)
(iii) P(X ≥ 1)
(iv) P(2 ≤ X ≤ 10)
(i) P(X ≤ 2.44) = F(2.44) = Φ((2.44 − 0.8)/2) = Φ(0.82) = 0.7939.
(ii) P(X ≤ −1.16) = F(−1.16) = Φ((−1.16 − 0.8)/2) = Φ(−0.98) = 0.1635.
(iii) P(X ≥ 1) = 1 − P(X ≤ 1) = 1 − F(1) = 1 − Φ((1 − 0.8)/2) = 1 − Φ(0.1) = 0.4602.
(iv) P(2 ≤ X ≤ 10) = F(10) − F(2) = Φ((10 − 0.8)/2) − Φ((2 − 0.8)/2) = Φ(4.6) − Φ(0.6) = 0.2743.
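Instead of tables, Φ can be evaluated with the error function from Python's standard library, since Φ(z) = (1 + erf(z/√2))/2; a sketch reproducing some of the values above:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def F(x, mu, sigma):
    """c.d.f. of N(mu, sigma^2): F(x) = Phi((x - mu) / sigma)."""
    return Phi((x - mu) / sigma)

print(Phi(2.44))                      # ~0.9927
print(Phi(-1.16))                     # ~0.1230
print(1 - Phi(1))                     # ~0.1587
print(F(2.44, 0.8, 2))                # ~0.7939
print(F(10, 0.8, 2) - F(2, 0.8, 2))   # ~0.2743
```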

Example: Assume that the distance an athlete throws a shotput is a normal R.V. X with mean 17 m and standard deviation 2 m. Find
(i) the probability that the athlete throws a distance greater than 18.5 m,
(ii) the distance d that the throw will exceed with 95% probability.
(i) P(X > 18.5) = 1 − P(X ≤ 18.5) = 1 − F(18.5) = 1 − Φ((18.5 − 17)/2) = 1 − Φ(0.75) = 0.2266.
(ii) P(X > d) = 0.95, so 1 − F(d) = 0.95 or F(d) = 0.05. Hence Φ((d − 17)/2) = 0.05, so (d − 17)/2 = −1.645, giving d = 13.71 m.

Example: The average life of a stove is 15 years with standard deviation 2.5 years. Assuming that the lifetime X of the stoves is normally distributed, find
(i) the percentage of stoves that will last only 10 years or less,
(ii) the percentage of stoves that will last between 16 and 20 years.
(i) P(X ≤ 10) = F(10) = Φ((10 − 15)/2.5) = Φ(−2) = 1 − Φ(2) = 0.0228 = 2.28%.
(ii) P(16 ≤ X ≤ 20) = F(20) − F(16) = Φ((20 − 15)/2.5) − Φ((16 − 15)/2.5) = Φ(2) − Φ(0.4) = 0.3218.

Jointly Distributed Random Variables

Let X, Y be finite R.V.s on the same sample space S with probability function P. Let the range of X be RX = {x1, x2, ..., xn} and the range of Y be RY = {y1, y2, ..., ym} respectively.
Consider the pair (X, Y) defined on S by (X, Y)(s) = (X(s), Y(s)). Then (X, Y) is a R.V. on S with range ⊆ RX × RY = {(x1, y1), ..., (x1, ym), (x2, y1), ..., (x2, ym), ..., (xn, y1), ..., (xn, ym)}. We sometimes call (X, Y) a vector R.V.
Let Ai = {s ∈ S | X(s) = xi} = {X = xi} and Bj = {s ∈ S | Y(s) = yj} = {Y = yj}. We write Ai ∩ Bj = {X = xi, Y = yj}. Define a function h : RX × RY −→ R by h(xi, yj) = P(Ai ∩ Bj) = P(X = xi, Y = yj). Then h(xi, yj) ≥ 0 and Σ_{i,j} h(xi, yj) = 1, since the Ai ∩ Bj form a partition of S. h is called the joint probability distribution function of (X, Y) associated with the probability function P, and X, Y are said to be jointly distributed. Suppose that f and g are the p.d.f.s of X and Y respectively. What is the connection between f, g and h?

S = ∪_{j=1}^{m} Bj, so Ai = Ai ∩ S = Ai ∩ (∪_{j=1}^{m} Bj) = ∪_{j=1}^{m} (Ai ∩ Bj), disjoint. Therefore f(xi) = P(Ai) = P(∪_{j=1}^{m} (Ai ∩ Bj)) = Σ_{j=1}^{m} P(Ai ∩ Bj) = Σ_{j=1}^{m} h(xi, yj). Similarly g(yj) = Σ_{i=1}^{n} h(xi, yj). f and g are sometimes called the marginal distributions of h. We often write the joint distribution in a table, with the values of X labelling the rows, the values of Y labelling the columns and h(xi, yj) in the body of the table.

Example: Throw a pair of dice. Let X(a, b) = max{a, b} and Y(a, b) = a + b. (The full table of h(x, y) values is found by counting the 36 outcomes; it is omitted here.)

Definition: X and Y are independent if h(xi, yj) = f(xi)g(yj) for all i, j.
This means that P(X = xi, Y = yj) = P(X = xi)P(Y = yj), or P(Ai ∩ Bj) = P(Ai)P(Bj), for all i, j.
Note that in the above example X and Y are not independent.

Definition: If G : R² −→ R, then we define a R.V. G(X, Y) on S by G(X, Y)(s) = G(X(s), Y(s)) = G(xi, yj) with p.d.f. h.
We now define the expectation and variance of G(X, Y) as
E(G(X, Y)) = Σ_{i,j} G(xi, yj) h(xi, yj) and
var(G(X, Y)) = E(G(X, Y)²) − (E(G(X, Y)))².

Example: G(x, y) = x + y. Then (X + Y)(s) = X(s) + Y(s) = xi + yj and E(X + Y) = Σ_{i,j} (xi + yj) h(xi, yj), etc.

Theorem: (i) For any R.V.s X and Y we have E(X + Y) = E(X) + E(Y).
(ii) If X and Y are independent, then var(X + Y) = var(X) + var(Y). (This is not true in general.)
Proof: (i) E(X + Y) = Σ_{i,j} (xi + yj) h(xi, yj)

= Σ_i Σ_j xi h(xi, yj) + Σ_j Σ_i yj h(xi, yj)
= Σ_i xi Σ_j h(xi, yj) + Σ_j yj Σ_i h(xi, yj) = Σ_i xi f(xi) + Σ_j yj g(yj)
= E(X) + E(Y).
(ii) First we show that E(XY) = E(X)E(Y).
E(XY) = Σ_{i,j} xi yj h(xi, yj) = Σ_{i,j} xi yj f(xi) g(yj) = (Σ_i xi f(xi))(Σ_j yj g(yj)) = E(X)E(Y).
Now var(X + Y) = E((X + Y)²) − (E(X + Y))²
= E(X² + 2XY + Y²) − (E(X) + E(Y))² = E(X²) + 2E(X)E(Y) + E(Y²) − (E(X))² − (E(Y))² − 2E(X)E(Y)
= E(X²) − (E(X))² + E(Y²) − (E(Y))² = var(X) + var(Y).

Important Example: Consider the Binomial Distribution B(n, p). The sample space is S × S × ... × S, n times, where S = {s, f}. For 1 trial, n = 1, define the R.V. X by X(s) = 1 and X(f) = 0. Then E(X) = 1 × p + 0 × q = p and var(X) = E(X²) − (E(X))² = p − p² = p(1 − p) = pq.
For n trials define X1(a1, a2, ..., an) = X(a1), ..., Xn(a1, a2, ..., an) = X(an), so that Xi is 1 if s is in the i-th place and 0 if f is in the i-th place, for all 1 ≤ i ≤ n.

Then E(Xi) = E(X) = p and var(Xi) = var(X) = pq.
Now let Y = X1 + X2 + ... + Xn, so that Y gives the total number of successes in the n trials. Then E(Y) = E(X1) + E(X2) + ... + E(Xn) = p + p + ... + p = np and var(Y) = var(X1) + var(X2) + ... + var(Xn) = npq.
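The dice example above (X = max, Y = sum) can be tabulated by enumeration; the following Python sketch builds h and its marginals and shows one case of the failure of independence:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)
h = Counter()                       # joint distribution h(x, y)
for a, b in product(range(1, 7), repeat=2):
    h[(max(a, b), a + b)] += p

f = Counter()                       # marginal of X
g = Counter()                       # marginal of Y
for (x, y), pr in h.items():
    f[x] += pr
    g[y] += pr

print(sum(h.values()))                      # 1
print(h[(3, 5)], f[3] * g[5])               # 1/18 vs 5/324: X, Y not independent
print(sum(x * pr for x, pr in f.items()))   # E(X) = 161/36 ~ 4.47
```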

Sampling Theory

Suppose that we have an infinite or very large finite sample space S. This sample space is often called a population. Getting information about the total population may be difficult, so we consider much smaller subsets of the population, called samples. We want to get information about the population by studying the samples. We consider the samples to be random samples, i.e. each element of the population has the same probability of being in a sample.

Example: Consider the population of Ireland. Pick a person at random and consider the age of this person. Do this n times. This gives a random sample of size n of the ages of people in Ireland.

Mathematically the situation is described in the following way: Let X be a random variable on a sample space S with probability function P and let f(x) be the probability distribution function of X. Consider the sample space Ω = S × S × ... × S (n times) with the product probability function P, i.e.

P(A1 × A2 × ... × An) = P(A1)P(A2)...P(An).
For each 1 ≤ i ≤ n define a random variable Xi on Ω by Xi(s1, s2, ..., sn) = X(si), where (s1, s2, ..., sn) ∈ Ω. Then the probability distribution function of Xi is also f(x) for each i. The vector random variable (X1, X2, ..., Xn) defined by (X1, X2, ..., Xn)(s1, s2, ..., sn) = (X1(s1), X2(s2), ..., Xn(sn)) = (x1, x2, ..., xn) is a random variable on Ω with joint distribution P(X1 = x1, X2 = x2, ..., Xn = xn) = f(x1)f(x2)...f(xn).
Choosing a sample is simply applying the vector random variable (X1, X2, ..., Xn) to Ω to get a random sample (x1, x2, ..., xn). Each Xi has the same mean µ and variance σ² as X and they are independent, by definition. They are called independent identically distributed random variables (i.i.d.). Functions of the X1, X2, ..., Xn and numbers associated with them are called statistics, while functions of the original X and associated numbers are called parameters. Our task is to get information about the parameters by studying the statistics. The mean µ and variance σ² of X are called the population

mean and variance. We define two important statistics, the sample mean and sample variance:

Definition: Sample mean X̄ = (X1 + X2 + ... + Xn)/n,
sample variance S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1).
Then X̄(s1, s2, ..., sn) = (x1 + x2 + ... + xn)/n = x̄ = µS and S²(s1, s2, ..., sn) = Σ_{i=1}^{n} (xi − x̄)²/(n − 1).
We have

Theorem: (i) Expectation of X̄ = E(X̄) = µ, the population mean,
(ii) Variance of X̄ = σ²_X̄ = σ²/n, the population variance over n.
Proof: (i) E(X̄) = E((X1 + X2 + ... + Xn)/n) = (E(X1) + E(X2) + ... + E(Xn))/n = (µ + µ + ... + µ)/n = nµ/n = µ.
(ii) σ²_X̄ = var(X̄) = var((X1 + X2 + ... + Xn)/n) = var(X1/n) + var(X2/n) + ... + var(Xn/n) = var(X1)/n² + var(X2)/n² + ... + var(Xn)/n² = nσ²/n² = σ²/n.
The reason for the n − 1 instead of n in the definition of S² is given by the following result.

Theorem: E(S²) = σ², the population variance.
Proof: E(S²) = E(Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)) = (1/(n − 1)) E(Σ_{i=1}^{n} (Xi − X̄)²) = (1/(n − 1)) E(Σ_{i=1}^{n} (Xi² − 2Xi X̄ + X̄²))
= (1/(n − 1))[Σ_{i=1}^{n} E(Xi²) − 2E((Σ_{i=1}^{n} Xi) X̄) + nE(X̄²)]
= (1/(n − 1))[Σ_{i=1}^{n} E(Xi²) − 2E((nX̄)(X̄)) + nE(X̄²)]
= (1/(n − 1))[n(σ² + µ²) − 2nE(X̄²) + nE(X̄²)]
= (1/(n − 1))[n(σ² + µ²) − nE(X̄²)] = (1/(n − 1))[n(σ² + µ²) − n(σ²/n + µ²)]
= (1/(n − 1))[(n − 1)σ²] = σ².

Note: If the mean or expectation of a statistic is equal to the corresponding parameter, the statistic is called an unbiased estimator of the parameter. Hence X̄ and S² are unbiased estimators of µ and σ² respectively. An estimate of a population parameter given by a single number is called a point estimate, e.g. if we take a sample of size n and calculate µS = (x1 + x2 + ... + xn)/n and S² = Σ_{i=1}^{n} (xi − x̄)²/(n − 1), then these are unbiased point estimates of µ and σ² respectively. We shall, however, concentrate on interval estimates, where the parameter lies within some interval, called a confidence interval.
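A small simulation illustrating that X̄ and S² (with the n − 1 divisor) are unbiased; a sketch assuming a normal population with µ = 10 and σ = 2:

```python
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 10.0, 2.0, 5, 50_000

means, variances = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))
    variances.append(statistics.variance(sample))   # sample variance, n - 1 divisor

print(statistics.mean(means))       # close to mu = 10
print(statistics.mean(variances))   # close to sigma^2 = 4
```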

Confidence Intervals for µ

Suppose we have n i.i.d. random variables X1, X2, ..., Xn with E(Xi) = µ and var(Xi) = σ² for each 1 ≤ i ≤ n. Then if X̄ = (X1 + X2 + ... + Xn)/n, we get E(X̄) = µ and var(X̄) = σ²/n.
As before, X1, X2, ..., Xn are jointly distributed random variables defined on the product sample space. We have the very important result:

Central Limit Theorem: For large n (≥ 30) the probability distribution of X̄ is approximately normal with mean µ and variance σ²/n, i.e. N(µ, σ²/n), or, in other words, (X̄ − µ)/(σ/√n) is approximately N(0, 1). The larger the n the better the approximation.

Note: X1, X2, ..., Xn or X need not be normal. If, however, they are normal, then (X̄ − µ)/(σ/√n) is exactly N(0, 1) for all values of n.

Recall N(0, 1).

If we want P(−z1 ≤ Z ≤ z1) = 95%, then Φ(z1) = 97.5%, so z1 = 1.96. We say that −1.96 ≤ Z ≤ 1.96 is a 95% confidence interval for N(0, 1).
Hence P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96) = 95% ⇐⇒
P(−1.96 × σ/√n ≤ X̄ − µ ≤ 1.96 × σ/√n) = 95% ⇐⇒
P(−1.96 × σ/√n − X̄ ≤ −µ ≤ 1.96 × σ/√n − X̄) = 95% ⇐⇒
P(1.96 × σ/√n + X̄ ≥ µ ≥ −1.96 × σ/√n + X̄) = 95% ⇐⇒
P(X̄ − 1.96 × σ/√n ≤ µ ≤ X̄ + 1.96 × σ/√n) = 95%.
If we know σ this gives us a 95% confidence interval for µ, i.e. given any random sample there is a 95% probability that µ lies within the above interval, or we can say with 95% confidence that µ is between the two limits of the interval. Put another way, 95% of samples will have µ in the above interval.

Example: A sample of size 100 is taken from a population with unknown mean µ and variance 9. Determine a 95% confidence interval for µ if the sample mean is 5.
Here X̄ = 5, σ = 3 and n = 100.
P(X̄ − 1.96 × σ/√n ≤ µ ≤ X̄ + 1.96 × σ/√n) = 95% ⇐⇒

P(5 − 1.96 × 3/10 ≤ µ ≤ 5 + 1.96 × 3/10) = 95% ⇐⇒ P(4.412 ≤ µ ≤ 5.588) = 95%.

Example: A sample of size 80 is taken from the workers in a very large company. The average wage of the sample of workers is 25,000 euro. If the standard deviation of the whole company is 1,000 euro, construct a confidence interval for the mean wage in the company at the 95% level.
Here X̄ = 25,000, σ = 1,000 and n = 80.
P(X̄ − 1.96 × σ/√n ≤ µ ≤ X̄ + 1.96 × σ/√n) = 95% ⇐⇒
P(25,000 − 1.96 × 1,000/√80 ≤ µ ≤ 25,000 + 1.96 × 1,000/√80) = 95% ⇐⇒ P(24,781 ≤ µ ≤ 25,219) = 95%.

We can have different confidence intervals: Let α be a small percentage (5% above). Then
P(X̄ − z_{α/2} × σ/√n ≤ µ ≤ X̄ + z_{α/2} × σ/√n) = 1 − α
gives a 1 − α confidence interval, where Φ(z_{α/2}) = 1 − α/2.
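The interval X̄ ± z_{α/2} σ/√n is a one-line computation; a Python sketch reproducing the examples above:

```python
from math import sqrt

def z_interval(xbar, sigma, n, z=1.96):
    """Confidence interval xbar ± z * sigma / sqrt(n) (z = 1.96 for 95%, 2.575 for 99%)."""
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

print(z_interval(5, 3, 100))           # (4.412, 5.588)
print(z_interval(25_000, 1_000, 80))   # ~(24781, 25219)
```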

Note: For a 95% interval we have Φ(z_{α/2}) = 0.975, so z_{α/2} = 1.96, and for a 99% interval Φ(z_{α/2}) = 0.995, so z_{α/2} = 2.575.

Example: Determine a 99% confidence interval for the mean of a normal population if the population variance is σ² = 4.84, using the sample 28, 24, 31, 27, 22.
(Note that we need normality here since the sample size < 30.)
X̄ = (28 + 24 + 31 + 27 + 22)/5 = 26.4 and σ = √4.84 = 2.2. Then
P(26.4 − 2.575 × 2.2/√5 ≤ µ ≤ 26.4 + 2.575 × 2.2/√5) = 0.99 ⇐⇒ P(23.867 ≤ µ ≤ 28.933) = 0.99.

Example: If we have a normally distributed population with σ² = 9, how large must a sample be if the 95% confidence interval has length at most 0.4?
In general the length of the confidence interval is
(X̄ + z_{α/2} × σ/√n) − (X̄ − z_{α/2} × σ/√n) = 2 z_{α/2} × σ/√n.
For the 95% confidence interval z_{α/2} = 1.96, so 2 × 1.96 × 3/√n ≤ 0.4, which gives √n ≥ (2 × 1.96 × 3)/0.4 = 29.4, or n = 865.

In all the previous examples we knew σ², the population variance. If that is not so and n ≥ 30 we can use S² as a point

estimate for σ² and assume that (X̄ − µ)/(S/√n) is approximately N(0, 1).

Example: A watch-making company wants to investigate the average life µ of its watches. In a random sample of 121 watches it is found that X̄ = 14.5 years and S = 2 years. Construct a (i) 95%, (ii) 99% confidence interval for µ.
(i) 14.5 − 1.96 × 2/11 ≤ µ ≤ 14.5 + 1.96 × 2/11 ⇐⇒ 14.14 ≤ µ ≤ 14.86.
(ii) 14.5 − 2.575 × 2/11 ≤ µ ≤ 14.5 + 2.575 × 2/11 ⇐⇒ 14.03 ≤ µ ≤ 14.97.
Note that the greater the confidence the greater the interval.

If n is small (< 30) this is not very accurate, even if the original X is normal. In this case we must use the following:

Theorem: If X1, X2, ..., Xn are independent normally distributed random variables, each with mean µ and variance σ², then the random variable (X̄ − µ)/(S/√n) is a t-distribution with n − 1 degrees of freedom.
We denote the number of degrees of freedom n − 1 by ν. For each ν the t-distribution is a symmetric bell-shaped

distribution.
For ν = ∞ we get the standard normal distribution N(0, 1). The statistical tables usually read P(|T| > k) for each ν.

Example: ν = 5. P(|T| > k) = 0.01. Then k = 4.032, so P(−4.032 ≤ T ≤ 4.032) = 99%.

Example: A certain population is normal with unknown mean and variance. A sample of size 20 is taken. The sample mean is 15.5 and the sample variance is 0.09. Obtain a 99% confidence interval for µ, the population mean.

Since n = 20 < 30 we must use the t-distribution with ν = n − 1 = 19. We have X̄ = 15.5 and S² = 0.09. For ν = 19 we have P(|T| > k) = 0.01, giving k = 2.861.
Now (X̄ − µ)/(S/√20) is a t-distribution with 19 degrees of freedom, so
P(−2.861 ≤ (15.5 − µ)/(0.3/√20) ≤ 2.861) = 99% ⇐⇒
P(15.5 − 2.861 × 0.3/√20 ≤ µ ≤ 15.5 + 2.861 × 0.3/√20) = 99% ⇐⇒
P(15.308 ≤ µ ≤ 15.692) = 99%.

Example: Five independent measurements, in degrees F, of the flashpoint of diesel oil gave the results 144, 147, 146, 142, 144. Assuming normality, determine a (i) 95%, (ii) 99% confidence interval for the mean flashpoint.
Since n < 30 we must apply the t-distribution. n = 5, so ν = 4. We have X̄ = (144 + 147 + 146 + 142 + 144)/5 = 144.6. Also S² = ((−0.6)² + (2.4)² + (1.4)² + (−2.6)² + (−0.6)²)/4 = 3.8, so S = 1.949.
(i) P(144.6 − 2.776 × 1.949/√5 ≤ µ ≤ 144.6 + 2.776 × 1.949/√5) = 95% ⇐⇒ P(142.18 ≤ µ ≤ 147.02) = 95%.
(ii) P(144.6 − 4.604 × 1.949/√5 ≤ µ ≤ 144.6 + 4.604 × 1.949/√5) = 99% ⇐⇒ P(140.59 ≤ µ ≤ 148.61) = 99%.
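The same arithmetic for the flashpoint example in Python; the critical values 2.776 and 4.604 are taken from t-tables for ν = 4 (they are not computed here):

```python
from math import sqrt
from statistics import mean, stdev

data = [144, 147, 146, 142, 144]
n = len(data)
xbar = mean(data)     # 144.6
s = stdev(data)       # sample standard deviation (n - 1 divisor) = 1.949...

def t_interval(t_crit):
    half = t_crit * s / sqrt(n)
    return xbar - half, xbar + half

print(t_interval(2.776))   # 95%: ~(142.18, 147.02)
print(t_interval(4.604))   # 99%: ~(140.59, 148.61)
```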

Hypothesis Testing

Suppose that a claim is made about some parameter of a population, in our case always the population mean µ. This claim is called the null hypothesis and is denoted by H0. Any claim that differs from this is called an alternative hypothesis, denoted by H1. We must test H0 against H1.

Example: H0 : µ = 90.
Possible alternatives are
H1 : µ ≠ 90
H1 : µ > 90
H1 : µ < 90
H1 : µ = 95.

We must decide whether to accept or reject H0. If we reject H0 when it is in fact true we commit what is called a type I error, and if we accept H0 when it is in fact false we commit a type II error. The maximum probability with which we would be willing to risk a type I error is called the level of significance of the test, usually 10%, 5% or 1%. We perform a hypothesis

test by taking a random sample from the population.
Suppose that we are given H0 : µ = µ0, some fixed value.
(i) We suspect that µ ≠ µ0. This is our H1.
We take a random sample and compute X̄. We might have X̄ > µ0, X̄ < µ0 or X̄ = µ0. Now if the mean is µ0, then X̄ is approximately N(µ0, σ²/n), or (X̄ − µ0)/(σ/√n) is approximately N(0, 1).
There is a 5% probability that X̄ is in either of the end regions of N(µ0, σ²/n) or, equivalently, that (X̄ − µ0)/(σ/√n) is in either of the end regions of N(0, 1). If our (X̄ − µ0)/(σ/√n) is in this "rejection region" we reject H0 at the 5% significance level. Otherwise we do not reject H0. This is called a "two-tailed" test.
(ii) We suspect that µ > µ0. This is our H1.
Our X̄ is now > µ0 (this is why we suspect that µ > µ0). We only check for probability on the right-hand side.

Again, if the mean is µ0, then if our (X̄ − µ0)/(σ/√n) is in this "rejection region" we reject H0 at the 5% significance level. Otherwise we do not reject H0. This is called a "one-tailed" test.
Note that a bigger µ0 may push X̄ into the non-rejection region.
(iii) We suspect that µ < µ0. This is our H1.
This is the same as (ii) but on the left.

Example: A battery company claims that its batteries have an average life of 1,000 hours. In a sample of 100 batteries it was found that X̄ = 985 hours and S = 30 hours. Test the hypothesis H0 : µ = 1,000 hours against the alternative hypothesis H1 : µ ≠ 1,000 hours at the 5% significance level, assuming that the lifetime of the batteries is normally distributed.
n = 100 > 30 so we can take S for σ. If µ = 1,000, then (X̄ − µ)/(S/√n) is approximately N(0, 1). We are interested in extreme values of X̄ on both sides of µ = 1,000 so we use a "two-tailed" test. Values of (X̄ − µ)/(S/√n) will be between −1.96 and 1.96 95% of the time. For our sample (X̄ − µ)/(S/√n) = (985 − 1,000)/(30/√100) = −5, which is (deep) in the rejection region. So we reject H0 at the 5% significance level. There is a 5% probability of a type I error.
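The test statistic for the battery example, as a short Python sketch:

```python
from math import sqrt

def z_statistic(xbar, mu0, s, n):
    """(X̄ - µ0) / (s / sqrt(n)), approximately N(0, 1) for large n under H0."""
    return (xbar - mu0) / (s / sqrt(n))

z = z_statistic(985, 1_000, 30, 100)
print(z)               # -5.0
print(abs(z) > 1.96)   # True -> reject H0 at the 5% level (two-tailed)
```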

Example: A researcher claims that 10 year old children watch 6.6 hours of television daily. In a sample of 100 it was found that X̄ = 6.1 hours and S = 2.5 hours. Test the hypothesis H0 : µ = 6.6 hours against the alternative H1 : µ ≠ 6.6 hours at the (i) 5%, (ii) 1% significance levels.
n = 100 > 30, so we can take S for σ. Then (X̄ − µ)/(S/√n) is approximately N(0, 1).
(i) If µ = 6.6, then (X̄ − µ)/(S/√n) is between −1.96 and 1.96 with probability 95%. But (X̄ − µ)/(S/√n) = (6.1 − 6.6)/(2.5/10) = −2 is in the rejection region. We reject H0 at the 5% level.
(ii) If µ = 6.6, then (X̄ − µ)/(S/√n) is between −2.575 and 2.575 with probability 99%. But, as above, (X̄ − µ)/(S/√n) = −2, which now is in the non-rejection region. We do not reject H0 at the 1% level.

Example: A manufacturer produces bulbs that are supposed to burn with a mean life of at least 3,000 hours. The standard deviation is 500 hours. A sample of 100 bulbs is taken and the sample mean is found to be 2,800 hours. Test the hypothesis H0 : µ ≥ 3,000 hours against the alternative H1 : µ < 3,000 hours at the 5% significance level.
In this case if our X̄ value is greater than 3,000 we do not reject H0, since it agrees with H0, so we are only interested in extreme values on the left. We use a "one-tailed" test. Again (X̄ − µ)/(σ/√n) is approximately N(0, 1) and (X̄ − µ)/(σ/√n) ≥ −1.645 with a probability of 95%. But X̄ = 2,800, n = 100 and σ = 500, so (X̄ − µ)/(σ/√n) = (2,800 − 3,000)/(500/10) = −4. Hence we reject H0 at the 5% significance level.

We also need to use the t-distribution.

Example: We need to buy a length of a certain type of wire. The manufacturer claims that the wire has a mean breaking

limit of 200 kg or more. We suspect that the mean is less. We have H0 : µ ≥ 200 and H1 : µ < 200. We take a random sample of 25 rolls of wire and find that X̄ = 197 kg and S = 6 kg. Test H0 against H1 at the 5% level, assuming the breaking limit of the wire is normally distributed.
Here n = 25 < 30, so we must use a t-distribution with ν = 24. If the mean is µ, then (X̄ − µ)/(S/√n) is a t-distribution with 24 degrees of freedom.
P(|T| > 1.711) = 10%, so P(T ≤ −1.711) = 5%, and (X̄ − µ)/(S/√n) = (197 − 200)/(6/√25) = −2.5, which is in the rejection region. We reject H0.