Welcome to STAB52

Instructor: Dr. Ken Butler

Contact information

(on Intranet: intranet.utsc.utoronto.ca, My Courses)

• E-mail: [email protected]

• Office: H 417

• Office hours: to be announced

• Phone: 5654 (416-287-5654)

Probability Models

Measuring uncertainty

In life, often faced with own ignorance:

• don’t know what winning lottery number will be

• don’t know tomorrow’s weather

• don’t know what traffic will be like on way home tonight

• don’t know who next mayor of Toronto will be

This course will not give you answers to above, but will teach you

how to recognize own ignorance and work with it.

Why we need probability theory

Consider a couple of apparently reasonable gambles:

• A friend has 3 cards, one red both sides, one black both sides,

the other one colour each side. Friend picks card at random,

places on table. Side showing is red. Offers $4 against your $3

that other side also red.

• Another friend suggests flipping a (fair) coin 1000 times. If coin

comes up heads 600 or more times, he pays you $100, else you

pay him just $1.

Cards: might think either colour equally likely, in which case bet is

good one. But in fact other side will be red 2/3 of the time. Losing

bet in long run. (Conditional probability, expectation.)

Coin: getting 600 or more heads very unlikely, too unlikely to make this bet pay off in long run. (Law of large numbers, expectation.)

If you have friends like this, get some new ones!
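(Aside, not in the original slides: the card bet is easy to check by simulation. A minimal sketch, assuming Python 3; the variable names are mine.)

import random

# Three cards: red/red, black/black, red/black.
cards = [("R", "R"), ("B", "B"), ("R", "B")]
shown_red = other_red = 0
for _ in range(100_000):
    up, down = random.sample(random.choice(cards), 2)  # random card, random side up
    if up == "R":
        shown_red += 1
        other_red += (down == "R")
print(other_red / shown_red)  # ≈ 0.667: other side red about 2/3 of the time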

Many things in life involve assessing risk, and making decision in

light of risk:

• the bets above

• predicting earthquakes

• walking down the street

• whether to buy a house

• whether machinery in factory will work properly

• in science, what experiment to do, and how to interpret results

• insurance

Probability Models

To talk about risk, need to assess:

• what might happen

• how likely each thing is to happen.

List of all possible outcomes called sample space, denoted S.

Eg. in Lotto 6/49, you pick 6 numbers, and get prize according to

how many of winning numbers you match (including “bonus”), so

S = {0, 2 + b, 3, 4, 5, 5 + b, 6}

Things in S called outcomes. 0 means “no prize”.

Outcomes, events, probability measure

Subsets of S are called events. Example: {0, 2 + b} means “no prize or two-plus-bonus prize”.

A probability measure must assign to each event A a probability,

written P (A). P has these properties:

• 0 ≤ P (A) ≤ 1

• P (∅) = 0 (“impossible for nothing to happen”)

• P (S) = 1 (“something must happen”)

• If A1, A2 . . . are disjoint events,

P (A1 ∪ A2 ∪ · · · ) = P (A1) + P (A2) + · · · .

“Disjoint” means the events have no outcomes in common.

So {0, 2 + b} and {4, 5} are disjoint, but {0, 2 + b} and

{2 + b, 3} are not.

Last property says

P ({0, 2 + b, 4, 5}) = P ({0, 2 + b}) + P ({4, 5})

but says nothing about P ({0, 2 + b, 3}).

Probability model

A probability model consists of a (nonempty) sample space S,

subsets of S called events, and a probability measure P satisfying

properties above.

Usually give probability measure for each outcome, eg. for lottery:

Outcome Probability

6 0.00000007

5 + b 0.0000004

5 0.000018

4 0.000968

3 0.017544

2 + b 0.0123457

0 0.96875

These sum to 1 (to accuracy shown).

Example 2: flip a fair coin. It can come up heads (H) or tails (T), so S = {H, T}. Also

1 = P(S) = P({H, T}) = P({H}) + P({T}),

so P({H}) = P({T}) = 0.5 (fair coin, sum to 1).

Example 3: flip 2 fair coins. Then

S = {HH, HT, TH, TT}

with any outcome equally likely, so eg. P(HT) = 1/4. But P(1 head) = P({HT}) + P({TH}) = 1/2, different from P(0 heads) = P(2 heads) = 1/4.

Venn Diagrams

When handling events, ie. subsets of S, nice to be able to draw

picture of what we mean (easier for thinking, too).

A Venn Diagram is a rectangle representing S containing circles

representing events A,B, . . .. Some pictures on later pages.

Definitions:

• Subset Ac = {s : s ∉ A} called complement of A.

• Intersection A ∩ B = {s : s ∈ A and s ∈ B}.

• Union A ∪ B = {s : s ∈ A or s ∈ B}.

• A ∩ Bc called complement of B in A: “in A but not in B”.

If A and B disjoint, draw as two non-overlapping circles.

[Four slides of Venn diagram pictures omitted.]

Facts from Venn diagrams

In the 3rd diagram, the area outside the two circles is “(everything

not in A) and (everything not in B)”, ie. Ac ∩ Bc.

Now, A ∪ B is everything inside the two circles (“in A or B or

both”), and everything outside the two circles is (A ∪ B)c. Thus

(A ∪ B)c = Ac ∩ Bc.

Similar logic gives

(A ∩ B)c = Ac ∪ Bc :

the elements not in (both A and B) are those (not in A) or (not in

B).

Summary

• Sample space contains outcomes, events (collections of

outcomes)

• Probability measure gives prob. of events (outcomes), between

0 and 1.

• Prob. of whole sample space is 1; probs of disjoint events add.

• Probability model contains sample space, events, probability measure.

• Venn diagrams: picture of events, disjoint or not.

Properties of probability models

Ac denotes event “A does not happen”. (Flipping 2 coins: if A = {HH}, then Ac = {HT, TH, TT}.)

A and Ac are disjoint, and A ∪ Ac = S (must either flip HH or not-HH). But P(A ∪ Ac) = P(A) + P(Ac) and P(S) = 1. Thus

P(A) + P(Ac) = 1, or P(Ac) = 1 − P(A).

This is often easiest way to find prob. of an Ac.

Coin-flipping ex.: P(A) = 1/4, so P(Ac) = 1 − 1/4 = 3/4.

Total probability version 1

Above, A and Ac were disjoint, and A ∪ Ac = S. Disjoint events

whose union is S called partition of S. Let A1, A2, . . . , An be a

partition of S. Suppose we have another event B. What can we say

about P (B)?

Draw a Venn diagram. B consists of the bit of B intersecting with

A1, the bit intersecting with A2, etc. In symbols:

B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ · · · ∪ (An ∩ B)

and, using addition rule,

P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B);

because the Ai are disjoint, the Ai ∩ B are too.

Called law of total probability .

This is version 1 of total prob. law; see version 2 (more useful) later.

When B is a subset of A

When B ⊆ A, A contains all outcomes in B plus more, so expect

P (B) ≤ P (A). Get to that in a minute.

A has 2 parts: the bit intersecting with B, and the bit not. Hence

A = (A ∩ B) ∪ (A ∩ Bc).

But A ∩ B is just B, and these 2 parts of A are disjoint. (Or think

of B,Bc as partition of S).

Hence

P (A) = P (B) + P (A ∩ Bc).

Two followups to this, the second because P (A ∩ Bc) ≥ 0:

P (A ∩ Bc) = P (A) − P (B)

P (A) ≥ P (B).

Inclusion-exclusion

Back to general A and B.

If A,B disjoint, then P (A ∪ B) = P (A) + P (B). But if not?

Draw a Venn diagram of the general case where A and B overlap.

Shade A and B (next page), find you shaded A ∩ B twice.

Therefore need to subtract once (“exclude”) to get

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

[Slide of the Venn diagram with A and B shaded, showing A ∩ B covered twice, omitted.]

Or, to show mathematically, need to be careful. A ∪ B consists of:

the bit in A but not B, the bit in B but not A, and the bit in both A

and B. These bits are disjoint, so

A ∪ B = (A ∩ Bc) ∪ (B ∩ Ac) ∪ (A ∩ B)

and

P (A ∪ B) = P (A ∩ Bc) + P (B ∩ Ac) + P (A ∩ B).

By total probability, P (A) = P (A ∩ B) + P (A ∩ Bc) so

P (A ∩ Bc) = P (A) − P (A ∩ B);

likewise

P (B ∩ Ac) = P (B) − P (A ∩ B).

Hence

P(A ∪ B) = (P(A) − P(A ∩ B)) + (P(B) − P(A ∩ B)) + P(A ∩ B)
 = P(A) + P(B) − P(A ∩ B),

as we wanted.

Example: suppose that an employee arrives late with probability

0.10, leaves early with probability 0.05, and does both with

probability 0.02. What is the probability that the employee will either

arrive late, leave early, or both?

Let A be “arrives late” and B be “leaves early”. Want P (A ∪ B),

but A and B are not disjoint (both can happen). Thus:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

= 0.10 + 0.05 − 0.02 = 0.13.

Summary

• Prob. of “not A” is 1 − P (A).

• If A1, . . . , An is partition of S, can write P (B) in terms of

P (Ai ∩ B) — total probability.

• When B is subset of A, P (A ∩ Bc) = P (A) − P (B) and

P (A) ≥ P (B).

• For any A and B, P (A∪B) = P (A) + P (B)−P (A∩B).

Equally likely outcomes

If the outcomes in S are equally likely, and there are |S| of them, probability of each is 1/|S|. If event A has |A| outcomes in it, additivity says

P(A) = |A| / |S|.

(As a check, P(S) = |S|/|S| = 1 as it should be.)

Advantage of this: can find P(A) by counting number of outcomes in A.

Some examples

• Flipping a fair coin. S = {H, T} so |S| = 2 and P(H) = P(T) = 1/2.

• Rolling a fair (six-sided) die. Now S = {1, 2, 3, 4, 5, 6} and |S| = 6, so eg. P({3}) = 1/6 and P({2, 3, 4}) = 3/6.

• Flip a fair coin and roll a fair 6-sided die. S = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}. Now |S| = 12 and P(s) = 1/12 for any outcome s.

Combinatorial principles

Said that we can find a P (A) by counting ways in which A can

happen. Look at some ways to do this.

Multiplication principle

In 3rd example above, flipped a coin and rolled a die. Turned out to

be 2 × 6 = 12 possible outcomes, with 2 possible coin flips and 6

possible die rolls.

Mathematically: have sample spaces S1, . . . , Sk. Get one outcome

from each. Total possible outcomes |S1| × |S2| · · · × |Sk|.

Same applies to probabilities, if selections from each sample space

independent. That is, as long as knowing one result tells you

nothing about other results.

Example: flip a fair coin, roll a fair die (independently). P(H) = 1/2, P({5, 6}) = 2/6, so

P(H and {5, 6}) = (1/2) · (2/6) = 1/6.

Another example: roll 2 fair dice, count total number of spots. P(total 10)?

6 × 6 = 36 possible outcomes, equally likely. Those giving total 10 are (4, 6), (5, 5), (6, 4), so P(total = 10) = 3/36.
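(Aside, not in the slides: the count is small enough to brute-force in Python.)

from fractions import Fraction

outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
favourable = [o for o in outcomes if sum(o) == 10]
print(favourable)                                # [(4, 6), (5, 5), (6, 4)]
print(Fraction(len(favourable), len(outcomes)))  # 1/12, i.e. 3/36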

Permutations

What if we make more than one selection from the same set? Not

independent any more.

If the order of selection matters, dealing with permutations .

Example: 4 people eating lunch. How many ways can they sit at a

4-person table?

First person to sit down chooses one of the 4 seats. Then 2nd

person chooses one of the 3 seats left, 3rd person chooses 1 of 2

seats left, last person sits in remaining seat. Number of ways: 4 × 3 × 2 × 1 = 24 = 4!, read “4 factorial”.

In general:

• number of ways to arrange k items, if order matters, is k!

• number of ways to select k items out of n, if order matters, is

n(n − 1) · · · (n − k + 1) = n!/(n − k)!.

Example of latter: a student society wants to choose a President,

Vice-President and Secretary from its 10 members (so order

matters). Number of ways to do this:

10!/7! = 720,

or just 10 × 9 × 8 = 720.
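(Aside, not in the slides: the standard library has these counts built in, assuming Python 3.8 or later.)

import math

print(math.factorial(4))  # 24 ways to seat 4 people
print(math.perm(10, 3))   # 720 ordered choices of 3 officers from 10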

Counting subsets

Consider student society above.

Already saw: 720 ways to choose 3 of 10 if order matters. Think of

this as two-stage process:

• first choose 3 members to be society officers

• then decide who gets which role

Know final answer is 720. If can figure out ways to get from 1st

stage to 2nd, can figure out number of ways to do 1st stage.

If we have chosen the 3 officers, how many ways can they be

assigned to President, Vice-President and Secretary? This is

3! = 6.

So number of ways to choose the 3 officers from the 10 members

(with order not mattering) is 720/6 = 120.

In general: take the number of permutations, divide by factorial of number of items, so number of subsets of k items out of n is

(n choose k) = n! / (k! (n − k)!),

read “n choose k”.

In our student society, n = 10 and k = 3, giving

(10 choose 3) = 10! / (3! 7!) = 120,

as above.

Example 2: suppose we flip 5 fair coins. What is probability of getting exactly 3 heads?

2 possible outcomes for each coin flip (H/T). So 2 × 2 × 2 × 2 × 2 = 2^5 = 32 possible equally likely outcomes. Out of these, (5 choose 3) = 10 have 3 heads, so P(3 heads) = 10/32.
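(Aside, not in the slides: math.comb computes “n choose k”, so both results can be checked directly.)

import math

print(math.comb(10, 3))        # 120 committees of 3 from 10
print(math.comb(5, 3) / 2**5)  # 0.3125 = 10/32 = P(3 heads in 5 flips)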

Example 3: suppose coins no longer fair; let P(H) = θ, so P(T) = 1 − θ (not 1/2). Outcomes no longer equally likely, but those outcomes giving 3 heads and 2 tails each have prob. θ^3 (1 − θ)^(5−3). Still (5 choose 3) = 10 of them, so P(3 heads) = 10 θ^3 (1 − θ)^2.

More generally, P(k heads) when flipping n coins is

(n choose k) θ^k (1 − θ)^(n−k).

The (n choose k) called binomial coefficients: what you get if you expand (θ + (1 − θ))^n in a binomial series.

Sets divided into more than 2 types

Above, two types of outcome: “heads”/”tails”, “selected/not

selected”. Suppose there are now more than two:

Student society again. Now choose 3 officers, 4 members to form a

committee, out of 10 members total.

One approach: select 3 officers out of 10 members, then out of 10 − 3 = 7 remaining members, select 4 to be on committee. Then multiply to get number of ways as

(10 choose 3)(7 choose 4) = 120 × 35 = 4200.

Or write as factorials:

(10! / (3! 7!)) · (7! / (4! 3!)) = 10! / (3! 4! 3!) = 4200.

In general, number of ways to divide set of n items into subsets of sizes k1, k2, . . . , kl with k1 + · · · + kl = n is

(n choose k1, k2, . . . , kl) = n! / (k1! k2! · · · kl!),

called a multinomial coefficient. As with a binomial coefficient, have to include those items not explicitly selected, eg. those members who are not officers or on the committee.
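(Aside, not in the slides: a small helper computes the multinomial coefficient; the function name is mine.)

import math

def multinomial(*ks):
    """n! / (k1! k2! ... kl!) where n = k1 + ... + kl."""
    out = math.factorial(sum(ks))
    for k in ks:
        out //= math.factorial(k)
    return out

print(multinomial(3, 4, 3))  # 4200: officers, committee, everyone else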

Summary

• Equally likely outcomes: count number in A and S, then

P (A) = |A|/|S|.

• Multiplication: a ways to do A, b to do B, ab ways to do A and

B.

• Selections of r objects from set of n:

– if order matters, n!/(n − r)! ways (permutation)

– if order does not matter, (n choose r) = n!/(r!(n − r)!) ways (binomial coeff).

• extensions to more than 2 types of object in set (multinomial

coeff).

Conditional probability and independence

Suppose we flip 3 fair coins. P(first coin H) = 1/2.

But now, suppose as well that someone tells us that 2 of the 3 coins came up H. Since there were “more heads than average”, this changes our opinion about P(first coin H): expect it to be higher than 1/2 now.

We know that 2 of the 3 coins came up H, so can ignore any outcomes not containing two H – that is, one of HHT, HTH, THH must have happened. Out of these, 2 of the three have the first coin H. In symbols:

P(1st coin H | 2 coins H) = 2/3;

given that 2 coins were H, prob. that 1st coin is H is now 2/3. Called conditional probability.

For general events A and B, work out P(A|B) by saying “out of all the ways B can happen, how many have A happen as well”, ie.

P(A|B) = P(A ∩ B) / P(B).

In coin example, A = {HHH, HHT, HTH, HTT} (“H on 1st flip”) and B = {HHT, HTH, THH} (“two H”). A ∩ B = {HHT, HTH}, and there are 2^3 = 8 possible outcomes altogether, so

P(A|B) = P(A ∩ B) / P(B) = (2/8) / (3/8) = 2/3.
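(Aside, not in the slides: the 8 outcomes are few enough to enumerate.)

from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))        # all 2^3 = 8 outcomes
B = [o for o in outcomes if o.count("H") == 2]  # exactly two heads
A_and_B = [o for o in B if o[0] == "H"]         # ...with first coin H too
print(Fraction(len(A_and_B), len(B)))           # 2/3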

Definition of conditional probability P (B|A) can be rearranged:

P (A ∩ B) = P (A) P (B|A);

in words, prob. of A and B both happening is prob. of A happening

times prob. of B happening given that A has happened. This is

general multiplication rule: can calculate P (A ∩ B) regardless of

whether or not A happening affects chance of B happening.

Total probability (again)

Recall total probability: if A1, A2 . . . , An is a partition of S, and B

is some event,

P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B).

Replace P (Ai ∩ B) with P (Ai) P (B|Ai):

P (B) = P (A1) P (B|A1)+P (A2) P (B|A2)+· · ·+P (An) P (B|An).

Called law of total probability, version 2 .

Often use this law in two-stage systems: make 1 choice, then

depending on that choice, make a 2nd choice.

Example: suppose pot 1 has 3 red balls and 2 blue balls, and pot 2 has 1 red ball and 3 blue balls. If I choose pot 1 with prob. 2/3, what is prob. of drawing a red ball?

Let B be “drawing a red ball” and let Ai be “choose pot i”.

Then P(B|A1) = 3/5 (“if I choose pot 1, there are 3 red balls out of 5 to choose”), and P(B|A2) = 1/4. Thus

P(B) = (2/3) · (3/5) + (1/3) · (1/4) = 29/60.

Near 50-50 chance: likely to choose pot with more red balls, but may choose pot with very few red balls.

Bayes’ theorem

How do we reverse conditional probabilities, ie. from P(B|A) how do we get P(A|B)? Using definitions:

P(A|B) = P(A ∩ B) / P(B) = (P(B ∩ A) / P(B)) · (P(A) / P(A)) = (P(A) / P(B)) · P(B|A).

Formula called Bayes’ theorem.

Revisit previous example: what is prob. I picked pot 1, given that I chose a red ball?

In notation of that example: want P(A1|B) = (P(A1) / P(B)) · P(B|A1).

P(A1) = 2/3 (given before), P(B|A1) = 3/5, P(B) = 29/60 (found before). Thus

P(A1|B) = ((2/3) / (29/60)) · (3/5) = 24/29.

Very likely to have picked pot 1 if we chose red ball.

Often, as here, need law of total probability to figure out P (B).
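(Aside, not in the slides: the two-pot arithmetic, done with exact fractions in Python.)

from fractions import Fraction as F

pA1, pA2 = F(2, 3), F(1, 3)      # P(pot 1), P(pot 2)
pB_A1, pB_A2 = F(3, 5), F(1, 4)  # P(red | pot 1), P(red | pot 2)
pB = pA1 * pB_A1 + pA2 * pB_A2   # law of total probability
print(pB)                        # 29/60
print(pA1 * pB_A1 / pB)          # 24/29, Bayes' theorem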

Example: Testing for disease

Let A be event “have particular disease”, B be event “test positive

for that disease”. Typically P (A) = 0.01,

P (B|A) = 0.95, P (B|Ac) = 0.10: rare disease, test tends to be

accurate. If we pick a person at random:

P (B) = P (A)P (B|A) + P (Ac)P (B|Ac)

= (0.01)(0.95) + (0.99)(0.10) = 0.1085

about 11% chance of testing positive.

Real interest: prob. of having disease if you test positive:

P(A|B) = (P(A) / P(B)) · P(B|A) = (0.01 / 0.1085) · 0.95 = 0.0876;

for an apparently accurate test, this is surprisingly small.

Reason: for a rare disease, large majority of positive tests will come

from people who don’t have disease (even if positive tests in that

case rare), because disease even rarer.

Compare numbers on previous page: of 100 people, about 1 will

have disease, but about 11 will test positive (so about 10 of those

are false alarms).
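(Aside, not in the slides: the same arithmetic as code.)

pA = 0.01                 # P(disease)
pB_A, pB_Ac = 0.95, 0.10  # P(positive | disease), P(positive | no disease)
pB = pA * pB_A + (1 - pA) * pB_Ac
print(pB)                 # 0.1085
print(pA * pB_A / pB)     # ≈ 0.0876 = P(disease | positive)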

Independence of events

Let’s recycle an old example:

• Flip a fair coin and roll a fair 6-sided die. S = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}. |S| = 12 and P(s) = 1/12 for any outcome s.

Suppose we know the coin came up T. Then eg.

P(die = 3 | coin = T) = 1/6,

since only look at last 6 outcomes.

But P(die = 3) = 1/6 as well – that is, knowing coin was T told us nothing extra about die prob. In other words, coin and die results are independent: knowing one tells us nothing about the other.

Mathematically: suppose A and B are independent events. Then

P (A ∩ B) = P (A) P (B|A) = P (A) P (B).

Make this definition: if P (A ∩ B) = P (A) P (B), then A and B

are independent.

With more than two events, gets more complicated. Eg. with 3 events A, B, C, need all of these true:

P (A ∩ B) = P (A) P (B)

P (A ∩ C) = P (A) P (C)

P (B ∩ C) = P (B) P (C)

P (A ∩ B ∩ C) = P (A) P (B) P (C)

Let S = {1, 2, 3, 4} equally likely, let A = {1, 2}, B = {1, 3},

C = {1, 4}. Then A,B,C satisfy first 3 above, but not 4th, so not

independent (called pairwise independent).
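(Aside, not in the slides: the counterexample is easy to verify by enumeration.)

from fractions import Fraction as F

S = {1, 2, 3, 4}
A, B, C = {1, 2}, {1, 3}, {1, 4}
P = lambda E: F(len(E), len(S))            # equally likely outcomes
print(P(A & B) == P(A) * P(B))             # True
print(P(A & C) == P(A) * P(C))             # True
print(P(B & C) == P(B) * P(C))             # True
print(P(A & B & C) == P(A) * P(B) * P(C))  # False: 1/4 vs 1/8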

Example

Events A and B have P(A) = 0.4, P(B) = 0.2, P(A ∩ B) = 0.1. Are A and B independent?

Check: P(A) P(B) = (0.4)(0.2) = 0.08 ≠ 0.1 = P(A ∩ B), so A and B are not independent. In fact,

P(A|B) = P(A ∩ B) / P(B) = 0.1 / 0.2 = 0.5,

so if B happens, A is more likely to happen as well.

Independence and disjointness

Are independent events and disjoint events the same thing?

Short answer: NO!

Long answer: Two events A and B can be:

• independent , if P (A ∩ B) = P (A) P (B)

• related , otherwise (knowledge of A tells you about P (B)).

Related events can be:

– disjoint if P (A ∩ B) = 0,

– overlapping if P (A ∩ B) > 0.

For disjoint events, if A happens, B cannot happen: that is,

knowing that A happens tells you that P (B|A) = 0.

Summary

• Conditional probability P (A|B) is prob. of A if we know that B

has happened. P (A|B) = P (A ∩ B)/P (B).

• If A1, . . . , An partition of S, law of total prob. version 2 gives

P (B) in terms of P (B|Ai) P (Ai).

• Bayes’ theorem gives P (A|B) in terms of P (B|A).

• Total prob. and Bayes’ theorem are key tools for working with

conditional probs.

• A and B independent if P(A|B) = P(A) (equivalently, P(A ∩ B) = P(A) P(B)).

• Independent and disjoint events not the same!

Random Variables and Distributions

Random Variables

Suppose we flip two (fair) coins, and note whether each coin (ordered) comes up H or T.

• Sample space is S = {HH, HT, TH, TT}.

• Probability measure is 1/4 for each of 4 outcomes.

What about “number of heads”? Could be 0, 1 or 2:

• P(0 heads) = P(TT) = 1/4

• P(1 head) = P(TH) + P(HT) = 1/2

• P(2 heads) = P(HH) = 1/4.

“Number of heads” is random variable : function from S to R. That

is, given outcome, get value of random variable.

Random variables can be any function from S to R. If

S = {rain, snow, clear}, random variable X could be

X(rain) = 3

X(snow) = 6

X(clear) = −2.7.

Some more examples of random variables

Roll a fair 6-sided die, so that S = {1, 2, 3, 4, 5, 6}. Let X be the number of spots showing, let Y be square of number of spots. If s is number of spots on a particular roll, let W = s + 10, let U = s^2 − 5s + 3, etc.

In previous situation, let C = 3 regardless of s. C is constant

random variable.

Suppose have event A, only interested in whether A happens or

not. Define indicator random variable I to be 1 if A happens, 0

otherwise. Example (rolling die) I6(s) = 1 if s = 6, 0 otherwise.

≥, =, sum for random variables

Imagine rolling a fair die again, S = {1, 2, 3, 4, 5, 6}. Let X = s,

and let Y = X + I6.

X is number of spots, I6 is 1 if you roll a 6 and 0 otherwise. What

does Y mean?

Eg. roll a 4, X = 4, Y = 4 + 0 = 4. But if you roll a 6,

Y = 6 + 1 = 7. (That is, Y is the number of spots plus a “bonus

point” if you roll a 6.)

Sum of random variables (like Y here) for any outcome is sum of

their values for that outcome.

Also: if s = 1, 2, 3, 4, 5, values of X and Y are same. If s = 6,

X < Y .

Say that random variable X ≤ Y if value of X ≤ value of Y for

every single outcome. True in example.

Say that random variable X = Y if value of X equals value of Y

for every single outcome. Not true in example (different when

outcome is s = 6).

For constant random variable c, X ≤ c if all possible values of X

are ≤ c.

When S is infinite

When S infinite, random variable can take infinitely many different

values (but may not).

Example: S = {1, 2, 3, . . .}. If X = s, X takes all infinitely many

values in S. But define Y = 3 if s ≤ 4, Y = 2 if 4 < s ≤ 10,

Y = 1 when s > 10. Y has only finitely many (3) different values.

Summary

• Random variable is function from S to R: from outcome, get

real number.

• Indicator IA is 1 if event A happens, 0 if not.

• Random variable X ≥ Y if value of X ≥ value of Y for all

outcomes. Same idea for =.

• Random variable X + Y for outcome s is X(s) + Y (s).

• When S infinite, random variable may or may not take infinitely

many different values.

Distributions of random variables

A random variable can be described by listing all its possible values and their probabilities. Started this chapter with a coin-flipping example:

Flip two (fair) coins, and note whether each coin (ordered) comes up H or T.

Let X be “number of heads”. Could be 0, 1 or 2:

• P(X = 0) = P(TT) = 1/4

• P(X = 1) = P(TH) + P(HT) = 1/2

• P(X = 2) = P(HH) = 1/4.

Called the distribution of X.

Notice how can talk about P (X = s) for all s. In this case, listing

all the s for which P (X = s) > 0 describes distribution.

Consider now random variable U taking values in [0, 1] with

P (a ≤ U ≤ b) = b − a

for 0 ≤ a ≤ b ≤ 1. Try to figure out eg. P (U = 0.4): is

P (0.4 ≤ U ≤ 0.4) = 0.4 − 0.4 = 0.

Can’t define probability of a value, but still can define probability of

landing in subset of R (namely interval).

To account for all of this, define distribution of random variable X as: collection of probabilities P(X ∈ B) for all subsets B of real numbers.

Works for both examples above. Eg. in first example, P(X ≤ 1) = P(X = 0) + P(X = 1) = 3/4.

In practice, often messy to define probabilities for “all possible

subsets”. Think first about examples like 1st, “discrete”, where can

talk about probabilities of individual values. Then consider

“continuous” case (like 2nd), where have to look at intervals.

Discrete distributions

Often it makes sense to talk about individual probs, P(X = x). When all probability included in these probs, ie.

∑_{x∈R} P(X = x) = 1,

don’t need to look at anything else.

Another way to look at it: there is a finite or countable set of x values, x1, x2, . . ., each having probability pi = P(X = xi), such that ∑_i pi = 1.

Either of these is definition of discrete distribution.

Compare case where P(a ≤ X ≤ b) = b − a: P(X = x) = 0 for all x, so not discrete distribution.

Another example: suppose X = −1 with prob 1/2, and for 0 ≤ a ≤ x ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So not a discrete distribution.

Notation for discrete distributions (emphasize function):

pX(x) = P (X = x)

called probability function or mass function .

Now look at some important discrete distributions.

Degenerate distributions

If random variable C is constant, equal to c, then P(C = c) = 1 and P(C = x) = 0 for any x ≠ c. Since ∑_{x∈R} P(C = x) = P(C = c) = 1, is a proper (though dull) discrete distribution. Called degenerate distribution or point mass.

Bernoulli distribution

Flip a coin once, let X be number of heads (has to be 0 or 1).

Suppose P (head) = θ, so P (tail) = 1 − θ. Then

pX(1) = P (X = 1) = P (head) = θ;

pX(0) = P (X = 0) = P (tail) = 1 − θ.

X said to have Bernoulli distribution ; write X ∼ Bernoulli(θ).

Application: any kind of “success/failure”. Denote “success” by 1,

“failure” by 0. Or selection from population with two kinds of

individual like male/female, agree/disagree.

Binomial distribution

Now suppose we flip the coin n times (independently) and again count number of heads. Probability of exactly x heads is

pX(x) = P(X = x) = (n choose x) θ^x (1 − θ)^(n−x).

X said to have binomial distribution, written X ∼ Binomial(n, θ).

Applications: as for Bernoulli. Eg. randomly select 100 Canadian

adults, let X be number of females.

Let X ∼ Binomial(4, 0.5), Y ∼ Binomial(4, 0.2). Then

x P (X = x) P (Y = x)

0 0.0625 0.4096

1 0.2500 0.4096

2 0.3750 0.1536

3 0.2500 0.0256

4 0.0625 0.0016

X probs symmetric about x = 2, Y more likely 0 or 1.
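(Aside, not in the slides: the table can be reproduced in Python, assuming scipy is available.)

from scipy.stats import binom

for x in range(5):
    print(x, round(binom.pmf(x, 4, 0.5), 4), round(binom.pmf(x, 4, 0.2), 4))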

Bernoulli and binomial count successes in fixed number of trials.

Also look at waiting time problem: fix successes, count failures

observed to get them.

Geometric distribution

Same situation as for binomial: number of trials, independent, equal

(head) prob. θ. Let X now be number of tails before 1st head.

X = k means we observe k tails, and then a head, so

pX(k) = P(X = k) = (1 − θ)^k θ, k = 0, 1, 2, . . .

X can be as large as you like, since you might wait a long time for

the first head. (Compare binomial: can’t have more than n

successes in n trials).

X has geometric distribution, prob. θ, written X ∼ Geometric(θ).

Applications: number of working light bulbs tested until first one that

fails; number of outs (non-hits) for baseball player until first hit.

Examples: suppose X1 ∼ Geometric(0.8) and

X2 ∼ Geometric(0.5).

k P (X1 = k) P (X2 = k)

0 0.80000 0.50000

1 0.16000 0.25000

2 0.03200 0.12500

3 0.00640 0.06250

4 0.00128 0.03125

When θ larger, 1st success probably sooner.

Also: probabilities form geometric series, hence the name.
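(Aside, not in the slides: scipy’s geometric distribution counts the trial on which the first success occurs, so our X, the number of failures, has P(X = k) = geom.pmf(k + 1, θ). A sketch assuming scipy.)

from scipy.stats import geom

for k in range(5):
    print(k, round(geom.pmf(k + 1, 0.8), 5), round(geom.pmf(k + 1, 0.5), 5))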

Negative binomial distribution

To take geometric one stage further: Let r be a fixed number, let Y

be the number of tails before the r-th head.

Y = k only if we observe r − 1 heads and k tails, in any order, followed by a head (must finish with a head). There are r + k − 1 flips before the final head. Prob. of this is

pY(k) = P(Y = k) = (r + k − 1 choose r − 1) θ^(r−1) (1 − θ)^k · θ = (r + k − 1 choose k) θ^r (1 − θ)^k.

Write this Y ∼ Negative-Binomial(r, θ).

Applications: can re-use geometric distribution examples. Thus:

number of working lightbulbs tested until 5th non-working one

encountered; number of outs (non-hits) until baseball player

achieves 10th hit.

Numerical examples: let Y1 ∼ Negative-Binomial(4, 0.8) and

Y2 ∼ Negative-Binomial(3, 0.5).

k P (Y1 = k) P (Y2 = k)

0 0.40960 0.12500

1 0.32768 0.18750

2 0.16384 0.18750

3 0.06553 0.15625

4 0.02293 0.11718

5 0.00734 0.08203

6 0.00220 0.05468

With Y1, “heads” are likely so probably won’t see many tails before

4th H. With Y2, heads not so likely but only need to see 3 before

stopping.
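(Aside, not in the slides: scipy’s nbinom uses the same convention as here, k failures before the r-th success; assuming scipy.)

from scipy.stats import nbinom

for k in range(7):
    print(k, round(nbinom.pmf(k, 4, 0.8), 5), round(nbinom.pmf(k, 3, 0.5), 5))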

General note

For geometric and negative binomial, some books count total

number of trials until first (or r-th) head. Gives random variables

1 + X and r + Y as defined above.

Poisson distribution

Suppose X ∼ Binomial(n, λ/n). We’ll think of λ as being fixed

and see what happens as n → ∞. That is, what if the number of

trials gets very large but the prob. of success gets very small?

Then

P(X = x) = (n choose x) (λ/n)^x (1 − λ/n)^(n−x)
 = (n! / (x! (n − x)! n^x)) λ^x (1 − λ/n)^n (1 − λ/n)^(−x).

Thinking of x as fixed (for now) and letting n → ∞: the behaviour of the factorials is determined by the highest power of n. Thus n! behaves like n^n, (n − x)! behaves like n^(n−x), and hence

n! / ((n − x)! n^x) → 1.

Also,

(1 − λ/n)^(−x) → 1,

because 1 − λ/n → 1 and raising it to a fixed power changes nothing.

Finally,

lim_{n→∞} (1 − λ/n)^n

is a famous limit from calculus; it is e^(−λ). Thus

lim_{n→∞} P(X = x) = e^(−λ) λ^x / x!.

A random variable Y with P(Y = y) = e^(−λ) λ^y / y! is said to have a Poisson(λ) distribution, written Y ∼ Poisson(λ).

The Poisson distribution is a good model for rare events: that is, events which have a large number of “chances” to happen, but have a very small probability of happening at each “chance”. λ represents “rate” at which events happen; doesn’t have to be integer.
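(Aside, not in the slides: the limit can be watched numerically, assuming scipy.)

from scipy.stats import binom, poisson

lam, x = 2, 2
for n in (10, 100, 1000, 10000):
    print(n, binom.pmf(x, n, lam / n))  # approaches the Poisson value
print(poisson.pmf(x, lam))              # 0.2707...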

Applications of Poisson distribution are things like: number of house

fires in a city on a given day, number of phone calls arriving at a

switchboard in an hour, number of radioactive events recorded by a

Geiger counter.

Let X ∼ Poisson(2), Y ∼ Poisson(0.8):

x P (X = x), λ = 2 P (Y = x), λ = 0.8

0 0.1353 0.4493

1 0.2707 0.3595

2 0.2707 0.1438

3 0.1804 0.0383

4 0.0902 0.0077

5 0.0361 0.0012

. . . . . .

• When λ is an integer, highest probs are at λ and λ − 1.

• Otherwise highest prob is at the next integer below λ (so if λ < 1, P(X = 0) is highest).

Hypergeometric distribution

Introduction

Imagine a pot containing 10 balls, 7 red and 3 green. Prob. of

drawing a red ball is 0.7 (7/10). If we put the ball drawn back in the

pot, prob. of drawing a red ball the next time is still 0.7.

Thus, drawing with replacement, number of red balls in 4 draws R ∼ Binomial(4, 0.7). Therefore

P(R = 4) = (4 choose 4) (0.7)^4 (0.3)^0 = 0.2401.

Now suppose we draw without replacement: that is, don’t put balls

back in pot after drawing. If we draw a red ball 1st time, there are

only 6 red balls out of 9 balls left.

Should be harder to draw 4 red balls in 4 draws because there are fewer left after we draw each one: now

P(R = 4) = (7/10) · (6/9) · (5/8) · (4/7) = 0.1667.

This is not so bad, but suppose we now want P(R = 3), say? Need general principle for drawing without replacement.

The hypergeometric formula

Introduce symbols: suppose draw n balls without replacement out

of a pot containing N total. Suppose M of the balls in the pot are

red. Let X be number of red balls drawn. What is P (X = x)?

Need to count ways:

• Number of ways to draw n balls out of N in pot: (N choose n).

• Number of ways to draw x red balls out of M red balls in pot: (M choose x).

• Number of ways to draw n − x green balls out of N − M green balls in pot: (N − M choose n − x).

P(X = x) is number of ways to draw the red and green balls divided by number of ways to draw n balls out of N:

P(X = x) = (M choose x) (N − M choose n − x) / (N choose n).

X said to have hypergeometric distribution: X ∼ Hypergeometric(N, M, n). Checks: M + (N − M) = N and x + (n − x) = n. Restrictions on x?

• Number of red balls: x ≤ n and x ≤ M, so x ≤ min(n, M).

• Number of green balls: n − x ≤ n and n − x ≤ N − M, so x ≥ 0 and x ≥ n + M − N, so x ≥ max(0, n + M − N).

Example 1: let X ∼ Hypergeometric(10, 7, 4):

x P (X = x)

0 0.0000

1 0.0333

2 0.3000

3 0.5000

4 0.1667

10 balls in pot, 7 red, 4 drawn. Cannot draw 0 red, because that

would mean drawing 4 green, and only 3 in pot. (Also cannot draw

more than 4 red because only drawing 4).

Example 2: let Y ∼ Hypergeometric(5, 3, 4):

y P (Y = y)

0 0.0

1 0.0

2 0.6

3 0.4

4 0.0

5 0.0

5 balls in pot, 3 red and 2 green, draw 4. Cannot draw more than 3

red. But also cannot draw only 0 or 1 red, because that would mean

drawing 4 or 3 green, and aren’t that many in the pot.
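(Aside, not in the slides: scipy’s hypergeom takes its arguments in a different order, (total N, red M, draws n); a quick check of Example 1, assuming scipy.)

from scipy.stats import hypergeom

for x in range(5):
    print(x, round(hypergeom.pmf(x, 10, 7, 4), 4))  # 0, 0.0333, 0.3, 0.5, 0.1667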

Applications

Anything that involves drawing without replacement from a finite set

of elements. Includes sampling, eg. selecting people to include in

opinion poll. (Don’t want to select same person twice). People

sampled from might agree (red ball) or disagree (green ball) with

question asked.

Large N

If N large, might imagine that it doesn’t matter much whether you

replace balls in pot or not. In other words, for large N , binomial

would be decent approximation. Turns out to be true:

If X ∼ Hypergeometric(N,M, n) and N large, then X has

approx. same distribution as Y ∼ Binomial(n,M/N).

As an example of this, suppose

Y1 ∼ Hypergeometric(20, 14, 10),

Y2 ∼ Hypergeometric(100, 70, 10),

Y3 ∼ Hypergeometric(1000, 700, 10). Number of balls in pot

increasing, fraction of red balls always 0.7, so heading to

Y ∼ Binomial(10, 0.7).

Results: Y1 probs not near Y at all; Y2 better, Y3 better still.

y P (Y1 = y) P (Y2 = y) P (Y3 = y) P (Y = y)

0 0.000000 0.000002 0.000005 0.000006

1 0.000000 0.000058 0.000128 0.000138

2 0.000000 0.000817 0.001376 0.001447

3 0.000000 0.006438 0.008739 0.009002

4 0.005418 0.031451 0.036255 0.036757

5 0.065015 0.099637 0.102644 0.102919

6 0.243808 0.207578 0.200839 0.200121

7 0.371517 0.281163 0.268171 0.266828

8 0.243808 0.237232 0.233862 0.233474

9 0.065015 0.112708 0.120277 0.121061

10 0.005418 0.022917 0.027704 0.028248
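(Aside, not in the slides: one column of the table, recomputed to show the convergence; assumes scipy.)

from scipy.stats import binom, hypergeom

for N in (20, 100, 1000):
    M = round(0.7 * N)
    print(N, hypergeom.pmf(7, N, M, 10))  # 0.3715, 0.2812, 0.2682
print(binom.pmf(7, 10, 0.7))              # 0.2668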

Getting one probability from the previous one

When calculating a number of probs from a distribution, often

easiest to:

• calculate the first prob (often P (X = 0))

• calculate the next prob from the previous one

Often P (X = 0) easy case of general formula, and getting next

prob from previous has easy formula, as we’ll see.

Geometric distribution

Here, P(X = k) = θ(1 − θ)^k, so:

• P(X = 0) = θ(1 − θ)^0 = θ

• P(X = k + 1) / P(X = k) = θ(1 − θ)^(k+1) / (θ(1 − θ)^k) = 1 − θ, so P(X = k + 1) = (1 − θ) P(X = k).

Eg. if θ = 0.8, 1 − θ = 0.2 and

P(X = 0) = 0.8
P(X = 1) = (0.2)(0.8) = 0.16
P(X = 2) = (0.2)(0.16) = 0.032
P(X = 3) = (0.2)(0.032) = 0.0064, etc.

Poisson distribution

Here,

P(X = k) = e^(−λ) λ^k / k!,

so

P(X = 0) = e^(−λ) λ^0 / 0! = e^(−λ)

and

P(X = k + 1) / P(X = k) = (e^(−λ) λ^(k+1) / (k + 1)!) · (k! / (e^(−λ) λ^k)) = λ / (k + 1).

Eg. if λ = 1.7, to the accuracy shown:

P(X = 0) = e^(−1.7) = 0.1827
P(X = 1) = (0.1827)(1.7)/1 = 0.3106
P(X = 2) = (0.3106)(1.7)/2 = 0.2640
P(X = 3) = (0.2640)(1.7)/3 = 0.1496

and so on.

Binomial distribution

This time,

P(X = k) = (n choose k) θ^k (1 − θ)^(n−k),

so

P(X = 0) = (n choose 0) θ^0 (1 − θ)^n = (1 − θ)^n

and

P(X = k + 1) / P(X = k) = ((n choose k + 1) θ^(k+1) (1 − θ)^(n−(k+1))) / ((n choose k) θ^k (1 − θ)^(n−k))
 = (n! / ((k + 1)! (n − (k + 1))!)) · ((k! (n − k)!) / n!) · (θ / (1 − θ))
 = (n − k) θ / ((k + 1)(1 − θ)),

so

P(X = k + 1) = P(X = k) · (n − k) θ / ((k + 1)(1 − θ)).

Example: n = 3, θ = 0.2; have to keep track of what k and k + 1 are:

P(X = 0) = (1 − 0.2)^3 = 0.512
P(X = 1) = (0.512) · (3 − 0)(0.2) / ((1)(1 − 0.2)) = 0.384
P(X = 2) = (0.384) · (3 − 1)(0.2) / ((2)(1 − 0.2)) = 0.096
P(X = 3) = (0.096) · (3 − 2)(0.2) / ((3)(1 − 0.2)) = 0.008

P(X = 4) will be 0 (correct since n = 3).
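(Aside, not in the slides: the recursion scheme is the same for every distribution, so it can be written once; the helper name is mine. Shown for the binomial case just computed.)

def probs_by_recursion(first, ratio, kmax):
    """P(X=0), ..., P(X=kmax) from the first prob and the k -> k+1 ratio."""
    p, out = first, [first]
    for k in range(kmax):
        p *= ratio(k)
        out.append(p)
    return out

n, theta = 3, 0.2
ratio = lambda k: (n - k) * theta / ((k + 1) * (1 - theta))
print(probs_by_recursion((1 - theta) ** n, ratio, 3))
# [0.512, 0.384, 0.096, 0.008]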

Negative binomial

Now, P(X = k) = (r − 1 + k choose k) θ^r (1 − θ)^k, so

P(X = 0) = (r − 1 choose 0) θ^r (1 − θ)^0 = θ^r,

P(X = k + 1) / P(X = k) = ((r + k choose k + 1) θ^r (1 − θ)^(k+1)) / ((r − 1 + k choose k) θ^r (1 − θ)^k)
 = (1 − θ) · ((r + k)! / ((k + 1)! (r − 1)!)) · ((k! (r − 1)!) / (r − 1 + k)!)
 = (1 − θ)(r + k) / (k + 1).

As an example, r = 3, θ = 0.9:

P(X = 0) = 0.9^3 = 0.729
P(X = 1) = (0.729)(0.1)(3/1) = 0.2187
P(X = 2) = (0.2187)(0.1)(4/2) = 0.04374
P(X = 3) = (0.04374)(0.1)(5/3) = 0.00729

and so on.

Hypergeometric distribution

This one has a lot of factorials to deal with:

P(X = k) = (M choose k) (N − M choose n − k) / (N choose n),

with k ≥ max(0, n + M − N).

If n + M − N ≤ 0, start with k = 0 and

P(X = 0) = (M choose 0) (N − M choose n) / (N choose n) = ((N − M)! (N − n)!) / (N! (N − M − n)!).

Otherwise, start with k = n + M − N and

P(X = n + M − N) = (M choose n + M − N) (N − M choose N − M) / (N choose n) = (M! n!) / ((n + M − N)! N!),

after some algebra. And

P(X = k + 1) / P(X = k) = ((M − k)(n − k)) / ((k + 1)(N − M − n + k + 1)),

after a lot of algebra.

Example: N = 6, M = 4, n = 3. Since 3 + 4 − 6 = 1 > 0, start with k = 1:

P(X = 1) = (4! 3!) / (1! 6!) = 0.2
P(X = 2) = (0.2) · (4 − 1)(3 − 1) / ((2)(6 − 4 − 3 + 2)) = (0.2)(3)(2)/((2)(1)) = 0.6
P(X = 3) = (0.6) · (4 − 2)(3 − 2) / ((3)(6 − 4 − 3 + 3)) = (0.6)(2)(1)/((3)(2)) = 0.2

and remaining probs are 0.

Using Minitab for probability distributions

Calculating prob. distributions by hand can be annoying. Easier to

use software.

We will use statistical software Minitab for this.

Minitab does all kinds of statistical calculations. Available:

• in computer labs on campus

• bundled with textbook

• available via e-academy.com.

See Minitab manual (Evans/Rosenthal) for more.

Discrete distributions: binomial

See manual, chapter 4.

Suppose X ∼ Binomial(20, 0.4).

P (X = 7): select Calc, Probability Distributions, Binomial. Brings

up dialog box: enter 20 as Number of Trials, 0.4 as prob. of success.

Make sure Probability radio button checked. Click on Input

Constant, enter 7. Click OK.

Probability Density Function

Binomial with n = 20 and p = 0.400000

x P( X = x)

7.00 0.1659

P (X = 7) = 0.1659.

P (X ≤ 7) similar, but now click Cumulative Prob. Get this:

Cumulative Distribution Function

Binomial with n = 20 and p = 0.400000

x P( X <= x)

7.00 0.4159

so that P (X ≤ 7) = 0.4159. (Easier than calculating all probs

and adding up.)

Poisson

Suppose X ∼ Poisson(3). What is P (X ≤ 5)?

Minitab: Calc, Probability Distributions, Poisson. Select Cumulative

Probability, enter λ value (3) in Mean box, click Input Constant and

enter 5. Results:

Cumulative Distribution Function

Poisson with mu = 3.00000

x P( X <= x)

5.00 0.9161

P (X ≤ 5) = 0.9161.

Listing probabilities

Suppose X ∼ Binomial(4, 0.6). To list all probs, enter 0,1,2,3,4

in column C1 (lower half of screen), select Binomial as before.

Select Probability, 4 for Number of Trials, 0.6 for prob. Click Input

Column, type C1. Click OK:

Binomial with n = 4 and p = 0.600000

x P( X = x)

0.00 0.0256

1.00 0.1536

2.00 0.3456

3.00 0.3456

4.00 0.1296

Summary

• Distribution of random variable is collection of probabilities

P (X ∈ B) for all subsets B of R.

• If can describe whole distribution by giving P(X = s) (so that ∑_s P(X = s) = 1), distribution called discrete.

• Also use notation pX(s) for P(X = s).

• Degenerate distribution has pC(c) = 1 for some c, and pC(s) = 0 for s ≠ c (certain to be c).

• Bernoulli distribution describes number of “successes” in 1 trial

(must be 0 or 1).

• Binomial distribution describes number of “successes” in n

independent trials, each having equal success prob.

• Geometric distribution describes waiting time (number of

failures) before first success (under same conditions as

binomial).

• Negative-binomial distribution describes waiting time (number of

failures) until r-th success, under same conditions as binomial.

• Poisson distribution describes number of occurrences of “rare”

event over fixed time or space. (Limit as n → ∞ and θ → 0

such that nθ → λ).

• Hypergeometric distribution describes number of successes in

fixed number of trials when sampling without replacement.

• Can devise (simpler) formulas for obtaining one probability in a

discrete distribution from previous ones.

• Can use Minitab to calculate probabilities from discrete

distributions.

Continuous distributions

Suppose, for random variable U ,

P (a ≤ U ≤ b) = b − a

for 0 ≤ a ≤ b ≤ 1.

Is legitimate probability since 0 ≤ b − a ≤ 1. But

P (U = a) = a − a = 0 for any a, so not discrete distribution.

Probability attached to intervals (or more generally subsets of R).

Here, probability of landing in interval [a, b] depends only on length

of interval b − a. Any intervals of same length have same

probability: no part of [0, 1] “more likely” than any other. Hence

name uniform distribution for U .

Try to get at idea of “probability of being near x”. Start with P(x ≤ U ≤ x + δ) and let δ → 0. This will head to 0, not helpful! Try

P(x ≤ U ≤ x + δ) / δ = δ/δ → 1

as δ → 0. This gets at idea of “uniformity” of distribution: probability of being “near” x same for any x ∈ [0, 1].

Cumulative distribution function

Define the cumulative distribution function (cdf) of any random

variable X to be

FX(x) = P (X ≤ x).

This is defined for any random variable, continuous or discrete

(though in discrete case, individual probabilities easier to work with).

Result:

P (u ≤ X ≤ v) = P (X ≤ v)−P (X ≤ u) = FX(v)−FX(u).

In words: prob. of being between u and v is prob. of being less than

v, minus prob. of being less than u as well.

Properties of FX(x):

• 0 ≤ FX(x) ≤ 1 (FX is a probability)

• FX(x) ≤ FX(y) whenever x ≤ y (nondecreasing): FX(x)

“collects” more probability as x increases

• limx→+∞ FX(x) = 1 (“certain to be less than +∞”)

• limx→−∞ FX(x) = 0 (“cannot be less than −∞”).

Density function

Try to generalize what we did for uniform distribution above:

P(x ≤ X ≤ x + δ) / δ = (FX(x + δ) − FX(x)) / δ.

As δ → 0, this tends to F′X(x), the derivative of FX(x).

Suggests that F′X(x) will be useful: call it fX(x), the density function of X. In sense defined above, says how likely you are to observe value “near” x.

(Careful: density function not probability, so “how likely”

interpretation only informal.)

Getting probabilities from density function

Two facts: fX(x) derivative of FX(x), and FX(x) gives probabilities. Thus probs must be integral of fX(x):

P(u ≤ X ≤ v) = ∫_u^v fX(x) dx.

Example: uniform distribution. Know that eg. P(0.2 ≤ U ≤ 0.6) = 0.6 − 0.2 = 0.4. Compare: FU(x) = P(U ≤ x) = P(0 ≤ U ≤ x) = x − 0 = x, so density function is fU(x) = F′U(x) = 1 (as before). Hence

P(0.2 ≤ U ≤ 0.6) = ∫_0.2^0.6 1 dx = [x]_0.2^0.6 = 0.6 − 0.2 = 0.4.

Summary

• Cumulative distribution function FX(x) = P(X ≤ x) defined for any random variable.

• Density function fX(x) = F′X(x), if FX differentiable (it usually is). Must have fX(x) ≥ 0 and ∫_{−∞}^{∞} fX(x) dx = 1.

• Probabilities from density function by

P(u ≤ X ≤ v) = ∫_u^v fX(x) dx.

Technical detail: if FX differentiable, so fX(x) exists, X called absolutely continuous. Possible (though not in this course) for continuous r.v. not to be absolutely continuous.

Getting cumulative distribution function from density

This requires a little care. Formula is

FX(x) = ∫_{−∞}^x fX(t) dt.

Note:

• change of variable of integration to t (since x a limit)

• lower limit of integral is lower limit of distribution; might be eg. 0.

Example: suppose fX(x) = (3 − x)/2 for 1 ≤ x ≤ 3, 0 otherwise. Then

FX(x) = (1/2) ∫_1^x (3 − t) dt = (1/4)(−5 + 6x − x^2).

Check that FX(1) = 0 and FX(3) = 1, and on interval [1, 3], FX(x) nondecreasing (what it does elsewhere irrelevant). Implies that 0 ≤ FX(x) ≤ 1 for 1 ≤ x ≤ 3.
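(Aside, not in the slides: sympy will do this integral symbolically, assuming sympy is available.)

import sympy as sp

x, t = sp.symbols("x t")
F = sp.integrate((3 - t) / 2, (t, 1, x))
print(sp.expand(F))                # -x**2/4 + 3*x/2 - 5/4
print(F.subs(x, 1), F.subs(x, 3))  # 0 and 1, as required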

A trickier example:

Suppose fX(x) = 2x/3 for 0 ≤ x ≤ 1, fX(x) = 2/3 for 1 ≤ x ≤ 2, and 0 otherwise.

Have to handle each interval separately.

For 0 ≤ x ≤ 1,

FX(x) = ∫_0^x 2t/3 dt = x^2/3.

For 1 ≤ x ≤ 2,

FX(x) = ∫_0^1 2t/3 dt + ∫_1^x 2/3 dt = 1/3 + 2x/3 − 2/3 = (1/3)(2x − 1).

In 2nd case, had to split integral defining FX into two parts because fX defined in two parts. (“Integral of density, whatever it is.”)

Important continuous distributions

The Uniform[L,R] distribution

Uniform[0, 1] has constant density 1 for 0 ≤ x ≤ 1. If now

fU(x) = 1/(R − L) for L ≤ x ≤ R, U ∼ Uniform[L,R].

Density still constant, no longer 1 because of need to integrate to 1

over [L,R].

Can show by integrating that

P (a ≤ U ≤ b) = (b − a)/(R − L)

for L ≤ a ≤ b ≤ R.

The exponential distribution

Suppose now random variable X has density fX(x) = e^(−x) for x ≥ 0 and 0 otherwise.

Legal density, X ∼ Exponential(1), because

∫_0^∞ e^(−x) dx = [−e^(−x)]_0^∞ = 0 − (−1) = 1.

Change variable in integral: x = λy, so dx = λ dy, and the limits don’t change. Thus

1 = ∫_0^∞ λ e^(−λy) dy

and fY(y) = λ e^(−λy) is a density fn for y ≥ 0: Y ∼ Exponential(λ).

(Careful when using other books/software: sometimes our λ is written 1/λ.)

Applications: lifetimes (eg. of electrical components), inter-arrival time between customers waiting for service (at bank, fast-food restaurant).

Fact about Exponential(λ):

P(X ≥ x) = ∫_x^∞ λ e^(−λt) dt = e^(−λx).

Connection between exponential and Poisson

Suppose number of customers arriving at a bank (say) in a time period ∼ Poisson(λ). Let T1 be time until next arrival:

P(T1 ≥ t) = P(no arrivals in time [0, t]) = ((λt)^0 / 0!) e^(−λt) = e^(−λt).

That is, if number arriving ∼ Poisson(λ), time to next arrival ∼ Exponential(λ).
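(Aside, not in the slides: a simulation illustrating the connection. With Exponential(λ) gaps between arrivals, counts per unit time behave like Poisson(λ); numpy assumed, and note numpy parametrizes by the scale 1/λ.)

import numpy as np

rng = np.random.default_rng(1)
lam, T = 3.0, 100_000
gaps = rng.exponential(scale=1 / lam, size=int(2 * lam * T))  # inter-arrival times
arrivals = np.cumsum(gaps)
counts = np.histogram(arrivals[arrivals < T], bins=np.arange(T + 1))[0]
print(counts.mean(), counts.var())  # both ≈ 3, as for Poisson(3)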

The Gamma(α, λ) distribution

Define the gamma function like this:

Γ(α) = ∫_0^∞ t^(α−1) e^(−t) dt.

Follows that

1 = ∫_0^∞ (t^(α−1) / Γ(α)) e^(−t) dt.

Now change variable in integral: let t = λx so dt = λ dx:

1 = ∫_0^∞ ((λx)^(α−1) / Γ(α)) e^(−λx) λ dx = ∫_0^∞ (λ^α x^(α−1) / Γ(α)) e^(−λx) dx.

In other words, fX(x) = λ^α x^(α−1) e^(−λx) / Γ(α) is a density function; X has gamma distribution, X ∼ Gamma(α, λ).

If α = 1, density function is λ^1 x^0 e^(−λx) = λ e^(−λx), the exponential density. Thus Gamma(1, λ) = Exponential(λ). That is, the exponential dist is a special case of the gamma distribution.

The gamma distribution can be used to model lifetimes (like the exponential), but has greater flexibility. See picture below. α controls the shape.

[Figure: density curves of Exponential(1), Gamma(2,1) and Gamma(3,1), showing how α controls the shape.]


The normal distribution

Consider function φ(z) = C e^(−z^2/2). Can we make this into a density, for all z?

Must have ∫_{−∞}^{∞} C e^(−z^2/2) dz = 1, so C = 1/√(2π) (text, section 2.11).

Random variable Z with this density said to have standard normal distribution, written Z ∼ N(0, 1).

Gives “bell curve”:

[Figure: the N(0,1) density function, the “bell curve”.]


Since

1 = ∫_{−∞}^{∞} (1/√(2π)) e^(−z^2/2) dz,

make change of variable z = (x − µ)/σ, so dz = (1/σ) dx. Gives

1 = ∫_{−∞}^{∞} (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2)) dx,

so that

fX(x) = (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2))

is a density function for random variable X ∼ N(µ, σ^2). X said to have a normal distribution.

Density function of X also bell-shaped, now with peak at x = µ. σ controls spread: larger σ means larger left-right spread.

Applications: often normal distribution is good approximation, especially when a measurement is a large number of “small things” added together. Examples: human body measurements such as height, weight.

Theoretical reason for this called “central limit theorem”, later.

Normal distribution probabilities

Probability is integral, as before:

P(a ≤ X ≤ b) = ∫_a^b (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2)) dx.

Problem: can’t do this integral!

Solution: evaluate numerically, to desired accuracy. Results

available in tables, eg. back of text table D2 p. 660.

Tables actually give

Φ(z) = P(Z ≤ z) = ∫_{−∞}^z (1/√(2π)) e^(−t^2/2) dt,

the cumulative distribution function of N(0, 1).

So have to write problem in terms of Φ(z).

Facts:

• since limz→∞ Φ(z) = 1 and φ(z) symmetric about 0,

Φ(−z) = 1 − Φ(z).

• P (a ≤ Z ≤ b) = Φ(b) − Φ(a).

• Hence P (Z ≥ a) = P (a ≤ Z < ∞) = 1 − Φ(a).

• Note that Table D2 only gives Φ(z) for negative z.

z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681

-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823

-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985

-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170

-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379

-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611

-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867

-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148

-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451

-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776

-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121

-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483

-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859

-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247

-0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

Table 1: Standard normal table part 2

Example: for Z ∼ N(0, 1), find P(−1.48 ≤ Z ≤ 0.56).

P(−1.48 ≤ Z ≤ 0.56) = Φ(0.56) − Φ(−1.48)
 = (1 − Φ(−0.56)) − Φ(−1.48)
 = (1 − 0.2877) − 0.0694 = 0.6429.

Example 2: for Z ∼ N(0, 1), find P(Z ≥ 0.90):

P(Z ≥ 0.90) = 1 − P(Z ≤ 0.90) = 1 − Φ(0.90)
 = 1 − (1 − Φ(−0.90)) = 1 − (1 − 0.1841) = 0.1841.

Probs for non-standard normal

So if X ∼ N(µ, σ^2), how to calculate P(a ≤ X ≤ b)?

P(z1 ≤ Z ≤ z2) = Φ(z2) − Φ(z1) = ∫_{z1}^{z2} (1/√(2π)) e^(−z^2/2) dz.

Change variables again: let z = (x − µ)/σ, so dz = (1/σ) dx:

= ∫_a^b (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2)) dx = P(a ≤ X ≤ b),

exactly what we want, except need relationship between z1, z2 and a, b:

z1 = (a − µ)/σ; z2 = (b − µ)/σ.

In other words, get z1 and z2 from a and b, and then answer is Φ(z2) − Φ(z1).

Example: suppose X ∼ N(0.5, 16), find P(0 ≤ X ≤ 2). µ = 0.5, σ^2 = 16 so σ = 4. Thus

z1 = (0 − 0.5)/4 = −0.125, z2 = (2 − 0.5)/4 = 0.375.

Because 0.375 halfway between 0.37 and 0.38, Φ(0.375) about halfway between Φ(0.37) = 1 − 0.3557 and Φ(0.38) = 1 − 0.3520. Thus Φ(0.375) = 0.64615. Likewise Φ(−0.125) = (0.4522 + 0.4483)/2 = 0.45025. Hence

P(0 ≤ X ≤ 2) = 0.64615 − 0.45025 = 0.1959.
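(Aside, not in the slides: the same probability from scipy; note scipy takes the standard deviation σ = 4, not the variance 16.)

from scipy.stats import norm

print(norm.cdf(2, loc=0.5, scale=4) - norm.cdf(0, loc=0.5, scale=4))  # 0.1959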

Getting probabilities from continuous distributions with Minitab

As with discrete distributions, can use Minitab to calculate probabilities from continuous distributions. Saves integration/use of tables.

Normal distribution

Redo previous example. If X ∼ N(0.5, 16), find P (0 ≤ X ≤ 2).

First, type values 0 and 2 into column C1.

Select Calc, Probability distributions, Normal. Click Cumulative

Prob. “Mean” is µ = 0.5, “standard deviation” is σ = 4. Select

Input Column, type in C1. Click OK:

Cumulative Distribution Function

Normal with mean = 0.500000 and standard deviation = 4.00000

x P( X <= x)

0.0000 0.4503

2.0000 0.6462

Thus P (0 ≤ X ≤ 2) = 0.6462 − 0.4503 = 0.1959.

Uniform[L,R] distribution

Suppose X ∼ Uniform[3, 7]. P(X ≥ 6)? (Can tell it’s 1/4 almost by looking at it.)

Select Calc, Prob. Distributions, Uniform. Click Cumulative

Distribution. Enter endpoints 3 and 7. Click Input Constant, enter 6.

Click OK:

Cumulative Distribution Function

Continuous uniform on 3.00000 to 7.00000

x P( X <= x)

6.0000 0.7500

So P (X ≥ 6) = 1 − P (X ≤ 6) = 1 − 0.75 = 0.25.

Exponential distribution

Suppose X ∼ Exponential(2): P (0.4 ≤ X ≤ 1.2)?

Enter values 0.4 and 1.2 into column C1. (Overtype if needed.)

Select Calc, Probability Distributions, Exponential. Click Cumulative

Probability. Then enter “mean” as 1/λ = 1/2 = 0.5. Select

Input Column and enter C1. Click OK:

Cumulative Distribution Function

Exponential with mean = 0.500000

x P( X <= x)

0.4000 0.5507

1.2000 0.9093

so P (0.4 ≤ X ≤ 1.2) = 0.9093 − 0.5507 = 0.3586.

Gamma distribution

Exponential can be done by integration; gamma cannot. (Previous: ∫_0.4^1.2 2e^(−2x) dx = 0.3586.)

Suppose X ∼ Gamma(3, 2); P(X ≤ 1.5)?

Select Calc, Prob. Distributions, Gamma. Click Cumulative Prob.

First Shape Parameter is α = 3; second is 1/λ = 0.5. Click Input

Constant, enter 1.5. Click OK.

Cumulative Distribution Function

Gamma with a = 3.00000 and b = 0.500000

x P( X <= x)

1.5000 0.5768

Prob. is 0.5768.

151
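(Aside: the three Minitab calculations above can also be reproduced in Python, assuming scipy. Note scipy, like Minitab, parametrizes exponential and gamma by the mean 1/λ, passed as "scale".)

    from scipy.stats import uniform, expon, gamma

    # Uniform[3, 7]: loc = left endpoint, scale = width
    print(1 - uniform.cdf(6, loc=3, scale=4))                     # 0.25

    # Exponential(lambda = 2): scale = 1/lambda = 0.5
    print(expon.cdf(1.2, scale=0.5) - expon.cdf(0.4, scale=0.5))  # 0.3586

    # Gamma(3, 2): shape a = 3, scale = 1/lambda = 0.5
    print(gamma.cdf(1.5, a=3, scale=0.5))                         # 0.5768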

Summary

• Get CDF from density by integrating, but: careful with variable of

integration, lower limit.

• Uniform[L,R] distribution has constant density on interval

[L,R].

• Exponential(λ) density is λe−λx for x ≥ 0.

• Exponential often used for time between events; number of

events per unit time then ∼ Poisson(λ).

• Gamma(α, λ) has density λ^α x^{α−1} e^{−λx}/Γ(α) for x ≥ 0. α controls shape.

152


• Standard normal distribution has density e^{−z²/2}/√(2π); has “bell curve” shape.

• if X ∼ N(µ, σ2), X has (non-standard) normal distribution

with peak at µ and spread controlled by σ.

• Get probabilities for Z and X using tables. Tables give CDF of

z, Φ(z); write everything else in terms of that.

• Using Minitab, can get probabilities without integration/tables.

153

Random variables neither discrete nor

continuous

Distributions can be neither discrete nor continuous.

Example from earlier: suppose X = −1 with prob 1/2, and for 0 ≤ a ≤ x ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So not a discrete distribution.

But since P(X = −1) = 1/2 > 0, not a continuous distribution either.

154

One-dimensional change of variable

Suppose X is some random variable with known distribution, and

Y = h(X), h(X) a known function. Then what is distribution of

Y ?

If X discrete, have to work out all possible values of Y (may be

finite, at worst countable).

If X continuous, may be uncountably many different values of Y .

155

X discrete

Example: flip a fair coin twice, let X be number of heads. Then

X ∼ Binomial(2, 0.5), so

P (X = 0) = 0.25, P (X = 1) = 0.5, P (X = 2) = 0.25.

Let Y = (X − 1)2. Then possible Y are:

X   Y
0   (0 − 1)² = 1
1   (1 − 1)² = 0
2   (2 − 1)² = 1

Thus P (Y = 0) = P (X = 1) = 0.5 and

P (Y = 1) = P (X = 0) + P (X = 2) = 0.25 + 0.25 = 0.5.

156


Idea similar whenever X discrete: work out all possible values of Y .

If more than one X value can lead to the same value of Y , then get

prob for Y by adding up X probs: specifically

P(Y = y) = ∑_{x: h(x)=y} P(X = x).

If h(X) 1-1, then each value of X leads to different value of Y .

(For instance, if h increasing.)

157

X continuous

If X continuous, then Y might not be continuous at all. It depends

on h(X).

Example: Let X ∼ Uniform[0, 1]. Let h(x) = 7 if x ≤ 3/4, and let h(x) = 5 if x > 3/4.

Then

P(Y = 7) = P(X ≤ 3/4) = 3/4
P(Y = 5) = P(X > 3/4) = 1/4.

These add up to 1, so (even though X continuous) Y has discrete

distribution.

158

Reason for Y being discrete: h(X) only took countably many

different values.

Usually, though, h(X) takes on uncountably many values, and

have to be more careful.

Second example: X ∼ Uniform[0, 1], Y = 3X . What is

distribution of Y ?

159

Start with cumulative distribution function:

P(Y ≤ y) = P(3X ≤ y) = P(X ≤ y/3) = y/3

since FX(x) = x if X ∼ Uniform[0, 1].

Since 0 ≤ X ≤ 1, must have 0 ≤ Y/3 ≤ 1 so 0 ≤ Y ≤ 3. And can get density of Y by differentiating: fY(y) = (y/3)′ = 1/3. Thus Y ∼ Uniform[0, 3].

In general: if h(X) increasing, work with cumulative, and substitute.

160


Change of variable using density functions

May not have cumulative distribution function available, so need a

formula that works with density functions.

Mimic what we did above, and adapt. Again assume h(X) increasing:

FY(y) = P(Y ≤ y) = P(h(X) ≤ y) = P(X ≤ h⁻¹(y)) = FX(h⁻¹(y)).

Differentiate both sides wrt y:

fY(y) = fX(h⁻¹(y)) · d/dy h⁻¹(y) = fX(h⁻¹(y)) · 1/h′(h⁻¹(y)).

This is desired result.

161

Redo above example: fX(x) = 1 for 0 ≤ x ≤ 1, Y = h(X) = 3X so h⁻¹(y) = y/3. Then h′(x) = 3 and

fY(y) = fX(y/3) · 1/h′(y/3) = (1) · 1/3 = 1/3

as before.

162
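(Aside: a quick simulation check of Y = 3X, assuming Python with numpy; not part of the slides. The empirical density of Y should be flat at 1/3 on [0, 3].)

    import numpy as np

    rng = np.random.default_rng(1)
    y = 3 * rng.uniform(0, 1, size=100_000)
    # density=True estimates the density of Y in each bin; expect about 1/3
    hist, _ = np.histogram(y, bins=3, range=(0, 3), density=True)
    print(hist.round(3))  # roughly [0.333 0.333 0.333]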

If h(X) not increasing

If h(X) decreasing, then use the same result above (with a slightly

different proof).

If h(X) not 1-1, then can have difficulties.

Example: suppose X ∼ Uniform[0, 2], and

Y = h(X) = (X − 1)2. h(X) neither increasing nor decreasing;

indeed in general two values of X giving same value of Y .

Thus care required. Eg.

P(Y ≤ 1/2) = P((X − 1)² ≤ 1/2)
= P(|X − 1| ≤ 1/√2)
= P(1 − 1/√2 ≤ X ≤ 1 + 1/√2) = 1/√2.

163

Summary

• Random variables usually discrete or continuous, but could be

neither.

• If r. v. Y is function h of X , can find distribution of Y from

distribution of X .

• if X discrete, work out all possible values of Y .

• if X continuous, Y might be discrete (depends on h(X)).

• if h(X) increasing (decreasing) function of X over domain of

X , then Y continuous. Can go via CDF of X to CDF of Y , or

work with density functions directly.

• if h(X) not 1-1, care needed.

164


Joint Distributions

Know how to describe random variables one at a time: probability

function (discrete), density function (continuous), cumulative

distribution function (either).

But two random variables X , Y might be related. Don’t have a way

to describe this.

Example: X ∼ Bernoulli(2/3). Let Y = 1 − X .

Y ∼ Bernoulli(1/3) (count failures not successes). X,Y

related, but doesn’t show in individual probability functions.

165

Joint probability functions

Can simply find probability of all possible combinations of values for

X,Y . Uses individual probability functions and relationship.

In example: if X = 0, then Y = 1; if X = 1, then Y = 0.

Possible values for Y depend on value of X . Also,

P (X = 1) = 2/3.

Notation: pX,Y (x, y) = P (X = x, Y = y) (comma is “and”),

called joint probability function . In example:

pX,Y (1, 0) = 2/3; pX,Y (0, 1) = 1/3.

Are only possible combinations of X and Y values.

166

Often convenient to depict as table. Above example:

x \ y   0     1
0       0     1/3
1       2/3   0

Another:

u \ v   0     1     2
0       1/3   1/6   1/6
1       1/6   1/12  1/12

Note that all probabilities in each case sum to 1, because joint

probability function covers all possibilities.

167

Joint density functions

If random variables continuous, joint probability function makes no

sense; instead, define joint density function f(x, y) that

expresses chance of being “near” (X = x, Y = y).

Joint density function also covers all possible values of X,Y , so

integrates to 1 when integrated over both x and y.

168


Example: f(x, y) = 4x2y + 2y5, 0 ≤ x, y ≤ 1 (page 83).

Integrate over both x and y. Since both variables lie between 0 and

1, those are the limits of integration:

∫_0^1 ∫_0^1 (4x²y + 2y⁵) dx dy = ∫_0^1 [ (4/3)x³y + 2xy⁵ ]_{x=0}^{x=1} dy
= ∫_0^1 ( (4/3)y + 2y⁵ ) dy
= [ (4/6)y² + (2/6)y⁶ ]_{y=0}^{y=1} = 1,

showing that f(x, y) is a legal joint density function.

169
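(Aside: this kind of verification can be automated symbolically; a sketch assuming the sympy package.)

    import sympy as sp

    x, y = sp.symbols('x y')
    f = 4*x**2*y + 2*y**5
    # integrate over x from 0 to 1 first, then over y from 0 to 1
    print(sp.integrate(f, (x, 0, 1), (y, 0, 1)))  # 1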

Sometimes possible values of Y depend on value of X . Account for

in integration.

Example: f(x, y) = 120x3y for x ≥ 0, y ≥ 0, x + y ≤ 1. (Thus

if X = 0.6, Y cannot exceed 0.4.) Region forms triangle: Figure

2.7.3 of text (p. 85). Verify density by letting y limits of integration

depend on x (y = 1 − x), and integrating wrt y first.

170

171

∫_0^1 ∫_0^{1−x} 120x³y dy dx = ∫_0^1 [ 60x³y² ]_{y=0}^{y=1−x} dx
= ∫_0^1 60x³(1 − x)² dx
= ∫_0^1 (60x³ − 120x⁴ + 60x⁵) dx
= [ 15x⁴ − 24x⁵ + 10x⁶ ]_0^1
= 15 − 24 + 10 = 1.

172


Bivariate normal distribution

Suppose X , Y both have standard normal distributions, and

suppose −1 < ρ < 1. Then the bivariate standard normal

distribution with correlation ρ has joint density function

f(x, y) = 1/(2π√(1 − ρ²)) exp{ −(x² + y² − 2ρxy) / (2(1 − ρ²)) }.

Plotting in 3D (Figure 2.7.4) gives a 3D bell shape.

ρ measures relationship between X and Y :

• ρ = 0: no relationship

• ρ > 0: when X > 0, Y likely > 0

• ρ < 0: when X > 0, Y likely < 0.

173

Bivariate standard normal has peak at (0, 0). Replacing x by

(x − µ1)/σ1 and y by (y − µ2)/σ2 shifts peak to (µ1, µ2) and

changes decrease of density away from peak (larger σ values mean

slower decrease).

174

Calculating probabilities

For a continuous random variable X , calculate probabilities by

integrating, eg. P(a < X ≤ b) = ∫_a^b f(x) dx.

Same idea for continuous joint distribution, integrating over x and y.

Example: f(x, y) = 120x3y for x ≥ 0, y ≥ 0, x + y ≤ 1. Find

P (0.5 ≤ X ≤ 0.7, Y > 0.2).

Draw picture: [figure omitted: region of integration is a trapezoid, x from 0.5 to 0.7, y from 0.2 up to 1 − x]

175

176


Area is trapezoid. Integrate over y first, then x. Call prob P:

P = ∫_{0.5}^{0.7} ∫_{0.2}^{1−x} 120x³y dy dx
= ∫_{0.5}^{0.7} [ 60x³y² ]_{y=0.2}^{y=1−x} dx
= ∫_{0.5}^{0.7} ( 60x³(1 − x)² − 2.4x³ ) dx
= ∫_{0.5}^{0.7} ( 57.6x³ − 120x⁴ + 60x⁵ ) dx
= [ 14.4x⁴ − 24x⁵ + 10x⁶ ]_{x=0.5}^{x=0.7}
= 14.4(0.7⁴ − 0.5⁴) − 24(0.7⁵ − 0.5⁵) + 10(0.7⁶ − 0.5⁶)
= 0.294.

177
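(Aside: a numerical check of this double integral, assuming scipy; dblquad takes the integrand as f(y, x), with y limits that may depend on x.)

    from scipy.integrate import dblquad

    p, _ = dblquad(lambda y, x: 120 * x**3 * y,     # integrand f(y, x)
                   0.5, 0.7,                        # x limits
                   lambda x: 0.2, lambda x: 1 - x)  # y limits for each x
    print(round(p, 3))  # 0.294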

Marginal distributions

Started from individual distributions for X,Y plus relationship. But:

start from joint, get individual?

One way: get distribution of X by “averaging” over distribution of Y .

Discrete: simply row and column totals. Example:

u \ v   0     1     2     Sum
0       1/3   1/6   1/6   2/3
1       1/6   1/12  1/12  1/3
Sum     1/2   1/4   1/4   1

178

Without knowledge of V , U twice as likely 0 as 1; without

knowledge of U , V twice as likely 0 as 1 or 2.

Row totals here give marginal distribution of U ; column totals

here marginal distribution of V . Each marginal distribution is

proper probability distribution (probs sum to 1).

179

Continuous: integrate over other variable. Get marginal density

function .

Example: f(x, y) = 120x3y for x ≥ 0, y ≥ 0, x + y ≤ 1.

Marginal density for X: integrate over y. Limits 0, 1 − x; get

fX(x) = ∫_0^{1−x} 120x³y dy = 60x³(1 − x)².

For Y: integrate over x, limits 0, 1 − y:

fY(y) = ∫_0^{1−y} 120x³y dx = 30y(1 − y)⁴.

“Integrating out” unwanted variable.

Alternative approach via cumulative; text page 79.

180


Example 2: bivariate standard normal. Recall standard normal density; integrates to 1, so

∫_{−∞}^{∞} (1/√(2π)) exp[−u²/2] du = 1.

Marginal distribution of x in bivariate standard normal: integrate out y:

fX(x) = ∫_{−∞}^{∞} 1/(2π√(1 − ρ²)) exp[ −(x² + y² − 2ρxy) / (2(1 − ρ²)) ] dy.

Substitution: let u = (y − ρx)/√(1 − ρ²), so du = dy/√(1 − ρ²). Then

u² = (y² − 2ρxy + ρ²x²)/(1 − ρ²)

181

which is nearly what appears inside “exp”. Precisely:

fX(x) = ∫_{−∞}^{∞} (1/2π) exp[ −(u² + x²)/2 ] du
= (1/√(2π)) exp(−x²/2) ∫_{−∞}^{∞} (1/√(2π)) exp(−u²/2) du.

Integral is 1 (of a standard normal density), so

fX(x) = (1/√(2π)) exp(−x²/2):

that is, marginal distribution of X is standard normal.

182

Conditioning and Independence

Marginal distribution: of one variable, ignorant about other.

But what if we knew X ; what then about distribution of Y ?

Example 1:

x \ y   0     1
0       0     1/3
1       2/3   0

Suppose X = 1. Then ignore 1st row.

183

But 2nd row not probability distribution (sum 2/3, not 1). Idea: divide by sum. Then if X = 1, P(Y = 0) = 1 and P(Y = 1) = 0: that is, if X = 1, Y certain to be 0. Called conditional distribution of Y given X = 1.

If X = 0, Y certain to be 1. Conditional distribution of Y different

for different X : Y depends on X .

Notation: as for conditional probability. Eg. above:

P (Y = 1|X = 0) = 1.

184


Example 2:

u \ v   0     1     2
0       1/3   1/6   1/6
1       1/6   1/12  1/12

Conditional distribution of V given U = 0? Use U = 0 row. This sums to 2/3, so divide by this to get P(V = 0|U = 0) = 1/2, P(V = 1|U = 0) = 1/4, P(V = 2|U = 0) = 1/4.

U = 1 line sums to 1/3; conditional distribution of V given U = 1 is same as given U = 0.

In example 2, does not matter what U is – conditional distribution of

V same. Say that V and U are independent .

185

Two examples give extreme cases. In Example 1, knowing X gave

Y with certainty; in example 2, knowing U said nothing about V .

Most cases in between: knowing one variable has some effect on

distribution of other.

Symbols:

P(Y = b|X = a) = P(X = a, Y = b) / ∑_y P(X = a, Y = y) = P(X = a, Y = b) / P(X = a).

Denominator is marginal probability that X = a.

186

Conditioning on continuous random variables

Continuous case: no probabilities, so replace with density functions;

replace sum by integral. This gives conditional density function :

fY |X(y|x) =fX,Y (x, y)

∫∞−∞ fX,Y (x, y) dy

=fX,Y (x, y)

fX(x),

replacing infinities by actual limits for y. Denominator depends on x

only; is marginal density function for X .

Then use conditional density to evaluate conditional probabilities.

187

Example: fX,Y(x, y) = 4x²y + 2y⁵ for x, y between 0 and 1, 0 otherwise. Find P(0.2 ≤ Y ≤ 0.3|X = 0.8).

Steps: find marginal density of X , use to find conditional density of

Y given X , integrate conditional density to find probability.

Marginal density of X is

fX(x) = ∫_0^1 (4x²y + 2y⁵) dy = [ 2x²y² + (2/6)y⁶ ]_0^1 = 2x² + 1/3.

So conditional density of Y given X is

fY|X(y|x) = (4x²y + 2y⁵) / (2x² + 1/3).

188


Note how denominator doesn’t depend on y, so

P = P(0.2 ≤ Y ≤ 0.3|X = 0.8)
= 1/(2x² + 1/3) ∫_{0.2}^{0.3} (4x²y + 2y⁵) dy
= 1/(2x² + 1/3) · [ 2x²y² + (1/3)y⁶ ]_{y=0.2}^{y=0.3}
= ( 2x²(0.3² − 0.2²) + (1/3)(0.3⁶ − 0.2⁶) ) / (2x² + 1/3)
= ( 0.1x² + 0.000665/3 ) / (2x² + 1/3) = 0.0398,

putting in x = 0.8.

189

Followup: what happens to P (0.2 ≤ Y ≤ 0.3) if X changes?

One answer: P (0.2 ≤ Y ≤ 0.3|X = 0.4) = 0.0242, compared

to P (0.2 ≤ Y ≤ 0.3|X = 0.8) = 0.0398. So probability does

change as X changes; Y does depend on X .

Here, conditioning event X = 0.8 has zero probability, so have to

use densities. Otherwise, use standard probability rules, eg.

P(0.2 ≤ Y ≤ 0.3 | 0.4 ≤ X ≤ 0.8) = P(0.4 ≤ X ≤ 0.8, 0.2 ≤ Y ≤ 0.3) / P(0.4 ≤ X ≤ 0.8)

worked out the usual way with integrals.

190

Law of total probability

Because

fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),

also true that

fX,Y(x, y) = fX(x) fY|X(y|x).

191

So

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_c^d ∫_a^b fX,Y(x, y) dx dy = ∫_c^d ∫_a^b fX(x) fY|X(y|x) dx dy.

In words: can find probabilities either using joint density or using a

marginal and a conditional density. Can use whichever easier.

192


Independence of random variables

Recall this joint distribution:

u \ v   0     1     2
0       1/3   1/6   1/6
1       1/6   1/12  1/12
Sum     1/2   1/4   1/4

Conditional distribution of V same given U = 0 and given U = 1.

Also same as marginal distribution of V . Knowing U says nothing

about V .

(Also, conditional dist. of U same for all V and same as marginal for

U .)

193

Suggests definition: random variables independent if conditional

distribution always same, and always same as marginal.

Mathematics: X, Y independent if

pY(y) = pY|X(y|x) = pX,Y(x, y) / pX(x)

so that

pX,Y(x, y) = pX(x) pY(y).

This is usually easiest check:

• if pX,Y(x, y) = pX(x) pY(y) for all x, y, then X, Y independent.

• if pX,Y(x, y) ≠ pX(x) pY(y) for any one (x, y) pair, then X, Y not independent.

194

For example above: P(U = 0) = 2/3, P(U = 1) = 1/3; P(V = 0) = 1/2, P(V = 1) = P(V = 2) = 1/4. Also,

P(U = 0)P(V = 0) = (2/3) · (1/2) = 1/3 = P(U = 0, V = 0).

Repeat for all u and v: proves independence.

195

Compare this joint distribution:

x \ y   0     1
0       0     1/3
1       2/3   0

Now,

P(X = 0)P(Y = 0) = (1/3) · (2/3) = 2/9

and P(X = 0, Y = 0) = 0 ≠ 2/9. One calculation shows X, Y not independent.

196


Independence of continuous random variables

As usual, turn probability into density. If

fX,Y (x, y) = fX(x)fY (y)

for all x, y, then continuous random variables X,Y independent. If

it fails for any (x, y) pair, not independent.

Example: suppose fX(x) = 2x² + 1/3, fY(y) = (4/3)y + 2y⁵, fX,Y(x, y) = 4x²y + 2y⁵ for 0 ≤ x, y ≤ 1. Then

fX(x)fY(y) = ( 2x² + 1/3 ) ( (4/3)y + 2y⁵ )

which cannot be simplified to fX,Y(x, y). So X, Y not independent.

197

Order statistics

Suppose that X1, X2, . . . , Xn all, independently, have same

distribution (a sample from distribution). Suppose common cdf

FX(x).

For example: take 20 people, give each IQ test. Without knowing

about individuals, use same distribution for each. What might

highest score in sample be?

Idea: more people sampled, higher the highest score could be (get

more chances to see a very high score).

198

Let M = max(X1, X2, . . . , Xn). Then

P(M ≤ m) = P(X1 ≤ m, X2 ≤ m, . . . , Xn ≤ m)
= P(X1 ≤ m)P(X2 ≤ m) · · · P(Xn ≤ m)
= [FX(m)]ⁿ.

If X continuous, differentiate to get density.

Example: each Xi ∼ Uniform[0, 1]. Then FX(x) = x, so P(M ≤ m) = mⁿ.

If n = 5, P(M ≤ 0.9) = 0.9⁵ = 0.59; if n = 20, P(M ≤ 0.9) = 0.9²⁰ = 0.1216, much smaller. That is, with more observations, the maximum is likely to be higher (less likely to be low).

199

Similar idea for minimum: let K = min(X1, X2, . . . , Xn). Then

P(K ≤ k) = 1 − P(K > k)
= 1 − P(X1 > k, X2 > k, . . . , Xn > k)
= 1 − P(X1 > k)P(X2 > k) · · · P(Xn > k)
= 1 − (1 − FX(k))ⁿ.

Example: if n = 10, Xi ∼ Uniform[0, 1], then

P(K ≤ 0.2) = 1 − (1 − 0.2)¹⁰ = 0.8926.

200
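(Aside: both formulas are easy to check by simulation; a sketch assuming Python with numpy.)

    import numpy as np

    rng = np.random.default_rng(0)

    # maximum of n = 20 Uniform[0, 1] values, repeated 100000 times
    m = rng.uniform(size=(100_000, 20)).max(axis=1)
    print(np.mean(m <= 0.9))  # near 0.9**20 = 0.1216

    # minimum of n = 10 Uniform[0, 1] values
    k = rng.uniform(size=(100_000, 10)).min(axis=1)
    print(np.mean(k <= 0.2))  # near 1 - 0.8**10 = 0.8926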


Summary

• Two r. v. X and Y might be related; express using joint

distribution.

• if X,Y discrete, use joint probability function (prob. of every

combination of values for X and Y ).

• if X,Y continuous, use joint density function.

• Joint probability/density function sum/integrate to 1 over both x

and y.

• Bivariate normal distribution: X,Y individually normal but

possibly related, with correlation ρ.

• Prob. from joint density: integrate over both x and y.

201

• Marginal distribution of Y is that of Y “averaged” over X

(sum/integrate); dist. of Y in ignorance of X .

• Conditional distribution of Y given X is that of Y if X known.

• If marginal and conditional distributions of Y are same, then

knowing X has no effect on Y : X and Y independent.

• Prove independence by showing that joint prob. fn. / density at

(x, y) equal to product of marginal prob. fn. / density for all x

and y. Failure anywhere means not independent.

• Order statistics: can find CDF of max or min of sample from a

distribution.

202

Simulating probability distributions

So far, considered mathematical properties of distributions:

probabilities, densities, cdf’s etc. But some distributions difficult to

understand or use.

Generate random values from distribution:

• approximation of difficult-to-calculate quantities

• simulation of complex systems

• generating potential solutions for difficult problems

• random choices for quizzes, computer games

• understanding behaviour of samples (chapter 4)

203

Pseudo-random numbers

In practice, don’t get actual random numbers, but pseudo-random

numbers. These follow recipe, but look random. (Paradox?)

Not so bad, because crucial feature: unpredictable – cannot easily

say what comes next.

Typical method: multiplicative congruential generator . Start with

initial “seed” value R0, then, for n = 0, 1, . . .:

Rn+1 = 106Rn + 1283 (mod 6075)

(“take remainder on division by 6075”).

204


Eg. start with R0 = 1001:

R1 = 106(1001) + 1283 = 107389 (mod 6075) = 4114

R2 = 106(4114) + 1283 = 437367 (mod 6075) = 6042

R3 = 106(6042) + 1283 = 641735 (mod 6075) = 3860

and so on, with 0 ≤ Ri < 6075.

Gives up to 6075 different random integers before repeating itself.

Suitable choice of constants gives long “period” and unpredictable

sequence. (Number theory.)

In practice, use much larger constants – get many more possible

random numbers.

205
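(Aside: the generator is a one-liner in any language; a Python sketch reproducing the three values above.)

    def mcg(r, n, a=106, c=1283, m=6075):
        """Congruential generator from the slides: R(n+1) = (a*R(n) + c) mod m."""
        out = []
        for _ in range(n):
            r = (a * r + c) % m
            out.append(r)
        return out

    print(mcg(1001, 3))  # [4114, 6042, 3860]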

Continuous uniform on [0, 1]

To get (pseudo-) random values from Uniform[0, 1], take

pseudo-random integers and divide by maximum. Result has

approx. uniform distribution.

With generator above, max value is 6075, so random uniform values

are 4114/6075 = 0.677, 6042/6075 = 0.995,

3860/6075 = 0.635. (Only 6075 possible values, so only 3 or so

digits trustworthy.)

“Random numbers” in calculators, Excel etc. of this kind.

Random Uniform[0, 1] values are used as building block for

random values from other distributions. Eg. random

Y ∼ Uniform[0, b]: multiply a random Uniform[0, 1] by b.

206

Bernoulli distribution

Suppose we want to simulate X ∼ Bernoulli(0.4): single trial,

prob. 0.4 of success.

Take single random uniform U . If U ≤ 0.4, take X = 1 (success),

otherwise take X = 0 (failure).

Works because U ≤ 0.4 about 0.4 of the time, so will get

successes about 0.4 of the time (long run).

In general, for X ∼ Bernoulli(θ), take X = 1 if U ≤ θ, 0

otherwise.

207

Binomial and geometric distributions

If Y ∼ Binomial(n, θ), Y = X1 + X2 + · · · + Xn where

Xi ∼ Bernoulli(θ). So just generate n random Bernoullis and

add them up.

Similarly, if Z ∼ Geometric(θ), Z is number of failures (in

Bernoulli trials) before 1st success. So get random value of Z like

this:

1. set Z = 0

2. generate U from Uniform[0, 1]

3. if U ≤ θ, stop with current Z

4. otherwise, add 1 to Z and return to step 2.

208
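(Aside: steps 1–4 translate directly into code; a sketch using Python’s standard random module.)

    import random

    random.seed(2)

    def sim_geometric(theta):
        """Count failures before the first success."""
        z = 0
        while random.random() > theta:  # U > theta: a failure, add 1 to Z
            z += 1
        return z

    draws = [sim_geometric(0.25) for _ in range(100_000)]
    print(sum(draws) / len(draws))  # near (1 - 0.25)/0.25 = 3 (the mean; see later)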


Inverse-CDF method

Cdf F (x) = P (X ≤ x) defined for all x.

Also, in set of possible X-values (where f(x) > 0), F (x)

invertible: for any p, exactly one x where F (x) = p.

Example: X ∼ Exponential(λ). Then F(x) = 1 − e^{−λx}. For x > 0, write p = F(x), and solve for x to get

x = −(1/λ) ln(1 − p).

Then generate a random p from Uniform[0, 1], and put it in the

formula to get a random X .

209

For instance, if λ = 2, might have p = 0.7 and hence random X is −(1/2) ln(1 − 0.7) = 0.602.

Why does this work in general?

Let Y be any random variable; let F (y) = P (Y ≤ y) be cdf of Y .

Define random variable W = F (Y ). Then

P(W ≤ w) = P(F(Y) ≤ w) = P(Y ≤ F⁻¹(w)) = F{F⁻¹(w)} = w.

That is, W ∼ Uniform[0, 1] whatever the distribution of Y .

210

So: to simulate Y, simulate W, then use relationship Y = F⁻¹(W) to simulate Y (by using simulated uniform in place of W).

This was done above for exponential. Called inverse-CDF method .

211
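(Aside: for the exponential this gives a two-line simulator; a Python sketch.)

    import math, random

    random.seed(3)

    def sim_exponential(lam):
        """Inverse-CDF method: x = -(1/lam) ln(1 - p), p ~ Uniform[0, 1]."""
        return -math.log(1 - random.random()) / lam

    draws = [sim_exponential(2) for _ in range(100_000)]
    print(sum(draws) / len(draws))  # near 0.5, the mean of Exponential(2)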

Also works for discrete. Example: Poisson(0.7) has this cdf:

x 0 1 2 3 4

P (X ≤ x) 0.497 0.844 0.966 0.994 0.999

Procedure: get random U ∼ Uniform[0, 1]. If U ≤ 0.497, take

random X = 0; else if U ≤ 0.844, take X = 1, . . . , else if

U > 0.999, take X = 5.

(Higher values possible, but very unlikely; for more accuracy use

more digits.)

212


Normal distribution

Difficult to simulate from (cannot invert cdf).

But consider X, Y with bivariate standard normal distribution, correlation 0. Joint density is

fX,Y(x, y) = (1/2π) exp{ −(x² + y²)/2 }.

Thinking of (x, y) as point in R², note that density depends only on distance from origin (r² = x² + y²), not on angle.

So generate random (x, y) pair by generating random angle

θ ∼ Uniform[0, 2π], random distance, separately.

(details: 2-variable transformation using Jacobian determinant.)

213

Density function for distance R is

fR(r) = re^{−r²/2}

and cdf is

FR(r) = ∫_0^r te^{−t²/2} dt = 1 − e^{−r²/2}

(eg. use substitution u = t²/2, du = t dt).

FR(r) invertible; let p = FR(r), solve for r to get

r = √(−2 ln(1 − p)).

Get random R by taking U ∼ Uniform[0, 1], using for p above.

214

Finally, convert random R, θ to (X, Y) using polar coordinate formulas

X = R cos θ;  Y = R sin θ.

Example: suppose random θ = 1.8 (radians), U = 0.3. Then R = √(−2 ln(1 − 0.3)) = 0.8446. So

X = 0.8446 cos 1.8 = −0.19;  Y = 0.8446 sin 1.8 = 0.82.

215
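(Aside: the whole recipe in code; a Python sketch reproducing the worked example.)

    import math

    def polar_normal_pair(theta, u):
        """Turn a random angle theta and a uniform u into a standard normal pair."""
        r = math.sqrt(-2 * math.log(1 - u))
        return r * math.cos(theta), r * math.sin(theta)

    x, y = polar_normal_pair(1.8, 0.3)
    print(round(x, 2), round(y, 2))  # -0.19 0.82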

Rejection methods

Inverse-CDF method doesn’t always work – cdf can be too complicated to invert. Example: X ∼ Gamma(3, 1), with density function

f(x) = (x²/2) e^{−x}.

This has maximum 2e^{−2} = 0.2707 at x = 2. Density “small” beyond x = 10.

216


Idea: sample random point (X, Y) in rectangle enclosing f(x), with 0 ≤ X ≤ 10, 0 ≤ Y ≤ 2e^{−2} (using uniform distribution):

• if point below density function (Y ≤ f(X)), take X as random

value from distribution

• otherwise, reject (X,Y ) pair and try again.

Chance of X-value being accepted proportional to density f(X):

when value more likely in distribution, more likely to be accepted.

217

Example:

X 7.3 1.0 2.7 1.7 9.4 5.5

Y 0.206 0.130 0.023 0.256 0.197 0.203

f(X) 0.018 0.184 0.245 0.264 0.004 0.062

reject y n n n y y

Values 7.3, 9.4, 5.5 rejected; 1.0, 2.7, 1.7 random values from

Gamma(3, 1).

Needed 12 random uniforms to generate 3 random gammas.

218

Can be made more sophisticated. Let g(x) be density function that

is easy to sample from, such that f(x) ≤ cg(x) for all x (choose c). Above, g(x) = 1, c = 2e^{−2}.

Generate random value X from distribution with density g(x).

Generate random Y ∼ Uniform[0, cg(X)]. If Y ≤ f(X),

accept X ; otherwise, reject and try again.

Efficiency of rejection method greatest when cg(x) only slightly

greater than f(x); then, very little rejection.

219
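(Aside: the rectangle version of the rejection method for Gamma(3, 1), as a Python sketch.)

    import math, random

    random.seed(5)

    def f(x):
        """Gamma(3, 1) density."""
        return x**2 / 2 * math.exp(-x)

    def sim_gamma():
        """Sample (X, Y) uniformly in [0, 10] x [0, 2e^-2]; accept if under f."""
        while True:
            x = random.uniform(0, 10)
            y = random.uniform(0, 2 * math.exp(-2))
            if y <= f(x):
                return x

    draws = [sim_gamma() for _ in range(50_000)]
    print(sum(draws) / len(draws))  # near 3, the mean of Gamma(3, 1)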

Simulation in Minitab

Minitab can generate random values from many distributions (using

methods above or variations).

Basic procedure:

• Select Calc, Random Data

• Select desired distribution

• Fill in number of random values to generate

• Fill in (empty) column to store values

• Fill in parameters of distribution (if any)

• Click OK.

220


Examples: Uniform[0, 1], Bernoulli(0.4), Binomial(5, 0.4),

Exponential(2), Poisson(0.7), Normal(0, 1).

To generate random values from another distribution, generate

column of values from Uniform[0, 1], then use Calculator to create

desired values (p. 47–48 of manual).

Recall random values actually “pseudo-random”: starting at same

seed value gives same sequence of random values. Can set seed

value in Minitab (Calc, Set Base) to get reproducible random values.

221

Summary

• Simulation: generate “random” values from distribution.

• “Random” because actually obtained from formula, but gives

unpredictable results.

• Generate random Uniform[0, 1] by taking random integers,

divide by max.

• Use as building blocks for other distributions.

• Bernoulli, binomial, geometric: generate random trials using

random Uniform[0, 1] to generate success/failure.

222

• Inverse CDF: use random Uniform[0, 1] to get random F (x)

value; find x corresponding.

• Special method for normal distribution.

• Rejection method: pick random x, y; based on value of y,

decide whether to keep or reject x.

• Minitab.

223

Expectation

224


Introduction

Game: toss fair coin, win $2 for a head, lose $1 for a tail.

Amount you win is random variable W with P(W = 2) = P(W = −1) = 1/2.

Could win or lose on any one play, but (a) winning and losing equally

likely, (b) amount won greater than amount lost.

Would probably play this game given chance, because expect to win

in long run, on average over many plays, even though anything

possible.

225

Expected value of random variable is its long-run average. For W above, expect equal number of 2’s and −1’s, so expected value would be

E(W) = (2 + (−1))/2 = 1/2.

Another: suppose Y = 7 always (ie. P(Y = 7) = 1, P(Y = k) = 0 for k ≠ 7). Then E(Y) should be 7.

Another: roll 2 dice. Win $30 for double 6, lose $1 otherwise. Looks

good because potential win greater than potential loss, but win very

unlikely. How to balance? For winnings random variable V , what is

E(V )?

226

Expectation for discrete random variables

Define expected value (expectation) of random variable X:

E(X) = ∑_x x P(X = x),

“sum of value times probability”. Sum over all possible x.

Check for above examples:

E(W) = 2 · 1/2 + (−1) · 1/2 = 1/2
E(Y) = 7 · 1 = 7
E(V) = 30 · 1/36 + (−1) · 35/36 = −5/36

227

First 2 as expected.

For V, prob. of double 6 is 1/36, so chance of losing is 1 − 1/36. Even though prize large (win $30 for double 6), E(V) < 0, so would lose in long run: the win probability is small enough to outweigh the large prize.

Formula much easier than reasoning out – less thought!

Now suppose X ∼ Bernoulli(θ). What is E(X)?

X = 1 with prob θ, 0 with prob 1 − θ, so:

E(X) = 1 · θ + 0 · (1 − θ) = θ.

In long run, average X equal to success probability.

Makes sense (think of θ = 0 and θ = 1 as extreme cases).

228


Expectation for geometric and Poisson distributions

To find more complicated expectations, cleverness can be needed

to figure out sum.

Suppose Z ∼ Geometric(θ), so P(Z = k) = θ(1 − θ)^k. Then

E(Z) = ∑_{k=0}^{∞} k θ(1 − θ)^k = (1 − θ)/θ.

Method: write (1 − θ)E(Z) to look like E(Z) but with k − 1 in place of k, subtract.

Mean is odds against success: if failure 4 times more likely than success, on average get 4 failures before 1st success.

229

If X ∼ Poisson(λ), then

E(X) = ∑_{k=0}^{∞} k · e^{−λ}λ^k / k!.

Note that the k = 0 term is 0, so start sum at k = 1, then let l = k − 1 to get

E(X) = λ ∑_{l=0}^{∞} e^{−λ}λ^l / l!.

The sum is of all the probabilities from a Poisson distribution, so is 1. (Or, ∑_{l=0}^{∞} (λ^l/l!) is the Maclaurin series for e^λ.)

So for X ∼ Poisson(λ), E(X) = λ. Thus parameter λ in fact

mean.

230

St Petersburg Paradox

Game: toss fair coin, let Z be #tails before 1st head. Win 2^Z dollars. Thus for TTTH, win 2³ = $8. Expected winnings (fair price to pay to play)?

∑_{k=0}^{∞} 2^k · (1/2^k) · (1/2) = ∑_{k=0}^{∞} 1/2 = ∞.

How can this be? Only ever win finite amount.

Play game 10 times:

Z 0 1 0 0 3 0 3 0 6 1

Winnings 1 2 1 1 8 1 8 1 64 2

Mean winnings $8.90, larger than actual winnings 90% of time!

231

Problem is that any one big payoff completely dominates average,

and by playing game enough times, can make it very likely that a

very big payoff will occur.

If there is a maximum payoff, say $2^30, expectation finite ($15.50).

When random variable can be arbitrarily large, expectation may not

be finite. But can be finite – compare Poisson, where probabilities

decrease faster than values increase. Similarly, lotteries with very

big prizes still have expected winnings less than ticket price

(because chance of winning big prize small enough).

232


Summary

• Expectation is long-run average value of random variable.

• For discrete, is sum of value times probability.

• If Z ∼ Geometric(θ), E(Z) = (1 − θ)/θ.

• If X ∼ Poisson(λ), E(X) = λ.

• When r. v. can be arbitrarily large, expectation may not be finite

(St Petersburg), but can be (Poisson).

233

Utility and Kelly betting

In St Petersburg paradox, expectation didn’t tell story, because “fair

price” ought to be finite. Changing game by a little changed

expected winnings a lot.

Most bets look like this: win known $w if you win, lose $1 if you lose.

Suppose probability of winning is θ. Then expectation is

E = wθ + (−1)(1 − θ) = θ(w + 1) − 1

which is positive if θ > 1/(w + 1).

For instance, if w = 2, E > 0 if θ > 1/3. That is, if you believe your chance of winning is better than 1/3, you should bet because in long run you win more than you lose.

234

If bet more than $1, wins and losses increase in proportion: on bet

of $b, win $wb or lose $b.

Positive expectation seems to say “bet everything you have”: far too

risky for most! Always possibility of losing.

Idea: consider utility of money, not same as money itself. If you

only have $10, $1 is a lot of money (has great utility), but if you have

$1 million, $1 almost meaningless.

Utility of money varies between people, but could be proportional to

current fortune. Then, utility of money depends on log of $ amount.

235

Suppose we currently have $c, and want to choose b for bet above,

assuming all else known. Then fortune after the bet is F = c + bw

if we win (prob θ), F = c − b if we lose (prob 1 − θ). Utility idea:

choose b to maximize E(lnF ):

E(lnF ) = θ ln(c + bw) + (1 − θ) ln(c − b).

Take derivative (for b), set to 0:

dE(ln F)/db = wθ/(c + bw) + (−1)(1 − θ)/(c − b) = [ θw(c − b) − (1 − θ)(c + bw) ] / [ (c + bw)(c − b) ].

Zero when numerator zero; solve for b to get

b = c{θ(w + 1) − 1}/w = cE/w.

This is called the Kelly bet . (If negative, don’t bet anything!)

236


Examples, with c = 100:

• w = 9, θ = 1/8. E = θ(w + 1) − 1 = 0.25, so Kelly bet b = 100(0.25)/9 = $2.78.

• w = 1.5, θ = 1/2. E = 0.25 again; Kelly bet b = 100(0.25)/1.5 = $16.67.

Note: expected winnings same in both cases, but bet less when

w = 9: more risk because less likely to win.

In general, bet fraction of current fortune that is bigger when

expected winnings bigger and chance of winning bigger.

237

Expectation of functions of random variables

In St Petersburg problem above, random variable was number of tails Z, but winnings 2^Z. In effect, found that E(2^Z) was infinite. Method: sum values of 2^Z times probability.

Formally: let g(X) be some function of random variable X. Then

E(g(X)) = ∑_x g(x) P(X = x).

238

Linearity of expected values

Suppose we have two random variables X,Y . What is

E(X + Y )?

Go back to definition, bearing in mind that X, Y might be related, so have to use joint probability function:

E(X + Y) = ∑_x ∑_y (x + y) P(X = x, Y = y)
= ∑_x x P(X = x) + ∑_y y P(Y = y)
= E(X) + E(Y).

Details: expand out (x + y) in first sum, recognize (eg.) that ∑_y P(X = x, Y = y) = P(X = x) (marginal distribution).

239

Same logic shows that E(aX + bY ) = aE(X) + bE(Y ).

Likewise,

E(X1 + X2 + · · · + Xn) = E(X1) + E(X2) + · · · + E(Xn).

Also, if Y = 1 always, we get E(aX + b) = aE(X) + b.

240


Expectation for binomial distribution

If Y ∼ Binomial(n, θ), then Y actually sum of Bernoullis:

Y = X1 + X2 + · · · + Xn, where Xi ∼ Bernoulli(θ).

Know that E(Xi) = θ, so (by result on previous page)

E(Y ) = θ + θ + · · · + θ = nθ.

Makes sense: eg. if you succeed on one-third of trials on average (θ = 1/3), and you have n = 30 trials, you’d expect 10 successes, and nθ = 10.

241

Independence and E(XY )

Since E(X + Y ) = E(X) + E(Y ) for all X and Y , tempting to

claim that E(XY ) = E(X)E(Y ). But is this true?

Consider this joint distribution:

        Y = 1   Y = 2   Total
X = 0   1/3     1/6     1/2
X = 1   1/4     1/4     1/2
Total   7/12    5/12    1

Using marginal distributions, E(X) = 1/2 and E(Y) = 17/12. What is E(XY)?

242

When X = 0, XY = 0 for all Y. So P(XY = 0) = 1/3 + 1/6 = 1/2. XY = 1 when X = 1, Y = 1, so P(XY = 1) = 1/4. Likewise, XY = 2 when X = 1, Y = 2, so P(XY = 2) = 1/4. Hence

E(XY) = 0 · 1/2 + 1 · 1/4 + 2 · 1/4 = 3/4.

But

E(X)E(Y) = (1/2) · (17/12) = 17/24 ≠ 3/4.

So E(XY) ≠ E(X)E(Y) in general.

243

But what if X, Y independent? Then

E(XY) = ∑_x ∑_y xy P(X = x) P(Y = y) = E(X)E(Y),

rearranging, because joint prob is product of marginals.

So, if X,Y independent, then E(XY ) = E(X)E(Y ), but not

necessarily otherwise.

See later (in “covariance”) that difference E(XY ) − E(X)E(Y )

measures extent of non-independence of X and Y .

244


Monotonicity of expectation

Suppose X,Y discrete random variables such that X ≤ Y . (That

is, for any event giving X = x and Y = y, x ≤ y always.

Example: roll 2 dice, let X be score on 1st die, Y be total score on 2

dice.)

How do E(X), E(Y ) compare?

Idea: let Z = Y − X. Then Z ≥ 0, discrete, and E(Z) = ∑_{z≥0} z P(Z = z). All terms in sum positive or 0, so E(Z) ≥ 0. But E(Z) = E(Y − X) = E(Y) − E(X). Hence E(Y) − E(X) ≥ 0.

Conclusion: if X ≤ Y , then E(X) ≤ E(Y ).

245

Expectation for continuous random

variables

Can’t use formula

E(X) = ∑_x x P(X = x)

because probability of particular value not meaningful for continuous

X .

Standard procedure: replace probability by density function, replace

sum by integral.

246

That is, if X continuous random variable, define

E(X) = ∫_{−∞}^{∞} x f(x) dx.

In integral, replace infinite limits by actual upper and lower limits.

247

Examples

Suppose X ∼ Uniform[0, 1], so f(x) = 1, 0 ≤ x ≤ 1. Then

E(X) = ∫_0^1 x · 1 dx = [ (1/2)x² ]_0^1 = 1/2.

As you would have guessed.

Suppose W ∼ Exponential(λ). Then

E(W) = ∫_0^∞ w λe^{−λw} dw.

Integrate by parts with u = w, v′ = λe^{−λw}: E(W) = 1/λ.

If W represents time between events, E(W ) in units of time, so λ

in units of 1 / time: a rate, number of events per unit time.

248


Suppose Z ∼ N(0, 1), so f(z) = (1/√(2π)) e^{−z²/2}. Then

E(Z) = ∫_{−∞}^{∞} (1/√(2π)) z e^{−z²/2} dz.

Replacing z by −z gives negative of function in integral, ie. the integrand is an odd function. Hence integral is 0, so E(Z) = 0. (Alternative: substitute u = z²/2.)

249

As for discrete, expectation may not be finite.

f(x) = 1/x², x ≥ 1 is a proper density, but for random variable X with this distribution:

E(X) = ∫_1^∞ x · (1/x²) dx = ∫_1^∞ (1/x) dx = [ln x]_1^∞ = ∞.

Problem: though density decreases as x increases, does not do so

fast enough to make E(X) integral converge.

250

Properties of expectation for continuous random

variables

These are same as for discrete variables. Proofs use integrals and

densities not sums, but otherwise very similar. Suppose X has

density fX(x) and X,Y have joint density fX,Y (x, y):

• E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx

• E(h(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy.

• E(aX + bY ) = aE(X) + bE(Y )

• If X,Y independent, then E(XY ) = E(X)E(Y )

• If X ≤ Y , then E(X) ≤ E(Y ).

251

Expectations for general uniform and normal

distributions

Suppose X ∼ Uniform[a, b]. Then U = (X − a)/(b − a) ∼ Uniform[0, 1], so E(U) = 1/2.

Write in terms of X: X = a + (b − a)U, so E(X) = a + (b − a)E(U) = (a + b)/2. Again as expected.

Now suppose X ∼ Normal(µ, σ²). Then Z = (X − µ)/σ ∼ N(0, 1). Write X = µ + σZ; then

E(X) = µ + σE(Z) = µ + σ(0) = µ.

That is, parameter µ in normal distribution is the mean.

252


Summary

• Utility of money may be proportional to current fortune: depends

on log of fortune.

• With current fortune c, bet with winnings w, expected value E,

utility maximized by betting amount cE/w if E positive (Kelly

bet).

• Function g(X) of random variable X: E(g(X)) = ∑_x g(x) P(X = x).

• Linearity: E(X + Y ) = E(X) + E(Y ) always; also

E(aX + b) = aE(X) + b.

• If Y ∼ Binomial(n, θ), E(Y ) = nθ.

253

• If X,Y independent, E(XY ) = E(X)E(Y ), but not

necessarily otherwise.

• If X ≤ Y , then E(X) ≤ E(Y ).

• For continuous X, E(X) = ∫_{−∞}^{∞} x fX(x) dx.

• If X ∼ Uniform[0, 1], E(X) = 1/2.

• If W ∼ Exponential(λ), E(W ) = 1/λ.

• Expectation for discrete and continuous distributions has same

properties.

• If X ∼ Uniform[a, b], E(X) = (a + b)/2.

• If X ∼ N(µ, σ2), E(X) = µ.

254

Variance, covariance and correlation

Compare random variables:

Z = 10 with prob 1; Y = 5, 15 each with prob 1/2.

E(Z) = E(Y) = 10, but Y further from mean than Z.

Expectation only gives long-run average of random variable, not how much higher/lower than average it could be. For this, use variance:

Var(X) = E[(X − µX)²], µX = E(X).

255

For discrete X, Var(X) = ∑_x (x − µX)² P(X = x). So:

Var(Z) = (10 − 10)² · 1 = 0;
Var(Y) = (5 − 10)² · 1/2 + (15 − 10)² · 1/2 = 25.

Here, Var(Y ) > Var(Z) because Y tends to be further from its

mean than Z does.

(Here, Y always further from mean than Z . But in general,

Var(Y ) > Var(Z) means Y likely to be further from mean than

Z .)

256


More about variance

Because (X − µX)² ≥ 0, Var(X) ≥ 0 for all random variables X.

Var(X) = 0 only if X does not vary (compare Z). No upper limit

on variance; larger variance means more unpredictable (can get

further from mean).

Why square? Cannot just omit: E(X − µX) = E(X) − µX = 0

always. Absolute value E(|X − µX |) possible, but hard to work

with (not differentiable).

257

Standard deviation

If random variable X in metres, Var(X) in metres-squared. For

interpretation, suggests using square root of variance:

SD(X) = √Var(X)

which would be in metres. Called standard deviation of X .

SD easier for interpretation, variance easier for algebra.

258

Variance of Bernoulli

If X ∼ Bernoulli(θ), E(X) = θ, and

Var(X) = ∑_x (x − θ)² P(X = x)
= (1 − θ)²θ + (0 − θ)²(1 − θ)
= θ(1 − θ)(1 − θ + θ) = θ(1 − θ).

This is 0 if θ = 0, 1 (when results completely predictable) and maximum, 1/4, when θ = 1/2.

259

Useful properties of variance

Var(aX + b) = a² Var(X).

Because variance in squared units, changing X eg. from metres to feet multiplies variance not by 3.3 but by that squared.

Also, adding b changes mean of X, but doesn’t change how spread out distribution is (shifts left/right).

Var(X) = E(X²) − µX².

Useful result for finding variances in practice, since E(X²) not usually too hard.

260


Proofs: use definition of variance as expectation, then rules of

expectation.

Bernoulli revisited: E(X²) = 1²·θ + 0²·(1 − θ) = θ, so Var(X) = θ − θ² = θ(1 − θ) as before.

261

Variance of exponential distribution

For continuous distributions, find E(X²) or variance using integral.

W ∼ Exponential(λ): already know E(W) = 1/λ. Find Var(W) by first finding E(W²), using integration by parts:

E(W²) = ∫_0^∞ w² λe^{−λw} dw = [ −w²e^{−λw} ]_0^∞ + (2/λ) ∫_0^∞ wλe^{−λw} dw.

Square brackets 0; integral is E(W) = 1/λ. Hence E(W²) = (2/λ)(1/λ) = 2/λ², and

Var(W) = 2/λ² − (1/λ)² = 1/λ².

For exponential distribution, variance is square of mean.

262

Variance of normal random variable

Suppose Z ∼ N(0, 1). Know that E(Z) = 0, so Var(Z) = E(Z²) − 0² = E(Z²). Thus

Var(Z) = ∫_{−∞}^{∞} z² (1/√(2π)) e^{−z²/2} dz.

To tackle by parts: let u = z/√(2π), v′ = ze^{−z²/2}. v′ has antiderivative v = −e^{−z²/2}. Gives

263

Var(Z) = [ −(z/√(2π)) e^{−z²/2} ]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

Square bracket 0 (e^{−z²/2} → 0 very fast); integral that of density of Z, so 1. Hence Var(Z) = 1.

Suppose now X ∼ N(µ, σ²). Then Z = (X − µ)/σ, so X = µ + σZ. So Var(X) = σ² Var(Z) = σ². That is, parameter σ² in normal distribution is variance.

264


Summary

• Variance says how far r. v. is from its expectation: Var(X) = E[(X − µX)²].

• 0 ≤ Var(X), but no upper limit.

• Standard deviation SD(X) = √Var(X) in same units as X.

• If X ∼ Bernoulli(θ), Var(X) = θ(1 − θ).

• Var(aX + b) = a² Var(X); Var(X) = E(X²) − µX².

• If W ∼ Exponential(λ), Var(W) = 1/λ².

• If X ∼ N(µ, σ²), Var(X) = σ².

265

Covariance

Consider discrete joint distribution:

        Y = 1   Y = 2   sum
X = 0   0.4     0.2     0.6
X = 1   0.1     0.3     0.4
sum     0.5     0.5

If X = 0, Y more likely to be small; if X = 1, Y more likely to be large. X, Y vary together.

Idea: covariance Cov(X, Y) = E[(X − µX)(Y − µY)].

266

Here, µX = E(X) = 0.4, µY = E(Y) = 1.5, so take all combinations of (X − µX, Y − µY) values and their probs:

Cov(X, Y)
= (0 − 0.4)(1 − 1.5)(0.4) + (0 − 0.4)(2 − 1.5)(0.2)
+ (1 − 0.4)(1 − 1.5)(0.1) + (1 − 0.4)(2 − 1.5)(0.3)
= 0.08 − 0.04 − 0.03 + 0.09 = 0.10.

Result positive. (X,Y ) combinations where (X − µX)(Y − µY )

positive outweigh those where negative. That is, when X large, Y

more likely to be large as well (and small with small).

Covariance can be negative: then large X goes with small Y and

vice versa. Covariance 0: no trend.

267

Calculating covariances

Useful formula:

Cov(X,Y ) = E(XY ) − E(X)E(Y ).

Proof: definition of covariance, properties of expectation.

Previous example revisited:

E(XY ) = (0)(1)(0.4)+(0)(2)(0.2)+(1)(1)(0.1)+(1)(2)(0.3) = 0.7;

Cov(X,Y ) = 0.7 − (0.4)(1.5) = 0.1.

As with corresponding variance formula, useful for calculations.

268
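(Aside: for a small joint table the calculation is a few lines of code; a Python sketch for the example above.)

    # joint probabilities p(x, y) from the table
    probs = {(0, 1): 0.4, (0, 2): 0.2, (1, 1): 0.1, (1, 2): 0.3}

    ex = sum(x * p for (x, y), p in probs.items())
    ey = sum(y * p for (x, y), p in probs.items())
    exy = sum(x * y * p for (x, y), p in probs.items())
    print(round(ex, 2), round(ey, 2), round(exy - ex * ey, 2))  # 0.4 1.5 0.1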


Covariance and independence

If X,Y independent, then E(XY ) = E(X)E(Y ), so

Cov(X,Y ) = E(XY ) − E(X)E(Y ) = 0.

But covariance could be 0 without independence. Example: (X, Y) = (−1, 1), (0, 0), (1, 1), each prob 1/3. E(X) = 0, E(Y) = 2/3, E(XY) = (−1)(1/3) + (0)(1/3) + (1)(1/3) = 0, so Cov(X, Y) = 0 − (0)(2/3) = 0. But X, Y not independent: given X, know Y exactly.

Relationship between X,Y not a trend: as X increases, Y

decreases then increases. No general statement about Y

large/small as X increases.

Fact: if X,Y bivariate normal, covariance 0 implies independence.

269

Variance of sum

Previously found that E(X + Y ) = E(X) + E(Y ) for all X,Y .

Corresponding formula for variances?

Derive formula for Var(X + Y ) by writing as expectation,

expanding out square, recognizing terms:

Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ).

Logic: if Cov(X,Y ) > 0, X,Y big/small together, sum could be

very big/small, variance large. If Cov(X,Y ) < 0, large X

compensates small Y and vice versa, sum of moderate size,

variance small.

If X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).

270

Variance of binomial distribution

Suppose X ∼ Binomial(n, θ). Then can write

X = Y1 + Y2 + · · · + Yn,

where Yi ∼ Bernoulli(θ) independently. So

Var(X) = Var(Y1) + Var(Y2) + · · · + Var(Yn)

= θ(1 − θ) + θ(1 − θ) + · · · + θ(1 − θ)

= nθ(1 − θ).

Variance increases as n increases (fixed θ) because range of

possible #successes becomes wider.

271

Correlation

Covariance hard to interpret. Eg. size of positive covariance says little about X, Y relationship.

Suppose X height (metres), Y weight (kg). Units of covariance m

× kg. Measure height in inches, weight in lbs: covariance in

different units.

Try for scale-free quantity. Covariance measures how X,Y vary

together: suggests use of variances. Var(X) m2, Var(Y ) kg2, so

right scaling is by sq root of each. Define correlation :

Corr(X,Y ) =Cov(X,Y )

Var(X) Var(Y ).

272


Example: (X, Y) = (0, 1), (1, 3), each prob 1/2.

E(X) = 0.5, E(Y) = 2; XY = 0, 3 each prob 1/2 so Cov(X, Y) = 3/2 − (0.5)(2) = 1/2.

Also, Var(X) = 1/4, Var(Y) = 1, so

Corr(X, Y) = (1/2) / √((1/4)(1)) = 1.

When X larger (1 vs. 0), Y also larger (3 vs. 1) for certain: a perfect trend. So this should be largest possible correlation.

(Proof later: Cauchy-Schwartz inequality.)

273

More about correlation

Smallest possible correlation is −1, when larger X always goes with smaller Y (eg. (X, Y) = (0, 1), (1, −3), each prob 1/2).

If X,Y independent, covariance 0, so correlation 0 also.

In-between values represent in-between trends. Eg.

Corr(X,Y ) = 0.5: larger X with larger Y most of the time, but

not always.

Correlation actually measures extent of linear relationship between

random variables. X,Y in example related by Y = 2X + 1.

Perfect nonlinear relationship won’t give correlation ±1.

274

Viewing correlation by simulation

Useful to have sense of what correlation “looks like”.

Generate random normals with required correlation, plot.

Suppose X, Y ∼ N(0, 1) independently. Then use X and Z = αX + Y for suitable choice of α: correlated if α ≠ 0 because X in both. Can show Cov(X, αX + Y) = α and Corr(X, αX + Y) = α/√(1 + α²).

Choose α to get desired correlation ρ: α = ±ρ/√(1 − ρ²).

275
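(Aside: a sketch in Python with numpy of the simulation behind plots like those below; the original slides used statistical software for the figures.)

    import numpy as np

    rng = np.random.default_rng(6)
    rho = 0.95
    alpha = rho / np.sqrt(1 - rho**2)

    x = rng.standard_normal(500)
    z = alpha * x + rng.standard_normal(500)  # Corr(X, Z) = rho by construction
    print(np.corrcoef(x, z)[0, 1])            # close to 0.95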

Correlation 0.95:

[scatterplot of z against x: tight increasing linear trend]

276


Correlation -0.8:

[scatterplot of z against x: clear decreasing trend]

277

Correlation 0.5:

[scatterplot of z against x: moderate increasing trend with scatter]

278

Correlation -0.2:

[scatterplot of z against x: weak decreasing trend, mostly scatter]

279

Summary

• Covariance Cov(X, Y) = E[(X − µX)(Y − µY)] = E(XY) − E(X)E(Y).

• Can be + or −; + means larger X tends to go with larger Y .

• If X,Y independent, then Cov(X,Y ) = 0.

• If X,Y bivariate normal and Cov(X,Y ) = 0, then X,Y

independent.

• Can be other X,Y with covariance 0 but not independent.

280


• Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ); if

X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).

• If X ∼ Binomial(n, θ), then Var(X) = nθ(1 − θ).

• Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)); between −1 and 1.

• Can use simulation to get picture of different-sized correlations.

281

Moment-generating functions

Means and variances (and eg. E(X³)) can be messy: each one needs an integral (sum) to be solved. Would be nice to have function that gives E(X^k) more easily than by integration (summing).

Consider mX(s) = E(e^{sX}). Function of s.

Maclaurin series for exp function:

mX(s) = E(1) + sE(X) + (s²/2!)E(X²) + (s³/3!)E(X³) + · · · .

282

Differentiate both sides (as function of s):

m′X(s) = E(X) + sE(X²) + (s²/2!)E(X³) + · · ·

Putting s = 0 gives m′X(0) = E(X). Differentiate again:

m″X(s) = E(X²) + sE(X³) + · · ·

so that m″X(0) = E(X²).

By same process, find E(X^k) by differentiating mX(s) k times, and setting s = 0. Differentiating easier than integrating!

E(X^k) called k-th moment of distribution of X; function mX(s), used to get moments, called moment generating function for X.

283

If X discrete,

mX(s) = E(e^{sX}) = ∑_x e^{sx} P(X = x)

and if X continuous,

mX(s) = E(e^{sX}) = ∫_{−∞}^{∞} e^{sx} fX(x) dx.

284


Examples of moment generating functions

Bernoulli is easiest of all:

mX(s) = e^{s·0}P(X = 0) + e^{s·1}P(X = 1) = 1 − θ + θe^s.

So:

m′X(s) = θe^s ⇒ E(X) = θ
m″X(s) = θe^s ⇒ E(X²) = θ

and indeed E(X^k) = θ for all k. Also, Var(X) = E(X²) − [E(X)]² = θ − θ² = θ(1 − θ).

285

Now try X ∼ Exponential(λ), continuous:

mX(s) = E(e^{sX}) = ∫_0^∞ e^{sx} λe^{−λx} dx = λ(λ − s)⁻¹

after some algebra. (Requires s < λ.)

m′X(s) = λ(λ − s)⁻², so E(X) = m′X(0) = 1/λ.

m″X(s) = 2λ(λ − s)⁻³, so E(X²) = m″X(0) = 2/λ². Hence

Var(X) = 2/λ² − (1/λ)² = 1/λ².

286
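(Aside: the differentiation is easy to automate; a sketch with the sympy package checking the exponential results above.)

    import sympy as sp

    s, lam = sp.symbols('s lambda', positive=True)
    m = lam / (lam - s)                  # mgf of Exponential(lambda), s < lambda

    ex = sp.diff(m, s).subs(s, 0)        # E(X)
    ex2 = sp.diff(m, s, 2).subs(s, 0)    # E(X^2)
    print(ex, sp.simplify(ex2 - ex**2))  # 1/lambda  1/lambda**2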

More about moment-generating functions

If X ∼ Poisson(λ), then

mX(s) = e^{λ(e^s − 1)}.

If X ∼ N(0, 1), then

mX(s) = e^{s²/2}.

Facts:

• mX+Y(s) = mX(s)mY(s) when X, Y independent. (Mgf of sum is product of moment-generating functions.)

• maX+b(s) = e^{bs} mX(as). (Mgf of linear function related to mgf of original random variable.)

287

Proofs from definition.

First result very useful: distribution of sum very difficult to find, but

can get moments for sum much more easily.

If X ∼ Binomial(n, θ), then X = Y1 + Y2 + · · · + Yn where each Yi ∼ Bernoulli(θ) independently. Hence

mX(s) = [mYi(s)]^n = (1 − θ + θe^s)^n.

If X ∼ N(µ, σ²), X = µ + σZ where Z ∼ N(0, 1). Thus

mX(s) = mσZ+µ(s) = e^{µs} mZ(σs) = e^{µs + σ²s²/2}.

288


Using mgfs to recognize distributions

Important result, called uniqueness theorem . Suppose X has mgf

finite for −s0 < s < s0; suppose mX(s) = mY (s) for

−s0 < s < s0. Then X , Y have same distribution.

In other words: if mgf of X is that of known distribution, then X

must have that distribution.

Example: X, Y ∼ Poisson(λ), independent. X + Y has mgf

mX+Y(s) = {e^{λ(e^s−1)}}² = e^{2λ(e^s−1)}.

This is mgf of Poisson(2λ), so X + Y ∼ Poisson(2λ).

289

Summary

• Moment generating function mX(s) = E(esX) is function of s

(sum for discrete, integral for continuous).

• Get E(Xk) (k-th moment) by differentiating mX(s) k times

(wrt s), put s = 0.

• If X ∼ Bernoulli(θ), mX(s) = 1 − θ + θe^s.

• If X ∼ Exponential(λ), mX(s) = λ/(λ − s).

• If X ∼ Poisson(λ), mX(s) = e^{λ(e^s−1)}.

• If Z ∼ N(0, 1), mZ(s) = e^{s²/2}.

290

• mX+Y(s) = mX(s)mY(s) (X, Y independent).

• maX+b(s) = e^{bs} mX(as).

• Last two results lead to mgf’s for binomial and normal.

• Uniqueness theorem: if X and Y have same mgfs, have same

distribution.

291

Conditional Expectation

Consider this joint distribution (Ex. 3.5.2):

        X = 5   X = 8   sum
Y = 0   1/7     3/7     4/7
Y = 3   1/7     0       1/7
Y = 4   1/7     1/7     2/7
sum     3/7     4/7

X,Y related: if Y = 0, then X more likely to be 8.

292


Suppose Y = 3. Then P(X = 5|Y = 3) = (1/7)/(1/7) = 1, P(X = 8|Y = 3) = 0/(1/7) = 0. If Y = 3, then X certain to be 5, so E(X|Y = 3) = 5.

Now suppose Y = 4:

P(X = 5|Y = 4) = (1/7) / (1/7 + 1/7) = 1/2 = P(X = 8|Y = 4).

If Y = 4, then average X is E(X|Y = 4) = 5 · 1/2 + 8 · 1/2 = 6.5.

Likewise, E(X|Y = 0) = 7.25.

293

These expectations from conditional distribution called conditional

expectations . E(X|Y = y) varies from 5 to 7.25 depending on

value of Y ; “on average, X depends on Y ”.

In general, if X,Y related, then mean of X depends on Y .

Calculate conditional distribution of X|Y , find X-expectation. This

is conditional expectation.

294

Conditional expectation: continuous case

Same principle: find expectation of conditional distribution. Now use

joint and marginal densities to find conditional density; then

integrate to get expectation.

Example: fX,Y(x, y) = 4x²y + 2y⁵, 0 ≤ x, y ≤ 1.

Conditional density fX|Y(x|y) = fX,Y(x, y)/fY(y). So first find marginal density fY(y) by integrating out x from joint density: fY(y) = (4/3)y + 2y⁵. Has no x. Hence

fX|Y(x|y) = (4x²y + 2y⁵) / ((4/3)y + 2y⁵).

295

Note: only x in numerator, so not so hard. Thus

E(X|Y = y) = ∫_0^1 x · (4x²y + 2y⁵) / ((4/3)y + 2y⁵) dx = (1 + y⁴) / (4/3 + 2y⁴).

Depends slightly on Y: E(X|Y = 0) = 0.75, E(X|Y = 0.5) = 0.729, E(X|Y = 1) = 0.6. As Y increases, X decreases, on average.

296


Conditional expectations as random variables

Without particular Y-value in mind, can define E(X|Y) by taking E(X|Y = y) and replacing y by Y. Above example:

E(X|Y) = (1 + Y⁴) / (4/3 + 2Y⁴).

This kind of conditional expectation is random variable (function of

random variable Y ).

297

As random variable, E(X|Y) must have expectation, E[E(X|Y)]. What is it? Directly, as function of y:

E[E(X|Y)] = ∫_0^1 E(X|Y = y) fY(y) dy = 2/3

(much cancellation). Now: marginal density of x is fX(x) = 2x² + 1/3 (integrate out y from joint density), so

E(X) = ∫_0^1 x (2x² + 1/3) dx = 2/3 = E[E(X|Y)].

Not a coincidence. Illustrates theorem of total expectation :

E[E(X|Y )] = E(X). In words: effect of varying Y is to change

E(X|Y ), but E[E(X|Y )] averages out these effects, leaving only

overall average of X .

298

Conditional variance

Conditional variance is variance of conditional distribution.

Return to previous discrete example:

        X = 5   X = 8   sum
Y = 0   1/7     3/7     4/7
Y = 3   1/7     0       1/7
Y = 4   1/7     1/7     2/7
sum     3/7     4/7

If Y = 3, X certain to be 5, so Var(X|Y = 3) = 0.

But if Y = 4, X equally likely 5 or 8; Var(X|Y = 4) = 2.25.

299

(Calculation: E(X|Y = 4) = 6.5, E(X²|Y = 4) = 44.5, Var(X|Y = 4) = 44.5 − (6.5)² = 2.25.)

Another expression of how Y affects X . If know Y = 3, know X

exactly, but if Y = 4, more uncertain about possible X .

300


Summary

• Conditional expectation E(X|Y = y) gives “average” X for

given Y .

• Calculate from conditional distribution of X|Y = y.

• Same way: define conditional expectation E(X|Y ) as random

variable (depends on Y ).

• Total expectation: E(E(X|Y)) = E(X).

• Conditional variance Var(X|Y = y) is variance of conditional

distribution of X given Y = y. Expresses how variable X is for

different Y .

301

Inequalities relating probability, mean and

variance

Mean and variance closely related to probabilities. There are general
relationships, true for a wide range of random variables and
distributions.

Markov inequality: if X cannot be negative, then

P(X ≥ a) ≤ E(X)/a.

In words: if mean small, X unlikely to be very large.

302

Chebychev inequality:

P(|Y − µY| ≥ a) ≤ Var(Y)/a².

In words: if variance small, Y unlikely to be far from mean.

(Variations in spelling: best English transliteration from Russian

probably “Chebyshov”.)

303

Example: suppose X = 0, 1, 2 each with probability 1/3. Then
E(X) = 1, E(X²) = 5/3, so Var(X) = 2/3.

Markov with a = 1.5 says P(X ≥ 1.5) ≤ 1/1.5 = 2/3. Actual
P(X ≥ 1.5) = P(X = 2) = 1/3, which is indeed ≤ 2/3.

Chebychev with a = 0.9:
P(|X − 1| ≥ 0.9) ≤ (2/3)/(0.9)² = 0.823. Actual
P(|X − 1| ≥ 0.9) = P(X ≤ 0.1) + P(X ≥ 1.9) = P(X = 0) + P(X = 2) = 2/3.

Bounds from Markov and Chebychev inequalities often not very

close to truth, but guaranteed, so can use inequalities to prove

results.

304
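
(Both bounds are easy to confirm numerically; a sketch, in plain Python, for the three-point distribution above:

xs, p = [0, 1, 2], 1/3                      # X = 0, 1, 2 each with prob 1/3
EX = sum(x * p for x in xs)                 # 1
VarX = sum(x**2 * p for x in xs) - EX**2    # 2/3

# Markov: P(X >= a) <= E(X)/a, valid since X >= 0
a = 1.5
print(sum(p for x in xs if x >= a), "<=", EX / a)                 # 1/3 <= 2/3

# Chebychev: P(|X - mu| >= a) <= Var(X)/a^2
a = 0.9
print(sum(p for x in xs if abs(x - EX) >= a), "<=", VarX / a**2)  # 2/3 <= 0.823
)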

Proof of Markov inequality

Uses idea that if Z ≤ X , then E(Z) ≤ E(X).

Define random variable Z = a if X ≥ a, 0 otherwise. Because

X ≥ 0, value of Z always ≤ that of X : Z ≤ X .

E(Z) = aP (X ≥ a) + 0P (X < a) = aP (X ≥ a).

But Z ≤ X so E(Z) ≤ E(X) and therefore

aP (X ≥ a) ≤ E(X). Divide both sides by a. Done.

305

Proof of Chebychev inequality

This uses Markov’s inequality with clever choice of random variable.

Let X = (Y − µY)²; X ≥ 0. Then Markov’s inequality (with a²
replacing a) says

P(X ≥ a²) ≤ E(X)/a² ⇒ P[(Y − µY)² ≥ a²] ≤ E[(Y − µY)²]/a².

In last inequality, E[·] is Var(Y). On left, both terms in probability
≥ 0, so can square-root both sides. Gives

P(|Y − µY| ≥ a) ≤ Var(Y)/a²

which is Chebychev’s inequality. Done.

306

Cauchy-Schwartz and Jensen inequalities

Cauchy-Schwartz:

|Cov(X,Y)| ≤ √(Var(X) Var(Y)) ⇒ |Corr(X,Y)| ≤ 1.

Proof: page 188 of text. Idea, for X, Y having mean 0: write
E[(X − λY)²] in terms of variances and covariances; result must
be ≥ 0.

Jensen’s inequality relates E(g(X)) and g(E(X)). Specifically,

if g(x) is concave up (that is, g′′(x) ≥ 0), then

g(E(X)) ≤ E(g(X)).

307

Proof: Tangent line to concave-up function always ≤ function

(picture). Consider tangent line to g(x) at x = E(X); suppose

equation is a + bx. Then g(E(X)) = a + bE(X). Also, line

≤ g(x) everywhere else, so

a + bX ≤ g(X) ⇒ E(a + bX) ≤ E(g(X))

⇒ a + bE(X) ≤ E(g(X))

⇒ g(E(X)) ≤ E(g(X)).

Done.

(Note: text uses “convex” for “concave up”.)

308

Consequences of Jensen’s inequality

Take g(x) = x². Then (E(X))² ≤ E(X²). But
Var(X) = E(X²) − (E(X))² ≥ 0, so knew that anyway.

Another: suppose X = 1, 2, 3, each prob 1/3. Then E(X) = 2.

But get another kind of average by multiplying the 3 possible values and
taking 3rd root. This is called geometric mean. Here is
(1 · 2 · 3)^{1/3} = 1.817. Ordinary mean greater than geometric mean.

Look at log of geometric mean:

ln{(1 · 2 · 3)^{1/3}} = (1/3) ln(1 · 2 · 3) = (1/3)(ln 1 + ln 2 + ln 3) = E(ln X).

Thus geometric mean is e^{E(ln X)}.

309

Jensen: − ln x is concave up for x > 0, so

− ln(E(X)) ≤ E(− ln X) ⇒ ln(E(X)) ≥ E(ln X).

Exponentiate both sides (e^{ln y} = y):

E(X) ≥ e^{E(ln X)}.

This says that for any positive random variable X, the ordinary
mean will always be ≥ the geometric mean.

310
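
(A one-off check of E(X) ≥ e^{E(ln X)} for the example above; a sketch in Python:

import math

xs = [1, 2, 3]                                            # each with prob 1/3
arith = sum(xs) / len(xs)                                 # E(X) = 2
geom = math.exp(sum(math.log(x) for x in xs) / len(xs))   # e^{E(ln X)} = 1.817...
print(arith, geom, arith >= geom)                         # 2.0 1.817... True
)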

Summary

• Markov’s inequality: P(X ≥ a) ≤ E(X)/a.

• Chebychev’s inequality: P(|Y − µY| ≥ a) ≤ Var(Y)/a².

• These inequalities guaranteed, so can be used for proofs.

• Cauchy-Schwartz: |Cov(X,Y)| ≤ √(Var(X) Var(Y)), so
|Corr(X,Y)| ≤ 1.

• Jensen: if g(x) concave up (g′′(x) ≥ 0), then
g(E(X)) ≤ E(g(X)).

• Consequence: geometric mean always ≤ ordinary mean.

311

Sampling Distributions and

Limits

312

Introduction: roulette

See http://tinyurl.com/238p5 for intro to game.

Basic idea: bet on number or number combination. Roulette wheel

spun, one number is winner. Your bet wins if it contains winning

number.

Wheel also contains numbers 0, 00. Winning bets paid as if 0, 00

absent (advantage to casino).

Bet 1: “high number”: win with 19–36, lose otherwise. Bet $1, win
$1 if win. Let W be winnings on one play; P(W = 1) = 18/38,
P(W = −1) = 20/38. Then

E(W) = 1 · (18/38) + (−1) · (20/38) = −2/38 ≃ −$0.05.

313

Bet 2: “lucky number”: win if 24 comes up, lose otherwise. Win $35
for $1 bet. Now P(W = 35) = 1/38, P(W = −1) = 37/38, so

E(W) = 35 · (1/38) + (−1) · (37/38) = −2/38 ≃ −$0.05.

In both bets, lose 5 cents per $ bet in long run.

Play game not once but many times. Interested in total winnings, or

mean winnings per play. Let Wi be winnings on play i; then mean

winnings per play Mn over n plays is

Mn =1

n

n∑

i=1

Wi.

Investigate behaviour of Mn by simulation.

314
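
(For readers without Minitab, the simulation might look like this in Python; a sketch — the seed and the use of numpy are assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(1)
n = 1000

# One play's winnings W for each bet
w_high = rng.choice([1, -1], size=n, p=[18/38, 20/38])    # high-number bet
w_lucky = rng.choice([35, -1], size=n, p=[1/38, 37/38])   # lucky-number bet

for w in (w_high, w_lucky):
    M = np.cumsum(w) / np.arange(1, n + 1)   # M_n after each play
    print(M[29], M[-1], -2/38)               # M_30 and M_1000 vs E(W)
)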

High-number, 30 plays:

[Plot: M_n against n for 30 plays of the high-number bet.]

315

High-number, 1000 plays:

[Plot: M_n against n for 1000 plays of the high-number bet; dotted line at E(W).]

316

Lucky-number, 1000 plays:

[Plot: M_n against n for 1000 plays of the lucky-number bet.]

317

Notes about roulette simulation

1st graph: in high-number bet, fortune goes up/down by $1 per play;

winnings/play pattern similar. On this sequence, in profit after 30

plays, but losing after 15.

2nd graph: same bet, 1000 plays. Less fluctuation after more trials;

winnings per play apparently tending to dotted line, E(W ). (Other

simulations have different shape but similar end behaviour.)

3rd graph: lucky-number bet, 1000 plays. Large jump upwards on

each win. Picture more erratic than for high-number bet; long-term

behaviour not clear yet. (Need more plays.)

318

Summary

• Roulette: bet on number or number combination. Each spin of

wheel gives winning number; win if that number part of your

combination, lose otherwise.

• Amount you win determined as if 0, 00 absent from wheel.

• Expected winnings from most bets −$0.05 (per $ bet).

• Investigate bets by simulation; do 1000 simulated plays:

– high number bet: simulated very close to expectation.

– lucky-number bet: simulated not close to expectation

because results more variable.

319

Understanding Mn mathematically: mean, variance

Mn = (1/n) Σ_{i=1}^n Wi

is a sum. Wi in sum independent, each same distribution (one spin of
wheel has no effect on other spins). So can calculate E(Mn) and
Var(Mn).

Already found E(Wi) = −2/38 for both our bets.

Find variances for bets: for high-number bet, Var(Wi) = 0.9972;

for lucky-number bet Var(Wi) = 33.21.

320
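
(Both variances follow from Var(W) = E(W²) − (E(W))²; a quick check in plain Python:

def mean_var(dist):
    # dist: list of (value, probability) pairs for one play's winnings W
    m = sum(w * p for w, p in dist)
    m2 = sum(w * w * p for w, p in dist)
    return m, m2 - m * m

print(mean_var([(1, 18/38), (-1, 20/38)]))   # (-0.0526..., 0.9972...)
print(mean_var([(35, 1/38), (-1, 37/38)]))   # (-0.0526..., 33.21...)
)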

For mean:

E(Mn) = (1/n) Σ_{i=1}^n E(Wi) = (1/n) Σ_{i=1}^n (−2/38) = −2/38,

since there are n terms in the sum, all the same.

That is, regardless of how long you play, you will lose 5 cents per $

bet on average.

321

Var(Mn) = (1/n²) Σ_{i=1}^n Var(Wi) = Var(Wi)/n.

Sum has n terms all equal to variance of one play’s winnings. So for
high-number bet, Var(Mn) = 0.9972/n; for lucky-number bet,
Var(Mn) = 33.21/n.

For any particular n, variance for high-number bet lower. Supports

simulation: high-number bet results more predictable.

In both cases, as n → ∞, Var(Mn) → 0. Longer you play, more

predictable Mn is.

322

Distribution of Mn

Mean and variance not whole story – want to know things like
P(Mn > 0) (chance of profit). For this, need distribution of Mn.

Start with M2 (2 plays). Do lucky-number bet (P(W = 35) = 1/38,
P(W = −1) = 37/38).

4 possibilities:

• win both times. M2 = (35 + 35)/2 = 35;
P(M2 = 35) = (1/38)² = 1/1444 ≃ 0.0007.

• win on 1st, lose on 2nd. M2 = (35 + (−1))/2 = 17; prob is
(1/38) · (37/38) = 37/1444.

323

• lose on 1st, win on 2nd. Again M2 = 17 and prob is same as
above. Thus overall P(M2 = 17) = 74/1444 ≃ 0.0512.

• lose on both. M2 = ((−1) + (−1))/2 = −1;
P(M2 = −1) = (37/38)² = 1369/1444 ≃ 0.9480.

Calculation complicated, even for n = 2, because have to consider

all possible combinations.

In general: this kind of distribution very difficult to find exactly. So

look for approximations to it.

324
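
(The book-keeping over all win/lose combinations is mechanical, which suggests letting a computer do it. A sketch that enumerates the exact distribution of M2 — and works the same way for any small n:

from itertools import product
from fractions import Fraction as F

outcomes = [(35, F(1, 38)), (-1, F(37, 38))]   # lucky-number bet
n = 2

dist = {}
for plays in product(outcomes, repeat=n):      # all 2^n win/lose combinations
    m = F(sum(w for w, _ in plays), n)         # M_n for this combination
    prob = F(1)
    for _, p in plays:
        prob *= p
    dist[m] = dist.get(m, F(0)) + prob

print(dist)   # {35: 1/1444, 17: 37/722 (= 74/1444), -1: 1369/1444}
)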

Summary

• Mn = (Σᵢ Wi)/n is a sum, so can find E(Mn) = −2/38 for
both bets, Var(Mn) = 0.9972/n for high-number,
Var(Mn) = 33.21/n for lucky-number.

• For fixed n, average winnings more predictable for high-number

bet.

• As n → ∞, Var(Mn) → 0 in both cases.

• Actual distribution of Mn difficult to find. Seek approximation.

325

Sampling distributions

Suppose X1, X2, . . . , Xn are random variables, each independent

and with same distribution. For example:

• Xi is winnings from i-th play of a roulette bet.

• Xi is height of i-th randomly chosen Canadian.

• Xi = 1 if randomly chosen voter supports Liberal party,

Xi = 0 otherwise.

• Xi is randomly generated value from a distribution with density

fX(x).

In each case: underlying phenomenon of interest, collect data at

random to help understand phenomenon.

326

Summarize Xi values using random variable

Yn = h(X1, X2, . . . , Xn) for some function h (eg. mean, like

Mn).

Some jargon:

• total collection of individuals (all possible spins of roulette

wheel, all Canadians, all possible values) called population .

• particular individuals selected, or Xi values obtained from

them, called sample .

• Yn defined above called sample statistic.

Usually don’t know about population, so draw conclusion about it

based on sample.

327

First: opposite problem: if we know population, find out what

samples from it look like.

“At random” important, and specific. Each individual value in

population must have correct chance of being in sample (same

chance, for human populations), and each must be in sample or not

independently of others.

Aim: learn about distribution of Yn, called sampling distribution .

General statements difficult. Approach: find what happens as

n → ∞, then use result as approximation for finite n.

328

Convergence in probability; weak law of

large numbers

In mathematics, accustomed to convergence ideas. Eg. if
an = 1 − 1/n, so that a1 = 0, a2 = 1/2, a3 = 2/3, etc., an → 1
(converges to 1) as n → ∞ because, by taking n large enough, all
values after an as close to 1 as desired.

For sequence X1, X2, . . . of random variables, what is meaning of

Xn → Y , where Y is random variable?

329

Different possibilities. One idea: “prob of Xn being far from Y goes
to 0 as n gets large”. Leads to definition:

Sequence {Xn} converges in probability to Y if, for all ε > 0,
lim_{n→∞} P(|Xn − Y| ≥ ε) = 0. Notation: Xn →P Y.

Example: suppose U ∼ Uniform[0, 1]. Let Xn = 3 when
U ≤ (2/3)(1 − 1/n) and 8 otherwise.

Thus when n = 1, X1 must be 8. If U > 2/3, Xn remains 8 forever,
but if U ≤ 2/3, U ≤ (2/3)(1 − 1/n) eventually, so Xn becomes 3 for
some n, then remains 3 forever.

(Cannot know which will happen since U random variable.)

330

Now define Y = 3 if U ≤ 2/3 and Y = 8 otherwise. Same as
“eventual” Xn, so should have Xn →P Y. Correct?

P(|Xn − Y| ≥ ε) = P(Xn ≠ Y) = P((2/3)(1 − 1/n) < U < 2/3) = 2/(3n).

This tends to 0 as n → ∞, so Xn →P Y.

331
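
(The 2/(3n) rate can also be seen by simulation; a sketch, with numpy assumed, estimating P(Xn ≠ Y) for several n:

import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(size=100_000)
Y = np.where(U <= 2/3, 3, 8)

for n in (1, 10, 100, 1000):
    Xn = np.where(U <= (2/3) * (1 - 1/n), 3, 8)
    print(n, np.mean(Xn != Y), 2/(3*n))   # simulated estimate vs exact 2/(3n)
)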

Convergence to a constant

What if Y not random variable, but number?

Example: suppose Zn ∼ Exponential(n). Then E(Zn) = 1/n,
suggesting that Zn typically gets smaller and smaller. Does
Zn →P 0?

P(|Zn − 0| ≥ ε) = P(Zn ≥ ε) = ∫_ε^∞ n e^{−nx} dx = e^{−nε}.

For any fixed ε, P(|Zn − 0| ≥ ε) → 0, so Zn →P 0.

Important special case (usually easier to handle).

332

Convergence to mean

Suppose sequence {Yn} has E(Yn) = µ for all n. Then Yn →P µ
if P(|Yn − µ| ≥ ε) → 0.

But recall Chebychev’s inequality,
P(|Y − µY| ≥ a) ≤ Var(Y)/a². Here:

P(|Yn − µ| ≥ ε) ≤ Var(Yn)/ε².

For fixed ε, right side (and hence left side) tends to 0 if
Var(Yn) → 0, in which case Yn →P µ.

(Logically: if Var(Yn) getting smaller, Yn becoming closer to their
mean µ.)

333

Weak Law of Large Numbers

Return to X1, X2, . . . , Xn being a random sample from some
population with mean E(Xi) = µ and variance Var(Xi) = v.
Consider sample mean

Mn = (1/n) Σ_{i=1}^n Xi.

Intuitively, expect Mn to be “close” to population mean µ, and to get

closer as n increases (more information in larger sample).

Does Mn →P µ? Re-do roulette calculations to show that
E(Mn) = µ and Var(Mn) = Var(Xi)/n = v/n.

334

Now, {Mn} is sequence of random variables with same mean µ.
Result of section “convergence to mean” says that Mn →P µ if
Var(Mn) → 0. But here, Var(Mn) = v/n → 0. This proves
that Mn →P µ.

This justifies use of sample mean as estimate of the population

mean. Can estimate average height of all Canadians by measuring

average height of sample of Canadians; the larger the sample,

the closer the estimate will likely be.

Important result, called weak law of large numbers .

335

To generalize: suppose now that Xi do not all have same variance,
but Var(Xi) = vi. Then

Var(Mn) = (1/n²) Σ_{i=1}^n vi.

This might not → 0. But suppose that vi ≤ v for all i. Then

Var(Mn) = (1/n²) Σ_{i=1}^n vi ≤ (1/n²) Σ_{i=1}^n v = v/n → 0.

In other words, Mn →P µ even if the variances are not all equal,
provided that they are bounded.

336

Convergence with probability 1

Previous example: suppose U ∼ Uniform[0, 1]. Let Xn = 3
when U ≤ (2/3)(1 − 1/n) and 8 otherwise. Let Y = 3 if U ≤ 2/3
and Y = 8 otherwise. Concluded that Xn →P Y.

Take another approach. Suppose we knew U, eg. suppose
U = 0.4. Then

0.4 ≤ (2/3)(1 − 1/n) ⇒ n ≥ 5/2.

Thus X1 = X2 = 8, X3 = X4 = · · · = 3. This is ordinary
sequence of numbers, converges to 3. Also, if U = 0.4, Y = 3.

337

In general: if U < 2/3, Xn = 8 for n < 2/(2 − 3U) and Xn = 3
after that. If U > 2/3, Xn = 8 for all n.

In both cases, Xn → Y as ordinary sequence for any particular

value of U . Potentially different idea of convergence of random

variables.

Definition: Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y.

In words: consider all ways to get (number) sequences {Xn}; for
each, consider corresponding Y. If Xn → Y always, then
Xn →a.s. Y.

338

Is it same as convergence in probability?

Example: let U ∼ Uniform[0, 1], and define {Xn} like this:

• X1 = 1 if 0 ≤ U < 12, 0 otherwise

• X2 = 1 if 12≤ U < 1, 0 otherwise

• X3 = 1 if 0 ≤ U < 14, 0 otherwise

• X4 = 1 if 14≤ U < 1

2, 0 otherwise

• X5 = 1 if 12≤ U < 3

4, 0 otherwise

• X6 = 1 if 34≤ U < 1, 0 otherwise

• X7 = 1 if 0 ≤ U < 18, 0 otherwise

• X8 = 1 if 18≤ U < 1

4, 0 otherwise, etc.

339

(Divided [0, 1] into 2, then 4, then 8, . . . intervals.)

Intervals getting shorter, so P(Xn = 1) decreasing. Indeed, for
ε < 1, P(|Xn − 0| ≥ ε) = P(Xn = 1) → 0, so Xn →P 0.

Suppose U = 0.2. Then Xn = 0 except for
X1 = X3 = X8 = · · · = 1. Beyond any n, always another
Xn = 1 (always another interval containing 0.2). So for U = 0.2,
number sequence {Xn} has no limit. Hence not true that Xn →a.s. 0.

Example shows that two convergence ideas different – convergence

with probability 1 harder to achieve.

340

Strong law of large numbers

Random sample X1, X2, . . . , Xn with E(Xi) = µ,
Var(Xi) ≤ v; let Mn = (Σ_{i=1}^n Xi)/n be sample mean.

Already showed that Mn →P µ (“weak law of large numbers”).

Also strong law of large numbers : Mn →a.s. µ. Proof difficult.

In words: out of (infinitely) many different sequences {Mn}
obtainable, every one of them converges to µ.

341

Summary

• Population: “all possible” values of a random variable.

• Sample: observe X1, X2, . . . , Xn.

• Calculate sample statistic h(X1, . . . , Xn) (eg. mean, median).

Want to know sampling distribution of h over repeated samples,

assuming (for now) that population known.

• General results difficult; find out what happens as n → ∞, and

use as approximation for finite n.

• Xn converges in probability to Y (Xn →P Y) if
P(|Xn − Y| ≥ ε) → 0 as n → ∞ for all ε.

342

• Convergence to mean: if E(Yn) = µ and Var(Yn) → 0, then
Yn →P µ (Chebychev).

• Sample mean Mn →P µ (pop. mean). Called weak law of large
numbers. “Larger sample is more informative”.

• Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y. All possible (number) sequences {Xn}
converge to corresponding Y.

• →a.s. more demanding than →P.

• Strong law of large numbers: Mn →a.s. µ.

343

Convergence in distribution

Consider independent sequence of random variables {Xn} with
P(Xn = 1) = 1/2 + 1/n and P(Xn = 0) = 1/2 − 1/n. Also, let
P(Y = 0) = P(Y = 1) = 1/2 independently of the Xn.

Now, take ε < 1. Then P(|Xn − Y| ≥ ε) = P(Xn ≠ Y). Could
have Xn = 0, Y = 1 or Xn = 1, Y = 0; use independence:

P(Xn ≠ Y) = (1/2 − 1/n)(1/2) + (1/2 + 1/n)(1/2) = 1/2.

Not → 0, so not true that Xn →P Y.

344

But Xn does converge to Y in sense that
P(Xn = 1) → 1/2 = P(Y = 1) and
P(Xn = 0) → 1/2 = P(Y = 0). Called convergence in
distribution .

To make definition: note that P(Xn = x) meaningless for
continuous Xn, so work with P(Xn ≤ x) instead.

Then: {Xn} converges in distribution to Y if
P(Xn ≤ x) → P(Y ≤ x) for all x. Notation: Xn →D Y.

345

Example: Poisson approximation to binomial

Suppose Xn ∼ Binomial(n, λ/n) (that is, trials increasing but
success prob decreasing so that E(Xn) = n(λ/n) = λ constant).
Then

P(Xn = j) = (n choose j) (λ/n)^j (1 − λ/n)^{n−j} → e^{−λ} λ^j / j!,

which is P(Y = j) when Y ∼ Poisson(λ). That is,
Xn →D Poisson(λ).

(Proof based on lim_{n→∞} (1 − x/n)^n = e^{−x}.)

Suggests that if n large and θ small, Poisson is good approx to

binomial.

346

Try this: take λ = 1.5 for n = 2, 5, 10, 20, 100:

x n=2 n=5 n=10 n=20 n=100 Poisson

0 0.0625 0.1680 0.1968 0.2102 0.2206 0.2231

1 0.3750 0.3601 0.3474 0.3410 0.3359 0.3346

2 0.5625 0.3087 0.2758 0.2626 0.2532 0.2510

3 0.0000 0.1323 0.1298 0.1277 0.1259 0.1255

4 0.0000 0.0283 0.0400 0.0440 0.0465 0.0470

5 0.0000 0.0024 0.0084 0.0114 0.0136 0.0141

6 0.0000 0.0000 0.0012 0.0023 0.0032 0.0035

Approx for n = 20 not bad; for n = 100 is very good.

347
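
(A table like this can be rebuilt in a few lines with scipy.stats — a sketch; the slides presumably produced the original in Minitab:

from scipy.stats import binom, poisson

lam = 1.5
for x in range(7):
    row = [binom.pmf(x, n, lam / n) for n in (2, 5, 10, 20, 100)]
    print(x, [round(p, 4) for p in row], round(poisson.pmf(x, lam), 4))
)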

Convergence in distribution and moment generating

functions

Moment-generating function mY (s) for random variable Y is

function of s.

Uniqueness theorem: if mX(s) = mY (s) for all s where both

finite, then X,Y have same distribution.

Suggests following (true) result: if {Xn} is sequence of random
variables with mXn(s) → mY(s) (for all s where both sides finite),
then Xn →D Y.

348

Summary

• Xn →D Y if P(Xn ≤ x) → P(Y ≤ x) for all x.

• If Xn ∼ Binomial(n, λ/n) and Y ∼ Poisson(λ), then
Xn →D Y. (If n large, θ small, Poisson good approx. to
binomial.)

• If mXn(s) → mY(s) for all valid s, then Xn →D Y.

349

Central Limit Theorem

Return to “random sample” X1, X2, . . . , Xn; suppose E(Xi) = 0

and Var(Xi) = 1.

Define Mn = (Σ_{i=1}^n Xi)/n. Does Mn converge in distribution to
anything interesting?

Well, E(Mn) = 0 but Var(Mn) = 1/n → 0. So look instead at
Zn = √n Mn: E(Zn) = 0 and Var(Zn) = 1. Then
Zn = (Σ_{i=1}^n Xi)/√n.

350

Moment-generating function for Xi is

mXi(s) = 1 + s E(Xi) + (s²/2!) E(Xi²) + (s³/3!) E(Xi³) + · · · ;

here E(Xi) = 0, Var(Xi) = 1 so E(Xi²) = 1, giving

mXi(s) = 1 + s²/2 + (s³/3!) E(Xi³) + · · · .

Now, by rules for mgf’s,

mZn(s) = mX1(s/√n) · mX2(s/√n) · · · · · mXn(s/√n)
       = {mXi(s/√n)}ⁿ
       = (1 + s²/(2n) + (s³/(3! n^{3/2})) E(Xi³) + · · ·)ⁿ.

351

Recall that as n → ∞, (1 + y/n)ⁿ → e^y. Above, the terms in s³
and higher contribute less and less as n increases, so only the 1
and s²/(2n) terms in bracket have effect. Thus

lim_{n→∞} mZn(s) = lim_{n→∞} (1 + s²/(2n))ⁿ = e^{s²/2}

which is mgf of standard normal distribution.

Thus, remarkable fact: regardless of distribution of Xi,
Zn →D N(0, 1).

Also works for Xi with any mean and variance: standardized
Mn →D N(0, 1). Called central limit theorem .

352

Exact distribution of Mn very difficult to find. But if n “large”,

distribution can be approximated very well by normal distribution,

easier to work with.

This is reason for studying normal distribution.

Note that theorem uses convergence in distribution, so that it is the

cdf that converges, not the density function. Important if Xi discrete.

Also, for approximation, don’t need to be so careful about

standardization. Any sum/mean for large n works.

353

CLT by simulation

Let U1, U2, . . . ∼ Uniform[0, 1]; investigate distribution of

Yn = (U1 + U2 + · · · + Un)/n for various n. Uniform[0, 1]

distribution completely unlike normal. Do by simulation:

1. choose “large” number of Yn’s to simulate (eg. nsim = 10,000)

2. in each of n columns, generate nsim random values from

Uniform[0, 1]

3. calculate simulated Yn values as row means. Eg. for n = 5,

let c10=rmean(c1-c5).

4. Draw histogram of results, compare normal distribution shape.

Normal good if curve through top middle of histogram bars.

354
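
(The same experiment in Python — a sketch mirroring the Minitab steps, with nsim = 10,000 as in step 1:

import numpy as np

rng = np.random.default_rng(2)
nsim = 10_000

for n in (2, 5, 20):
    y = rng.uniform(size=(nsim, n)).mean(axis=1)   # row means = simulated Y_n
    # CLT prediction: approximately N(1/2, 1/(12n))
    print(n, y.mean(), y.std(), np.sqrt(1 / (12 * n)))
)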

[Histogram of simulated y values, n = 2, density scale, with normal curve overlaid.]

n = 2: normal too high at top, too low elsewhere.

355

[Histogram of simulated y values, n = 5, density scale, with normal curve overlaid.]

n = 5: much closer approx.

356

[Histogram of simulated y values, n = 20, density scale, with normal curve overlaid.]

n = 20: almost perfect.

357

Normal approx to binomial

Binomial is sum of Bernoullis, so CLT should apply if #trials n large.

Suppose Y ∼ Binomial(4, 0.5). Then E(Y ) = 2, Var(Y ) = 1.

Exact P (Y ≤ 1):

P(Y ≤ 1) = (4 choose 0)(0.5)⁰(1 − 0.5)⁴ + (4 choose 1)(0.5)¹(0.5)³ = 0.3125.

Take X ∼ N(2, 1) (same mean, variance as Y). P(X ≤ 1)?

P(X ≤ 1) = P(Z ≤ (1 − 2)/√1) = P(Z ≤ −1) = 0.1587.

Not very close!

358

Problem: X continuous, but Y discrete. Y ≤ 1 really “Y ≤
anything rounding to 1”. Suggests approximating P(Y ≤ 1) by
P(X ≤ 1.5):

P(X ≤ 1.5) = P(Z ≤ (1.5 − 2)/√1) = P(Z ≤ −0.5) = 0.3085.

For such small n, really very close to P (Y ≤ 1) = 0.3125.

In general, add 0.5 for ≤ and subtract 0.5 for <. Called continuity

correction ; do whenever discrete distribution approximated by

continuous.

(Alternatively: for binomial, P(Y ≤ 1) ≠ P(Y < 1), but for
normal, P(X ≤ 1) = P(X < 1).)

359

Compare Y ∼ Binomial(20, 0.5); E(Y) = 10, Var(Y) = 5.
Then exact P(Y ≤ 8) = 0.2517; approx by X ∼ N(10, 5) as

P(Y ≤ 8) ≃ P(X ≤ 8.5) = P(Z ≤ (8.5 − 10)/√5) = P(Z ≤ −0.67) = 0.2514.

Now, approx very good.

360
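
(Both comparisons, checked with scipy.stats; a sketch:

from math import sqrt
from scipy.stats import binom, norm

for n, p, k in [(4, 0.5, 1), (20, 0.5, 8)]:
    mu, sd = n * p, sqrt(n * p * (1 - p))
    exact = binom.cdf(k, n, p)                 # exact binomial P(Y <= k)
    plain = norm.cdf((k - mu) / sd)            # normal, no correction
    corrected = norm.cdf((k + 0.5 - mu) / sd)  # with continuity correction
    print(n, round(exact, 4), round(plain, 4), round(corrected, 4))
)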

If p ≠ 0.5, binomial skewed; skewness decreases as n increases.
So need larger n for p far from 0.5.

Example: n = 20, p = 0.1. Simulate and plot using Minitab:

MTB > random 1000 c3;

SUBC> binomial 20 0.1.

MTB > hist c3

Shape clearly skewed, not normal. n = 20 not large enough here.

Rule of thumb: normal approx OK if np ≥ 5 and n(1 − p) ≥ 5.

Examples: n = 4, p = 0.5: np = 2 < 5, no good.

n = 20, p = 0.5: np = n(1 − p) = 10 ≥ 5, good;

n = 20, p = 0.1: np = 2 < 5, no good.

361

Summary

• Central Limit Theorem: if E(Xi) = 0 and Var(Xi) = 1, and
Zn = (Σ_{i=1}^n Xi)/√n, then Zn →D N(0, 1) even though Xi
could have any distribution (with finite variance).

• Proof also works with E(Xi) = µ, Var(Xi) = σ²; define Zn
using standardized Xi.

• Can assess Central Limit Theorem by simulation.

• Can approximate binomial with large n by normal (continuity

correction).

362

Monte Carlo integration

Integral I = ∫₀¹ sin(x⁴) dx: impossible algebraically (no
antiderivative). Get approximate answer numerically eg. by
Simpson’s rule. But can also recognize that

I = E{sin(U⁴)}

where U ∼ Uniform[0, 1]. I is “average” of sin(U⁴), suggesting

procedure:

1. Generate U randomly from Uniform[0, 1].

2. Calculate T = sin(U⁴).

3. Repeat steps 1 and 2 many times, find mean value m of T .

363

Minitab commands to do this (U in c1, T in c2):

MTB > random 1000 c1;

SUBC> uniform 0 1.

MTB > let c2=sin(c1**4)

MTB > mean c2

I got m = 0.19704. How accurate?

m observed value of random variable M. M mean of 1000 values,
so central limit theorem applies: approx normal distribution.

Mean, variance unknown but estimate using sample mean 0.19704,
sample SD 0.25221: E(M) ≃ 0.19704,
Var(M) = σ²/n ≃ 0.25221²/1000 = 6.36 × 10⁻⁵.

364

Now, 99.7% of normal distribution within mean ± 3 × SD, so I
almost certainly in

0.19704 ± 3√(6.36 × 10⁻⁵) = (0.173, 0.221).

To get more accurate answer, get more simulated values.

365
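
(The same estimate in Python — a sketch with 1000 draws, as in the Minitab run, so the numbers will differ slightly:

import numpy as np

rng = np.random.default_rng(3)
t = np.sin(rng.uniform(size=1000) ** 4)   # T = sin(U^4)

m, s = t.mean(), t.std(ddof=1)
half = 3 * s / np.sqrt(len(t))            # 3 SDs of the mean M
print(m, (m - half, m + half))            # interval almost certainly containing I
)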

Recognizing as expectation

Consider now I = ∫₀^∞ 5x cos(x²) e^{−5x} dx.

Again impossible algebraically; because of limits, can’t use previous

trick.

Idea: use distribution with right limits and density in integral. Here,
Exponential(5) has density 5e^{−5x} on correct interval, so
I = E{X cos(X²)} where X ∼ Exponential(5).

Minitab annoyance: its exponential dist has parameter 1/λ, so we

have to feed in 1/5 = 0.2.

366

Commands:

MTB > random 1000 c1;

SUBC> exponential 0.2.

MTB > let c2=c1*cos(c1**2)

MTB > describe c2

I got mean 0.1884, SD 0.1731, so this area almost certainly in

0.1884 ± 3 × 0.1731/√1000 = (0.1720, 0.2048).

367
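
(numpy has the same annoyance: its exponential generator is parameterized by the scale 1/λ, so we again pass 0.2; a sketch:

import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=0.2, size=1000)   # Exponential(5): scale = 1/5
t = x * np.cos(x ** 2)                      # X cos(X^2)

m, s = t.mean(), t.std(ddof=1)
half = 3 * s / np.sqrt(len(t))
print(m, (m - half, m + half))
)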

Summary

• Recognize integral as expectation of a distribution (integral has

correct limits and density function inside).

• Generate random values from distribution, compute for each the

function that integral is expectation of.

• Estimated integral is mean of computed values (use sample SD,

and say integral almost certainly ±3SD from mean).

368

Approximating sampling distributions

Central Limit Theorem only applies to means (sums), so is no help

for other quantities (median, variance etc).

Can approximate sampling distributions for these by simulation.

Idea:

1. simulate random sample from population

2. calculate sample quantity

3. repeat steps 1 and 2 many times, summarize results.

369

Sampling distribution of sample median in normal

population

Suppose X1, X2, . . . , Xn is random sample from normal

population mean 10, SD 2; take n = 3.

MTB > Random 500 c1-c3;

SUBC> Normal 10 2.

MTB > RMedian c1-c3 c4.

Samples in rows; use “row statistics” to get sample medians.

370

Shape is very like normal, even for such small sample.

371

Sampling distribution of sample variance in normal

population

Again suppose X1, X2, . . . , Xn ∼ N(10, 2²). Now take n = 5:

MTB > Random 500 c1-c5;

SUBC> Normal 10 2.

MTB > RStDev c1-c5 c6.

MTB > let c7=c6*c6

MTB > histogram c7

(samples in rows again; variance as square of SD.)

372
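
(Both simulations in Python — a sketch with 500 samples, as in the Minitab runs:

import numpy as np

rng = np.random.default_rng(5)

med = np.median(rng.normal(10, 2, size=(500, 3)), axis=1)   # 500 medians, n = 3
s2 = rng.normal(10, 2, size=(500, 5)).var(axis=1, ddof=1)   # 500 sample variances, n = 5

for q in (med, s2):
    z = (q - q.mean()) / q.std()
    print(np.mean(z ** 3))   # crude skewness: near 0 for median, clearly positive for variance
)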

Shape definitely skewed right: not normal-shaped.

373

Summary

• Approximate sampling distribution of quantity by:

– simulate random sample from population

– calculate sample quantity

– repeat many times, summarize results (histogram)

• Sampling distribution of sample median close to normal.

• Sampling distribution of sample variance skewed to right.

374

Normal distribution theory

Normal distribution arises often from CLT, so worth knowing

properties and related distributions. These used frequently in

Chapter 5 and beyond (STAB57).

First: suppose U, V are independent. Then Cov(U, V ) =

E(UV ) − E(U)E(V ) = E(U)E(V ) − E(U)E(V ) = 0 as

expected.

But: now suppose that Cov(U, V ) = 0. If U, V normal, then (fact)

U, V independent.

That is, for normal U, V , Cov(U, V ) = 0 if and only if U, V

independent. Not true for other distributions.

375

The chi-squared distribution

Suppose Z ∼ N(0, 1). What is distribution of W = Z²? Can’t
use usual transformation because Z² neither increasing nor
decreasing.

FW(w) = P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w).

This as integral is

FW(w) = ∫_{−√w}^{√w} (e^{−z²/2}/√(2π)) dz
      = ∫_{−∞}^{√w} (e^{−z²/2}/√(2π)) dz − ∫_{−∞}^{−√w} (e^{−z²/2}/√(2π)) dz.

376

Differentiate both sides and simplify to get

fW(w) = (1/√(2πw)) e^{−w/2}.

This is called chi-squared distribution with 1 degree of freedom
(df). Written W ∼ χ²₁.

Now suppose Z1, Z2, . . . , Zn ∼ N(0, 1) independently.
Distribution of W = Z1² + Z2² + · · · + Zn² called chi-squared with
n degrees of freedom . Written W ∼ χ²ₙ.

What is E(W)?

E(W) = E(Σ_{i=1}^n Zi²) = Σ_{i=1}^n E(Zi²) = n(1) = n

since E(Zi²) = Var(Zi) = 1.

377

To get density function of χ²₁, compare gamma density with χ²₁:

(λ^α w^{α−1}/Γ(α)) e^{−λw} = (1/√(2πw)) e^{−w/2}

if α = 1/2 and λ = 1/2. That is, χ²₁ = Gamma(1/2, 1/2).

If Zi² ∼ χ²₁, use mgf formula for gamma dist to write

m_{Zi²}(s) = (1/2)^{1/2} (1/2 − s)^{−1/2}.

378

If W = Σ_{i=1}^n Zi² ∼ χ²ₙ, mgf of W is n copies of m_{Zi²}(s)
multiplied together, ie.

mW(s) = (1/2)^{n/2} (1/2 − s)^{−n/2}

which is mgf of Gamma(n/2, 1/2). Using formula for gamma
density, then, for W ∼ χ²ₙ,

fW(w) = (1/(2^{n/2} Γ(n/2))) w^{n/2−1} e^{−w/2}.

Has skew-to-right shape (picture page 225).

379

Distribution of sample variance

Suppose X1, X2, . . . , Xn ∼ N(µ, σ²). Define X̄ = (Σ_{i=1}^n Xi)/n
to be sample mean, S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) to be sample
variance.

Know that X̄ ∼ N(µ, σ²/n). Distribution of S²?

Actually look at (n − 1)S²/σ² = Σ_{i=1}^n (Xi − X̄)²/σ². Can write
(p. 235) as sum of n − 1 squared N(0, 1)’s, so

(n − 1)S²/σ² ∼ χ²_{n−1}.

Fact: E(S²) = σ² (explains division by n − 1).

380

The t distribution

Standardize X̄:

(X̄ − µ)/√(σ²/n) ∼ N(0, 1).

But what if σ² unknown? Idea: replace σ² by sample variance S².
Distribution of result no longer normal (even though Xi are).

(X̄ − µ)/√(S²/n) = [(X̄ − µ)/√(σ²/n)] · [1/√{[(n − 1)S²/σ²]/(n − 1)}] = Z/√(Y/(n − 1))

where Z ∼ N(0, 1) and Y ∼ χ²_{n−1}.

This called t distribution with n − 1 degrees of freedom , written
t_{n−1}.

381

What happens as n increases? Write
Y/(n − 1) = Σ_{i=1}^{n−1} Zi²/(n − 1) where Zi ∼ N(0, 1). Then
E(Y/(n − 1)) = 1. Let k = Var(Zi²); then
Var(Y/(n − 1)) = (n − 1)k/(n − 1)² = k/(n − 1) → 0.

That is, Y/(n − 1) →P 1 and therefore

Z/√(Y/(n − 1)) →D N(0, 1);

that is, for large n, the t distribution with n − 1 df well approximated
by N(0, 1).

t distribution hard to work with; use tables/software for probabilities.

382

The F distribution

Suppose S1² and S2² sample variances from independent samples
sizes m, n, both from normal populations with variance σ². Then
might compare variances by looking at ratio R = S1²/S2²:

R = S1²/S2² = {[(m − 1)S1²/σ²]/(m − 1)} / {[(n − 1)S2²/σ²]/(n − 1)} = [X/(m − 1)] / [Y/(n − 1)]

where X ∼ χ²_{m−1} and Y ∼ χ²_{n−1}.

This defined to have F distribution with m − 1 and n − 1
degrees of freedom , written F(m − 1, n − 1).

383

Properties of F distribution

Ratio could have been S2²/S1² = 1/R with similar result: therefore,
if R ∼ F(m − 1, n − 1), then 1/R ∼ F(n − 1, m − 1).

Suppose T = X/√(Y/(n − 1)) ∼ t_{n−1}. Then

T² = (X²/1) / (Y/(n − 1))

is a χ²₁/1 over χ²_{n−1}/(n − 1); that is, T² ∼ F(1, n − 1).

384

In

R = [X/(m − 1)] / [Y/(n − 1)]:

if n → ∞, know that Y/(n − 1) →P 1, and numerator of
R ∼ χ²_{m−1}/(m − 1).

Hence, as n → ∞,

(m − 1)R →D χ²_{m−1}.

Thus χ²_{m−1} is useful approx to F(m − 1, n − 1) if n large.

385

Summary

• Normal distribution often from CLT, so worth knowing about
related distributions.

• If Z ∼ N(0, 1), W = Z² ∼ χ²₁.

• Σ_{i=1}^n Zi² ∼ χ²ₙ. Skewed to right.

• If W ∼ χ²ₙ, E(W) = n.

• For normal Xi, if S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) is sample
variance, (n − 1)S²/σ² ∼ χ²_{n−1}.

• (X̄ − µ)/√(S²/n) ∼ t_{n−1}.

• As n increases, tₙ →D N(0, 1).

386

• If S1², S2² variances from 2 independent samples, sizes m and
n, ratio R = S1²/S2² ∼ F(m − 1, n − 1).

• t²_{n−1} and F(1, n − 1) have same distribution.

• As n increases, (m − 1)R →D χ²_{m−1}.

387

Stochastic Processes

388

Random walks

Consider gambling game: win $1 with prob p, lose $1 with prob q

(p + q = 1). Each play independent. Start with fortune a; let Xn

denote fortune after n plays.

Thus X0 = a; X1 = a + 1 if win (prob p), X1 = a − 1 if lose

(prob q).

Sequence {Xn} of random variables called random walk .

389

Properties of random walk

At each step, two possible outcomes (win/lose), same prob p of
winning, independent. So number of wins Wn ∼ Binomial(n, p).

With Wn wins, must be n − Wn losses, so fortune after n plays is

Xn = a + (1)Wn + (−1)(n − Wn) = a + 2Wn − n.

Since E(Wn) = np, have

E(Xn) = a + 2np − n = a + 2n(p − 1/2).

Also

Var(Xn) = 2² Var(Wn) = 4np(1 − p).

390

Since Wn ∼ Binomial(n, p), have

P(Wn = j) = (n choose j) p^j q^{n−j};

write in terms of Xn to get

P(Xn = a + k) = P(a + k = a + 2Wn − n)
              = P(Wn = (n + k)/2)
              = (n choose (n+k)/2) p^{(n+k)/2} q^{(n−k)/2}.

Only certain values of Xn possible; formula fails for impossible
values.

391

Examples

Suppose a = 5, p = 1/4. Then
E(Xn) = 5 + 2n(1/4 − 1/2) = 5 − n/2. Expect fortune to decrease
on average.

What is P(X3 = 6)? Write 6 = 5 + 1 so k = 1, n = 3;
(n + k)/2 = 2 and (n − k)/2 = 1:

P(X3 = 6) = (3 choose 2)(1/4)²(3/4)¹ = 9/64.

How about P(X9 = 7)? This is P(X9 = 5 + 2), so n = 9 and
k = 2. But (n + k)/2 = (9 + 2)/2 not integer, so formula fails.
X9 cannot be 7 (in fact X9 must be even).

392
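
(Since Xn = a + 2Wn − n with Wn binomial, these probabilities come straight from the binomial pmf; a sketch using scipy.stats — the function name walk_pmf is invented here:

from scipy.stats import binom

def walk_pmf(target, a, n, p):
    # P(X_n = target) for a random walk started at a
    k = target - a
    if (n + k) % 2 != 0:                   # wrong parity: impossible value
        return 0.0
    return binom.pmf((n + k) // 2, n, p)   # P(W_n = (n + k)/2)

print(walk_pmf(6, a=5, n=3, p=1/4))   # 9/64 = 0.140625
print(walk_pmf(7, a=5, n=9, p=1/4))   # 0.0: X_9 cannot be 7
)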

Now suppose a = 20, p = 2/3. Then

E(Xn) = 20 + 2n(2/3 − 1/2) = 20 + n/3,

increasing with n.

Find P(X5 = 21) = P(X5 = 20 + 1): n = 5, k = 1 so (n + k)/2 = 3,
(n − k)/2 = 2 and

P(X5 = 21) = (5 choose 3)(2/3)³(1/3)² ≃ 0.329,

fairly likely.

393

Gambler’s ruin

Suppose we gamble with aim to reach fortune c > 0. How likely are
we to succeed before fortune reaches 0 (run out of money)?

Hard to see answer: no idea how long it takes to reach c or 0.

Idea: let S(a) be prob of reaching c first starting from fortune a.

Then for all c > 0, S(0) = 0, S(c) = 1. Also, if current fortune a,

fortune at next step either a + 1 or a − 1, leading to

S(a) = pS(a + 1) + qS(a − 1).

394

Solve above recurrence relation to get formula: if p = 1/2,
S(a) = a/c; otherwise,

S(a) = (1 − (q/p)^a) / (1 − (q/p)^c).

Example: start with $20, want to win $50. If p = 1/2, chance of
success is 20/50 = 0.4. If p = 0.51, chance of success is

S(20) = (1 − (0.49/0.51)²⁰) / (1 − (0.49/0.51)⁵⁰) ≃ 0.637.

Even a very small edge makes success much more likely. (Even

small disadvantage makes eventual failure much more likely.)

395
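
(The formula as a small function; a sketch, with success_prob a name invented here:

def success_prob(a, c, p):
    # P(random walk from a reaches c before 0), win prob p per play
    if p == 0.5:
        return a / c
    r = (1 - p) / p                  # q/p
    return (1 - r**a) / (1 - r**c)

print(success_prob(20, 50, 0.50))   # 0.4
print(success_prob(20, 50, 0.51))   # ~0.637
print(success_prob(20, 50, 0.49))   # ~0.19: a small edge against hurts badly
)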

Markov Chains

Simple model of weather:

• if sunny today, prob 0.7 of sunny tomorrow, prob 0.3 of rainy.

• if rainy today, prob 0.4 of sunny tomorrow, prob 0.6 of rainy.

Weather has two states (sunny, rainy). From one day to next,

weather may change state.

Probs above called transition probabilities . This kind of probability

model called Markov chain .

396

Can write as matrix:

P =

0.7 0.3

0.4 0.6

where element pij is P (go to state j|currently state i).

Note assumption: only need to know weather today to predict

weather tomorrow. (If weather today known, past weather

irrelevant). Called Markov property .

Suppose sunny today. Chance of sun in two days?

One idea: list possibilities. Two: SSS, SRS. Use transition probs to

get (0.7)(0.7) + (0.3)(0.4) = 0.61.

397

Another: calculate matrix P²:

P² = P · P =
  0.61 0.39
  0.52 0.48

Note that top-left calculation same as 1st idea above.

Matrix P² gives two-step transition probs. That is, if sunny today,
prob of sunny in 2 days’ time 0.61; if rainy today, almost even
chance of being rainy in 2 days.

In general, Pⁿ gives n-step transition probs (weather in n days’
time given weather today).

398
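
(Matrix powers are one call in numpy — a sketch; the course does this in Minitab:

import numpy as np

P = np.array([[0.7, 0.3],     # rows: today sunny, rainy
              [0.4, 0.6]])    # cols: tomorrow sunny, rainy

print(np.linalg.matrix_power(P, 2))   # [[0.61 0.39] [0.52 0.48]]
print(np.linalg.matrix_power(P, 8))   # both rows near (4/7, 3/7)
)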

Another example

“Ehrenfest’s Urn”: Two urns, containing total of 4 balls. Choose one

ball at random, take out of current urn, place in other urn. Keep

track of number of balls in urn 1.

Transition matrix (states 0, 1, 2, 3, 4 balls in urn 1):

P =
  0    1    0    0    0
  1/4  0    3/4  0    0
  0    2/4  0    2/4  0
  0    0    3/4  0    1/4
  0    0    0    1    0

Apparent tendency for number of balls in 2 urns to even out.

399

Find likely number of balls in urn 1 after 9 steps by finding P⁹. (Use
Minitab: see section E.1 of manual, p. 162.) Answer (rounded):

P⁹ =
  0      0.5   0     0.5   0
  0.125  0     0.75  0     0.125
  0      0.5   0     0.5   0
  0.125  0     0.75  0     0.125
  0      0.5   0     0.5   0

Start with even number of balls in urn 1: end with either odd

number, equally likely. Start with odd number: end with even

number, most likely 2.

400

Stationary distributions

Instead of starting from particular state, pick starting state from

prob. distribution θ = (θ1, θ2, . . .).

In weather example: suppose 80% chance today sunny, so

θ = (0.80, 0.20).

To get prob of each state n steps later, multiply θ as row vector by
Pⁿ. Weather example, for n = 2 days later:

(0.8, 0.2) P² = (0.592, 0.408).

401

Suppose we could find θ such that θP = θ. Then starting
distribution θ would be stationary : (marginal) prob of sunny day
same for all days.

Can try directly for weather example:

(θ1, θ2) P = (0.7θ1 + 0.4θ2, 0.3θ1 + 0.6θ2) = (θ1, θ2).

2 equations in 2 unknowns, collapse into one equation
0.3θ1 − 0.4θ2 = 0, but θi are probs so that θ1 + θ2 = 1 also.
Solve: θ1 = 4/7, θ2 = 3/7.

More generally: solve θP = θ by transposing both sides to get
Pᵀθᵀ = θᵀ. Like solution to Av = λv with λ = 1: stationary
prob θ is eigenvector of Pᵀ with eigenvalue 1.

402
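
(The eigenvalue problem in numpy — a sketch; the eigenvector is rescaled so the probabilities sum to 1:

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

vals, vecs = np.linalg.eig(P.T)          # solve P^T v = lambda v
v = vecs[:, np.argmin(abs(vals - 1))]    # eigenvector for eigenvalue 1
theta = np.real(v / v.sum())             # scale to a probability vector
print(theta)                             # [0.5714 0.4286] = (4/7, 3/7)
)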

Can use Minitab to get eigenvalues/vectors (manual p. 167). Usually
need to scale eigenvector to get probs summing to 1.

Ehrenfest urn example: 5 eigenvectors; one with eigenvalue 1 is
(0.120, 0.478, 0.717, 0.478, 0.120), scaling to
(1/16, 4/16, 6/16, 4/16, 1/16).

(Actually binomial probs: see text p. 595).

403

Limiting distributions

If initial state chosen from stationary distribution, then prob of each

state remains same for all time.

Also: if watch Markov chain for many steps, should not matter much

which state we began in.

Weather example: 8-step transition matrix is

P⁸ =
  0.57146 0.42854
  0.57139 0.42861
≃
  4/7 3/7
  4/7 3/7

Starting either from sunny or rainy day, chance of sunny day in 8
days’ time is about 4/7. Called the limiting distribution, here same as
stationary distribution.

404

Compare Ehrenfest urn example:

P⁸ ≃

0.125 0 0.75 0 0.125

0 0.5 0 0.5 0

0.125 0 0.75 0 0.125

0 0.5 0 0.5 0

0.125 0 0.75 0 0.125

not getting stationary distribution in each row.

Problem here: number of balls in urn 1 always goes from odd to

even or vice versa. So eg. P (1 ball in urn 1 after n steps)

alternates between 0 and positive; cannot have limit. Chain called

periodic .

405

To test whether a chain is periodic, ask: how long can it take
to get back to the state I’m in now? In the Ehrenfest urn example,
can get back to current state in 2, 4, 6, . . . steps, always a multiple
of 2. Thus the period of any state is 2.

A chain where every state has period 1 is called aperiodic.

406

Consider a third example:

P =
  0.5  0.5  0
  0.75 0.25 0
  0    0    1

Search for stationary distribution: there are two eigenvectors for
eigenvalue 1: (0.6, 0.4, 0) and (0, 0, 1).

Both stationary distributions in a way:
start in state 1 or 2, can never reach state 3. Start in state 3, can
never reach states 1 or 2.

Such chain called reducible : can split up into two chains, {1, 2}
and {3}, and treat each separately.

407

Markov chain limit theorem

Previous work suggests following theorem:

Suppose a Markov chain has a stationary distribution, is not

reducible, and is not periodic. Then its stationary distribution also

gives the probability, as n → ∞, of being in any particular state

after n steps.

In effect, the stationary distribution gives approx to long-term

behaviour of chain.

408

Summary

• Random walk: sequence of r.v.’s with X0 = a,
Xn+1 = Xn + 1 with prob. p, Xn+1 = Xn − 1 with prob.
q = 1 − p.

• E(Xn) = a + 2n(p − 1/2); Var(Xn) = 4np(1 − p). E(Xn)
increasing function of n if p > 1/2.

• P(Xn = a + k) = (n choose (n+k)/2) p^{(n+k)/2} q^{(n−k)/2}.

• Gambler’s ruin: does random walk reach 0 or c first? Prob. of
reaching c can be found; much greater than for p = 1/2 even if p
only slightly bigger than 0.5.

409

• Markov Chain: set of states S, probs P (Sj |Si) arranged in

matrix P = {pij}. (Depend only on one previous step.)

• Matrix Pᵏ gives k-step transition probs.

• If starting state chosen at random with probs. (θ1, θ2, . . .) and

θP = θ, θ called stationary distribution for chain. Find by

solving eigenvalue problem.

410

• Limiting distribution is lim_{n→∞} Pⁿ, if the limit exists. Gives

result of observing chain for many steps.

• State has period k if can only get back to state in a multiple of k

steps.

• Chain is irreducible if it is possible to move (in some number of

steps) from any state to any other state.

• For a Markov chain that is irreducible and aperiodic, if the

stationary distribution exists, is same as limiting distribution.

411

... that’s all, folks!

412