5 Joint Probability Distributions and Random Samples


Page 1: 5 Joint Probability Distributions and Random Samples

5 Joint Probability Distributions and Random Samples

Page 2: 5 Joint Probability Distributions and Random Samples

5.1 Jointly Distributed Random Variables

Page 3: 5 Joint Probability Distributions and Random Samples

3

Two Discrete Random Variables

Page 4: 5 Joint Probability Distributions and Random Samples

4

Two Discrete Random Variables

The probability mass function (pmf) of a single discrete rv X specifies how much probability mass is placed on each possible X value.

The joint pmf of two discrete rv’s X and Y describes how much probability mass is placed on each possible pair of values (x, y).

Definition

Let X and Y be two discrete rv’s defined on the sample space of an experiment. The joint probability mass function p(x, y) is defined for each pair of numbers (x, y) by

p (x, y) = P(X = x and Y = y)

Page 5: 5 Joint Probability Distributions and Random Samples

5

Two Discrete Random Variables

It must be the case that p(x, y) ≥ 0 and Σ_x Σ_y p(x, y) = 1.

Now let A be any set consisting of pairs of (x, y) values (e.g., A = {(x, y): x + y = 5} or {(x, y): max(x, y) ≤ 3}).

Then the probability P[(X, Y) ∈ A] is obtained by summing the joint pmf over pairs in A:

P[(X, Y) ∈ A] = Σ_{(x, y) ∈ A} p(x, y)

Page 6: 5 Joint Probability Distributions and Random Samples

6

Example 1

A large insurance agency services a number of customers who have purchased both a homeowner’s policy and an automobile policy from the agency. For each type of policy, a deductible amount must be specified.

For an automobile policy, the choices are $100 and $250, whereas for a homeowner’s policy, the choices are 0, $100, and $200.

Suppose an individual with both types of policy is selected at random from the agency’s files. Let X = the deductible amount on the auto policy and Y = the deductible amount on the homeowner’s policy.

Page 7: 5 Joint Probability Distributions and Random Samples

7

Example 1

Possible (X, Y) pairs are then (100, 0), (100, 100), (100, 200), (250, 0), (250, 100), and (250, 200); the joint pmf specifies the probability associated with each one of these pairs, with any other pair having probability zero.

Suppose the joint pmf is given in the accompanying joint probability table:

                  y
  p(x, y)     0      100     200
  x   100    .20     .10     .20
      250    .05     .15     .30

cont’d

Page 8: 5 Joint Probability Distributions and Random Samples

8

Example 1

Then p(100, 100) = P(X = 100 and Y = 100) = P($100 deductible on both policies) = .10.

The probability P(Y ≥ 100) is computed by summing probabilities of all (x, y) pairs for which y ≥ 100:

P(Y ≥ 100) = p(100, 100) + p(250, 100) + p(100, 200) + p(250, 200)

= .75

cont’d

Page 9: 5 Joint Probability Distributions and Random Samples

9

Two Discrete Random Variables

Definition

The marginal probability mass function of X, denoted by pX(x), is given by

pX(x) = Σ_y p(x, y)   for each possible value x

Similarly, the marginal probability mass function of Y is

pY(y) = Σ_x p(x, y)   for each possible value y.

Page 10: 5 Joint Probability Distributions and Random Samples

10

Example 2

Example 1 continued… The possible X values are x = 100 and x = 250, so computing row totals in the joint probability table yields

pX(100) = p(100, 0) + p(100, 100) + p(100, 200) = .50

and

pX(250) = p(250, 0) + p(250, 100) + p(250, 200) = .50

The marginal pmf of X is then pX(100) = pX(250) = .50, with pX(x) = 0 for any other value of x.

Page 11: 5 Joint Probability Distributions and Random Samples

11

Example 2

Similarly, the marginal pmf of Y is obtained from column totals as pY(0) = .25, pY(100) = .25, and pY(200) = .50,

so P(Y ≥ 100) = pY(100) + pY(200) = .75 as before.

cont’d
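These joint and marginal pmf calculations are easy to reproduce in code. Below is a minimal Python sketch (illustrative only, not part of the original slides) that stores the joint probability table from Example 1 as a dictionary, derives both marginal pmf’s, and recomputes P(Y ≥ 100) two ways.

```python
# Joint pmf from Example 1: keys are (x, y) deductible pairs, values are probabilities.
joint_pmf = {
    (100, 0): .20, (100, 100): .10, (100, 200): .20,
    (250, 0): .05, (250, 100): .15, (250, 200): .30,
}

# Sanity checks: p(x, y) >= 0 and the probabilities sum to 1.
assert all(p >= 0 for p in joint_pmf.values())
assert abs(sum(joint_pmf.values()) - 1.0) < 1e-9

# Marginal pmf's: sum the joint pmf over the other variable (row and column totals).
p_X, p_Y = {}, {}
for (x, y), p in joint_pmf.items():
    p_X[x] = p_X.get(x, 0) + p
    p_Y[y] = p_Y.get(y, 0) + p

# P(Y >= 100) directly from the joint pmf, and again from the marginal pmf of Y.
prob_joint = sum(p for (x, y), p in joint_pmf.items() if y >= 100)
prob_marginal = p_Y[100] + p_Y[200]

print({x: round(p, 2) for x, p in p_X.items()})        # {100: 0.5, 250: 0.5}
print({y: round(p, 2) for y, p in p_Y.items()})        # {0: 0.25, 100: 0.25, 200: 0.5}
print(round(prob_joint, 2), round(prob_marginal, 2))   # 0.75 0.75
```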

Page 12: 5 Joint Probability Distributions and Random Samples

12

Two Continuous Random Variables

Page 13: 5 Joint Probability Distributions and Random Samples

13

Two Continuous Random Variables

The probability that the observed value of a continuous rv X lies in a one-dimensional set A (such as an interval) is obtained by integrating the pdf f (x) over the set A.

Similarly, the probability that the pair (X, Y) of continuous rv’s falls in a two-dimensional set A (such as a rectangle) is obtained by integrating a function called the joint density function.

Page 14: 5 Joint Probability Distributions and Random Samples

14

Two Continuous Random Variables

Definition

Let X and Y be continuous rv’s. A joint probability density function f(x, y) for these two variables is a function satisfying f(x, y) ≥ 0 and ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

Then for any two-dimensional set A,

P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy

Page 15: 5 Joint Probability Distributions and Random Samples

15

Two Continuous Random Variables

In particular, if A is the two-dimensional rectangle {(x, y): a ≤ x ≤ b, c ≤ y ≤ d}, then

P[(X, Y) ∈ A] = P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx

We can think of f(x, y) as specifying a surface at height f(x, y) above the point (x, y) in a three-dimensional coordinate system.

Then P[(X, Y) ∈ A] is the volume underneath this surface and above the region A, analogous to the area under a curve in the case of a single rv.

Page 16: 5 Joint Probability Distributions and Random Samples

16

Two Continuous Random Variables

This is illustrated in Figure 5.1.

Figure 5.1

P[(X, Y) ∈ A] = volume under density surface above A

Page 17: 5 Joint Probability Distributions and Random Samples

17

Example 3

A bank operates both a drive-up facility and a walk-up window. On a randomly selected day, let X = the proportion of time that the drive-up facility is in use and Y = the proportion of time that the walk-up window is in use.

Then the set of possible values for (X, Y) is the rectangle

D = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.

Page 18: 5 Joint Probability Distributions and Random Samples

18

Example 3

Suppose the joint pdf of (X, Y) is given by

To verify that this is a legitimate pdf, note that f(x, y) ≥ 0 and ∫∫_D f(x, y) dx dy = 1.

cont’d

Page 19: 5 Joint Probability Distributions and Random Samples

19

Example 3

The probability that neither facility is busy more than one-quarter of the time is

P(0 ≤ X ≤ 1/4, 0 ≤ Y ≤ 1/4) = ∫_0^{1/4} ∫_0^{1/4} f(x, y) dx dy

cont’d
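The specific formula for f(x, y) is not reproduced above, so the sketch below assumes a hypothetical density of a form commonly used for this kind of example, f(x, y) = 1.2(x + y²) on the unit square (an assumption for illustration, not necessarily the slides’ function), and checks both calculations numerically with scipy.integrate.dblquad.

```python
from scipy.integrate import dblquad

# Assumed joint pdf on D = [0, 1] x [0, 1] (hypothetical; for illustration only).
def f(y, x):                    # dblquad expects the integrand as f(y, x)
    return 1.2 * (x + y ** 2)

# Legitimacy check: the density should integrate to 1 over D.
total, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 1)
print(round(total, 6))          # 1.0

# P(0 <= X <= 1/4, 0 <= Y <= 1/4): integrate over the small corner square.
prob, _ = dblquad(f, 0, 0.25, lambda x: 0, lambda x: 0.25)
print(round(prob, 4))           # about 0.0109 for this assumed density
```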

Page 20: 5 Joint Probability Distributions and Random Samples

20

Example 3

cont’d

Page 21: 5 Joint Probability Distributions and Random Samples

21

Two Continuous Random Variables

The marginal pdf of each variable can be obtained in a manner analogous to what we did in the case of two discrete variables.

The marginal pdf of X at the value x results from holding x fixed in the pair (x, y) and integrating the joint pdf over y. Integrating the joint pdf with respect to x gives the marginal pdf of Y.

Page 22: 5 Joint Probability Distributions and Random Samples

22

Two Continuous Random Variables

Definition

The marginal probability density functions of X and Y, denoted by fX(x) and fY(y), respectively, are given by

fX(x) = ∫_{−∞}^{∞} f(x, y) dy   for −∞ < x < ∞

fY(y) = ∫_{−∞}^{∞} f(x, y) dx   for −∞ < y < ∞

Page 23: 5 Joint Probability Distributions and Random Samples

23

Independent Random Variables

Page 24: 5 Joint Probability Distributions and Random Samples

24

Independent Random Variables

In many situations, information about the observed value of one of the two variables X and Y gives information about the value of the other variable.

In Example 1, the marginal probability of X at x = 250 was .5, as was the probability that X = 100. If, however, we are told that the selected individual had Y = 0, then X = 100 is four times as likely as X = 250.

Thus there is a dependence between the two variables. Earlier, we pointed out that one way of defining independence of two events is via the condition P(A ∩ B) = P(A) · P(B).

Page 25: 5 Joint Probability Distributions and Random Samples

25

Independent Random Variables

Here is an analogous definition for the independence of two rv’s.

Definition

Two random variables X and Y are said to be independent if for every pair of x and y values

p (x, y) = pX (x) pY (y) when X and Y are discrete

or

f (x, y) = fX (x) fY (y) when X and Y are continuous

If (5.1) is not satisfied for all (x, y), then X and Y are said to be dependent.

(5.1)

Page 26: 5 Joint Probability Distributions and Random Samples

26

Independent Random Variables

The definition says that two variables are independent if their joint pmf or pdf is the product of the two marginal pmf’s or pdf’s.

Intuitively, independence says that knowing the value of one of the variables does not provide additional information about what the value of the other variable might be.

Page 27: 5 Joint Probability Distributions and Random Samples

27

Example 6

In the insurance situation of Examples 1 and 2,

p(100, 100) = .10 ≠ (.5)(.25) = pX(100) · pY(100)

so X and Y are not independent.

Independence of X and Y requires that every entry in the joint probability table be the product of the corresponding row and column marginal probabilities.
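A quick way to see this dependence is to compare every cell of the joint table with the product of its marginals; the short Python sketch below (illustrative only, not from the slides) does exactly that.

```python
# Joint pmf of the insurance example; p_X and p_Y are its row and column totals.
joint_pmf = {
    (100, 0): .20, (100, 100): .10, (100, 200): .20,
    (250, 0): .05, (250, 100): .15, (250, 200): .30,
}
p_X = {x: sum(p for (xx, _), p in joint_pmf.items() if xx == x) for x in (100, 250)}
p_Y = {y: sum(p for (_, yy), p in joint_pmf.items() if yy == y) for y in (0, 100, 200)}

# Independent only if every entry equals the product of the corresponding marginals.
independent = all(abs(joint_pmf[(x, y)] - p_X[x] * p_Y[y]) < 1e-9 for (x, y) in joint_pmf)
print(independent)   # False: e.g. p(100, 100) = .10 but pX(100)*pY(100) = .125
```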

Page 28: 5 Joint Probability Distributions and Random Samples

28

Independent Random Variables

Independence of two random variables is most useful when the description of the experiment under study suggests that X and Y have no effect on one another.

Then once the marginal pmf’s or pdf’s have been specified, the joint pmf or pdf is simply the product of the two marginal functions. It follows that

P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) · P(c ≤ Y ≤ d)

Page 29: 5 Joint Probability Distributions and Random Samples

29

5.2 Expected Values, Covariance, and Correlation

Page 30: 5 Joint Probability Distributions and Random Samples

30

Expected Values, Covariance, and Correlation

Proposition

Let X and Y be jointly distributed rv’s with pmf p(x, y) or pdf f (x, y) according to whether the variables are discrete or continuous.

Then the expected value of a function h(X, Y), denoted by E[h(X, Y)] or μ_{h(X, Y)}, is given by

E[h(X, Y)] = Σ_x Σ_y h(x, y) · p(x, y)   if X and Y are discrete

E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) · f(x, y) dx dy   if X and Y are continuous

Page 31: 5 Joint Probability Distributions and Random Samples

31

Example 13

Five friends have purchased tickets to a certain concert. If the tickets are for seats 1–5 in a particular row and the tickets are randomly distributed among the five, what is the expected number of seats separating any particular two of the five?

Let X and Y denote the seat numbers of the first and second individuals, respectively. Possible (X, Y) pairs are {(1, 2), (1, 3), . . . , (5, 4)}, and the joint pmf of (X, Y) is

p(x, y) = 1/20   x = 1, . . . , 5; y = 1, . . . , 5; x ≠ y

p(x, y) = 0      otherwise

Page 32: 5 Joint Probability Distributions and Random Samples

32

Example 13

The number of seats separating the two individuals is h(X, Y) = | X – Y | – 1.

The accompanying table gives h(x, y) for each possible (x, y) pair.

cont’d

Page 33: 5 Joint Probability Distributions and Random Samples

33

Example 13

Thus

E[h(X, Y)] = Σ_{(x, y)} h(x, y) · p(x, y) = Σ_{x ≠ y} (|x – y| – 1) · (1/20) = 1

cont’d
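Because each of the 20 equally likely ordered seat pairs has probability 1/20, E[h(X, Y)] can also be found by brute-force enumeration; here is a small illustrative Python sketch (not from the slides).

```python
from itertools import permutations

# All ordered (x, y) seat pairs with x != y; each pair has probability 1/20.
pairs = list(permutations(range(1, 6), 2))
p = 1 / len(pairs)                              # 1/20

# h(x, y) = |x - y| - 1 = number of seats separating the two individuals.
expected_h = sum((abs(x - y) - 1) * p for x, y in pairs)
print(expected_h)                               # 1.0, i.e. one seat on average
```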

Page 34: 5 Joint Probability Distributions and Random Samples

34

Covariance

Page 35: 5 Joint Probability Distributions and Random Samples

35

Covariance

When two random variables X and Y are not independent, it is frequently of interest to assess how strongly they are related to one another.

Definition

The covariance between two rv’s X and Y is

Cov(X, Y) = E[(X – μX)(Y – μY)]

= Σ_x Σ_y (x – μX)(y – μY) p(x, y)   X, Y discrete

= ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x – μX)(y – μY) f(x, y) dx dy   X, Y continuous

Page 36: 5 Joint Probability Distributions and Random Samples

36

Covariance

That is, since X – μX and Y – μY are the deviations of the two variables from their respective mean values, the covariance is the expected product of deviations. Note that Cov(X, X) = E[(X – μX)²] = V(X).

The rationale for the definition is as follows.

Suppose X and Y have a strong positive relationship to one another, by which we mean that large values of X tend to occur with large values of Y and small values of X with small values of Y.

Page 37: 5 Joint Probability Distributions and Random Samples

37

Covariance

Then most of the probability mass or density will be associated with (x – μX) and (y – μY), either both positive (both X and Y above their respective means) or both negative, so the product (x – μX)(y – μY) will tend to be positive.

Thus for a strong positive relationship, Cov(X, Y) should be quite positive.

For a strong negative relationship, the signs of (x – μX) and (y – μY) will tend to be opposite, yielding a negative product.

Page 38: 5 Joint Probability Distributions and Random Samples

38

Covariance

Thus for a strong negative relationship, Cov(X, Y) should be quite negative.

If X and Y are not strongly related, positive and negative products will tend to cancel one another, yielding a covariance near 0.

Page 39: 5 Joint Probability Distributions and Random Samples

39

Covariance

Figure 5.4 illustrates the different possibilities. The covariance depends on both the set of possible pairs and the probabilities. In Figure 5.4, the probabilities could be changed without altering the set of possible pairs, and this could drastically change the value of Cov(X, Y).

p(x, y) = 1/10 for each of ten pairs corresponding to indicated points:

Figure 5.4

(a) positive covariance; (b) negative covariance; (c) covariance near zero

Page 40: 5 Joint Probability Distributions and Random Samples

40

Example 15

The joint and marginal pmf’s for

X = automobile policy deductible amount and

Y = homeowner policy deductible amount in Example 1 were given in the joint probability table shown earlier,

from which μX = Σ x · pX(x) = 175 and μY = Σ y · pY(y) = 125.

Page 41: 5 Joint Probability Distributions and Random Samples

41

Example 15

Therefore,

Cov(X, Y) = Σ_{(x, y)} (x – 175)(y – 125) p(x, y)

= (100 – 175)(0 – 125)(.20) + . . . + (250 – 175)(200 – 125)(.30)

= 1875

cont’d
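The arithmetic in Example 15 can be verified with a few lines of code. This illustrative sketch (not from the slides) computes Cov(X, Y) both from the definition and from the shortcut formula E(XY) – μX · μY stated on the next slide.

```python
joint_pmf = {
    (100, 0): .20, (100, 100): .10, (100, 200): .20,
    (250, 0): .05, (250, 100): .15, (250, 200): .30,
}

# Means of X and Y taken directly from the joint pmf.
mu_X = sum(x * p for (x, y), p in joint_pmf.items())
mu_Y = sum(y * p for (x, y), p in joint_pmf.items())

# Covariance from the definition E[(X - mu_X)(Y - mu_Y)].
cov_def = sum((x - mu_X) * (y - mu_Y) * p for (x, y), p in joint_pmf.items())

# Covariance from the shortcut formula E(XY) - mu_X * mu_Y.
e_xy = sum(x * y * p for (x, y), p in joint_pmf.items())
cov_short = e_xy - mu_X * mu_Y

print(round(mu_X), round(mu_Y))                    # 175 125
print(round(cov_def, 1), round(cov_short, 1))      # 1875.0 1875.0
```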

Page 42: 5 Joint Probability Distributions and Random Samples

42

Covariance

The following shortcut formula for Cov(X, Y) simplifies the computations.

Proposition

Cov(X, Y) = E(XY) – μX · μY

According to this formula, no intermediate subtractions are necessary; only at the end of the computation is μX · μY subtracted from E(XY). The proof involves expanding (X – μX)(Y – μY) and then taking the expected value of each term separately.

Page 43: 5 Joint Probability Distributions and Random Samples

43

Correlation

Page 44: 5 Joint Probability Distributions and Random Samples

44

Correlation

Definition

The correlation coefficient of X and Y, denoted by Corr(X, Y), ρ_{X,Y}, or just ρ, is defined by

ρ_{X,Y} = Cov(X, Y) / (σX · σY)

Page 45: 5 Joint Probability Distributions and Random Samples

45

Example 17

It is easily verified that in the insurance scenario of Example 15, E(X²) = 36,250,

σX² = 36,250 – (175)² = 5625,

σX = 75, E(Y²) = 22,500,

σY² = 22,500 – (125)² = 6875, and σY = 82.92.

This gives

ρ = Corr(X, Y) = 1875 / [(75)(82.92)] = .301
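Continuing the same sketch, the standard deviations and the correlation coefficient follow directly (illustrative code, not from the slides; the small discrepancy with .301 comes from the rounded standard deviations used above).

```python
import math

joint_pmf = {
    (100, 0): .20, (100, 100): .10, (100, 200): .20,
    (250, 0): .05, (250, 100): .15, (250, 200): .30,
}
mu_X = sum(x * p for (x, y), p in joint_pmf.items())          # 175
mu_Y = sum(y * p for (x, y), p in joint_pmf.items())          # 125

# Variances via E(X^2) - mu_X^2 and E(Y^2) - mu_Y^2, then standard deviations.
var_X = sum(x**2 * p for (x, y), p in joint_pmf.items()) - mu_X**2    # 5625
var_Y = sum(y**2 * p for (x, y), p in joint_pmf.items()) - mu_Y**2    # 6875
sd_X, sd_Y = math.sqrt(var_X), math.sqrt(var_Y)               # 75 and about 82.92

cov = sum((x - mu_X) * (y - mu_Y) * p for (x, y), p in joint_pmf.items())
rho = cov / (sd_X * sd_Y)
print(round(rho, 4))                                          # about 0.3015
```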

Page 46: 5 Joint Probability Distributions and Random Samples

46

Correlation

The following proposition shows that ρ remedies the defect of Cov(X, Y) and also suggests how to recognize the existence of a strong (linear) relationship.

Proposition

1. If a and c are either both positive or both negative,

Corr(aX + b, cY + d) = Corr(X, Y)

2. For any two rv’s X and Y, –1 ≤ Corr(X, Y) ≤ 1.

Page 47: 5 Joint Probability Distributions and Random Samples

47

Correlation

If we think of p(x, y) or f(x, y) as prescribing a mathematical model for how the two numerical variables X and Y are distributed in some population (height and weight, verbal SAT score and quantitative SAT score, etc.), then ρ is a population characteristic or parameter that measures how strongly X and Y are related in the population.

We will consider taking a sample of pairs (x1, y1), . . . , (xn, yn) from the population.

The sample correlation coefficient r will then be defined and used to make inferences about ρ.

Page 48: 5 Joint Probability Distributions and Random Samples

48

Correlation

The correlation coefficient is actually not a completely general measure of the strength of a relationship.

Proposition

1. If X and Y are independent, then ρ = 0, but ρ = 0 does not imply independence.

2. ρ = 1 or –1 iff Y = aX + b for some numbers a and b with a ≠ 0.

Page 49: 5 Joint Probability Distributions and Random Samples

49

Correlation

This proposition says that ρ is a measure of the degree of linear relationship between X and Y, and only when the two variables are perfectly related in a linear manner will ρ be as positive or negative as it can be.

A ρ less than 1 in absolute value indicates only that the relationship is not completely linear, but there may still be a very strong nonlinear relation.

Page 50: 5 Joint Probability Distributions and Random Samples

50

Correlation

Also, ρ = 0 does not imply that X and Y are independent, but only that there is a complete absence of a linear relationship. When ρ = 0, X and Y are said to be uncorrelated.

Two variables could be uncorrelated yet highly dependent because there is a strong nonlinear relationship, so be careful not to conclude too much from knowing that ρ = 0.

Page 51: 5 Joint Probability Distributions and Random Samples

51

Correlation

A value of ρ near 1 does not necessarily imply that increasing the value of X causes Y to increase. It implies only that large X values are associated with large Y values.

For example, in the population of children, vocabulary size and number of cavities are quite positively correlated, but it is certainly not true that cavities cause vocabulary to grow.

Instead, the values of both these variables tend to increase as the value of age, a third variable, increases.

Page 52: 5 Joint Probability Distributions and Random Samples

52

5.3 Statistics and Their Distributions

Page 53: 5 Joint Probability Distributions and Random Samples

53

Statistics and Their Distributions

Definition

A statistic is any quantity whose value can be calculated from sample data. Prior to obtaining data, there is uncertainty as to what value of any particular statistic will result. Therefore, a statistic is a random variable and will be denoted by an uppercase letter; a lowercase letter is used to represent the calculated or observed value of the statistic.

Page 54: 5 Joint Probability Distributions and Random Samples

54

Statistics and Their Distributions

Thus the sample mean, regarded as a statistic (before a sample has been selected or an experiment carried out), is denoted by X̄; the calculated value of this statistic is x̄.

Similarly, S represents the sample standard deviation thought of as a statistic, and its computed value is s.

If samples of two different types of bricks are selected and the individual compressive strengths are denoted by X1, . . . , Xm and Y1, . . . , Yn, respectively, then the statistic X̄ – Ȳ, the difference between the two sample mean compressive strengths, is often of great interest.

Page 55: 5 Joint Probability Distributions and Random Samples

55

Statistics and Their Distributions

The probability distribution of a statistic is sometimes referred to as its sampling distribution to emphasize that it describes how the statistic varies in value across all samples that might be selected.

Page 56: 5 Joint Probability Distributions and Random Samples

56

Random Samples

Page 57: 5 Joint Probability Distributions and Random Samples

57

Random Samples

Definition

The rv’s X1, X2, . . . , Xn are said to form a (simple) random sample of size n if

1. The Xi’s are independent rv’s.

2. Every Xi has the same probability distribution.

Page 58: 5 Joint Probability Distributions and Random Samples

58

Random Samples

Conditions 1 and 2 can be paraphrased by saying that the Xi’s are independent and identically distributed (iid).

If sampling is either with replacement or from an infinite (conceptual) population, Conditions 1 and 2 are satisfied exactly.

These conditions will be approximately satisfied if sampling is without replacement, yet the sample size n is much smaller than the population size N.

Page 59: 5 Joint Probability Distributions and Random Samples

59

Random Samples

In practice, if n/N ≤ .05 (at most 5% of the population is sampled), we can proceed as if the Xi’s form a random sample.

The virtue of this sampling method is that the probability distribution of any statistic can be more easily obtained than for any other sampling method.

There are two general methods for obtaining information about a statistic’s sampling distribution. One method involves calculations based on probability rules, and the other involves carrying out a simulation experiment.

Page 60: 5 Joint Probability Distributions and Random Samples

60

Simulation Experiments

Page 61: 5 Joint Probability Distributions and Random Samples

61

Simulation Experiments

The following characteristics of an experiment must be specified:

1. The statistic of interest (X̄, S, a particular trimmed mean, etc.)

2. The population distribution (normal with μ = 100 and σ = 15, uniform with lower limit A = 5 and upper limit B = 10, etc.)

3. The sample size n (e.g., n = 10 or n = 50)

4. The number of replications k (number of samples to be obtained)

Page 62: 5 Joint Probability Distributions and Random Samples

62

Simulation Experiments

Then use appropriate software to obtain k different random samples, each of size n, from the designated population distribution.

For each sample, calculate the value of the statistic and construct a histogram of the k values. This histogram gives the approximate sampling distribution of the statistic.

The larger the value of k, the better the approximation will tend to be (the actual sampling distribution emerges as k → ∞). In practice, k = 500 or 1000 is usually sufficient if the statistic is “fairly simple.”
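As a concrete illustration of steps 1–4, the sketch below (not from the slides; the population, n, and k are arbitrary choices) draws k = 1000 samples of size n = 10 from a normal population with μ = 100 and σ = 15 and histograms the resulting sample means.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# 1. statistic: the sample mean   2. population: normal(mu=100, sigma=15)
# 3. sample size n = 10           4. number of replications k = 1000
mu, sigma, n, k = 100, 15, 10, 1000

samples = rng.normal(mu, sigma, size=(k, n))   # k samples, each of size n
xbars = samples.mean(axis=1)                   # one sample mean per replication

# The histogram of the k sample means approximates the sampling distribution of X-bar.
plt.hist(xbars, bins=30, density=True, edgecolor="black")
plt.xlabel("sample mean")
plt.title("Approximate sampling distribution of the sample mean (n = 10, k = 1000)")
plt.show()

# The k values should center near mu = 100 with spread near sigma/sqrt(n), about 4.74.
print(round(xbars.mean(), 2), round(xbars.std(ddof=1), 2))
```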

Page 63: 5 Joint Probability Distributions and Random Samples

63

Simulation Experiments

The final aspect of the histograms to note is their spread relative to one another.

The larger the value of n, the more concentrated is the sampling distribution about the mean value. This is why the histograms for n = 20 and n = 30 are based on narrower class intervals than those for the two smaller sample sizes.

For the larger sample sizes, most of the x̄ values are quite close to 8.25. This is the effect of averaging. When n is small, a single unusual x value can result in an x̄ value far from the center.

Page 64: 5 Joint Probability Distributions and Random Samples

64

Simulation Experiments

With a larger sample size, any unusual x values, when averaged in with the other sample values, still tend to yield an x̄ value close to μ.

Combining these insights yields a result that should appeal to your intuition:

X̄ based on a large n tends to be closer to μ than does X̄ based on a small n.

Page 65: 5 Joint Probability Distributions and Random Samples

65

5.4 The Distribution of the Sample Mean

Page 66: 5 Joint Probability Distributions and Random Samples

66

The Distribution of the Sample Mean

The importance of the sample mean X̄ springs from its use in drawing conclusions about the population mean μ. Some of the most frequently used inferential procedures are based on properties of the sampling distribution of X̄.

A preview of these properties appeared in the calculations and simulation experiments of the previous section, where we noted relationships between E(X̄) and μ and also among V(X̄), σ², and n.

Page 67: 5 Joint Probability Distributions and Random Samples

67

The Distribution of the Sample Mean

Proposition

Let X1, X2, . . . , Xn be a random sample from a distribution with mean value μ and standard deviation σ. Then

1. E(X̄) = μX̄ = μ

2. V(X̄) = σX̄² = σ²/n and σX̄ = σ/√n

In addition, with To = X1 + . . . + Xn (the sample total), E(To) = nμ, V(To) = nσ², and σTo = √n · σ.

Page 68: 5 Joint Probability Distributions and Random Samples

68

The Distribution of the Sample Mean

The sampling distribution of X̄ is centered precisely at the mean of the population.

The X̄ distribution becomes more concentrated about μ as the sample size n increases.

The distribution of To becomes more spread out as n increases. Averaging moves probability in toward the middle, whereas totaling spreads probability out over a wider and wider range of values.

The standard deviation σX̄ = σ/√n is often called the standard error of the mean.

Page 69: 5 Joint Probability Distributions and Random Samples

69

Example 24

In a notched tensile fatigue test on a titanium specimen, the expected number of cycles to first acoustic emission (used to indicate crack initiation) is μ = 28,000, and the standard deviation of the number of cycles is σ = 5000.

Let X1, X2, . . . , X25 be a random sample of size 25, where each Xi is the number of cycles on a different randomly selected specimen.

Then the expected value of the sample mean number of cycles until first emission is E(X̄) = μ = 28,000, and the expected total number of cycles for the 25 specimens is E(To) = nμ = 25(28,000) = 700,000.

Page 70: 5 Joint Probability Distributions and Random Samples

70

Example 24

The standard deviation of X̄ (the standard error of the mean) and of To are

σX̄ = σ/√n = 5000/√25 = 1000        σTo = √n · σ = √25 (5000) = 25,000

If the sample size increases to n = 100, E(X̄) is unchanged, but σX̄ = 500, half of its previous value (the sample size must be quadrupled to halve the standard deviation of X̄).

cont’d

Page 71: 5 Joint Probability Distributions and Random Samples

71

The Case of a Normal Population Distribution

Page 72: 5 Joint Probability Distributions and Random Samples

72

The Case of a Normal Population Distribution

Proposition

Let X1, X2, . . . , Xn be a random sample from a normal distribution with mean μ and standard deviation σ. Then for any n, X̄ is normally distributed (with mean μ and standard deviation σ/√n), as is To (with mean nμ and standard deviation √n · σ).

We know everything there is to know about the X̄ and To distributions when the population distribution is normal. In particular, probabilities such as P(a ≤ X̄ ≤ b) and P(c ≤ To ≤ d) can be obtained simply by standardizing.

Page 73: 5 Joint Probability Distributions and Random Samples

73

The Case of a Normal Population Distribution

Figure 5.14 illustrates the proposition.

A normal population distribution and sampling distributions

Figure 5.14

Page 74: 5 Joint Probability Distributions and Random Samples

74

Example 25

The time that it takes a randomly selected rat of a certain subspecies to find its way through a maze is a normally distributed rv with μ = 1.5 min and σ = .35 min. Suppose five rats are selected.

Let X1, . . . , X5 denote their times in the maze. Assuming the Xi’s to be a random sample from this normal distribution, what is the probability that the total time To = X1 + . . . + X5 for the five is between 6 and 8 min?

Page 75: 5 Joint Probability Distributions and Random Samples

75

Example 25

By the proposition, To has a normal distribution with mean μTo = nμ = 5(1.5) = 7.5

and

variance σTo² = nσ² = 5(.1225) = .6125, so σTo = .783.

To standardize To, subtract μTo and divide by σTo:

P(6 ≤ To ≤ 8) = P((6 – 7.5)/.783 ≤ Z ≤ (8 – 7.5)/.783)

cont’d

Page 76: 5 Joint Probability Distributions and Random Samples

76

Example 25

Determination of the probability that the sample average time X̄ (a normally distributed variable) is at most 2.0 min requires μX̄ = μ = 1.5 and σX̄ = σ/√n = .35/√5 = .1565.

Then

cont’d
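Both probabilities in Example 25 can be evaluated by standardizing or, equivalently, directly from a normal cdf; the following illustrative sketch uses scipy.stats.norm (not part of the original slides).

```python
import math
from scipy.stats import norm

mu, sigma, n = 1.5, 0.35, 5

# Total time To is normal with mean n*mu = 7.5 and sd sqrt(n)*sigma, about 0.783.
mu_T = n * mu
sd_T = math.sqrt(n) * sigma
p_total = norm.cdf(8, mu_T, sd_T) - norm.cdf(6, mu_T, sd_T)
print(round(p_total, 4))            # about 0.711

# Sample mean X-bar is normal with mean mu = 1.5 and sd sigma/sqrt(n), about 0.1565.
sd_xbar = sigma / math.sqrt(n)
p_mean = norm.cdf(2.0, mu, sd_xbar)
print(round(p_mean, 4))             # about 0.9993
```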

Page 77: 5 Joint Probability Distributions and Random Samples

77

The Central Limit Theorem

Page 78: 5 Joint Probability Distributions and Random Samples

78

The Central Limit Theorem

When the Xi’s are normally distributed, so is X̄ for every sample size n.

Even when the population distribution is highly nonnormal, averaging produces a distribution more bell-shaped than the one being sampled.

A reasonable conjecture is that if n is large, a suitable normal curve will approximate the actual distribution of X̄. The formal statement of this result is the most important theorem of probability.

Page 79: 5 Joint Probability Distributions and Random Samples

79

The Central Limit Theorem

Theorem

The Central Limit Theorem (CLT)

Let X1, X2, . . . , Xn be a random sample from a distribution with mean μ and variance σ². Then if n is sufficiently large, X̄ has approximately a normal distribution with μX̄ = μ and σX̄² = σ²/n, and To also has approximately a normal distribution with μTo = nμ and σTo² = nσ². The larger the value of n, the better the approximation.

Page 80: 5 Joint Probability Distributions and Random Samples

80

The Central Limit Theorem

Figure 5.15 illustrates the Central Limit Theorem.

The Central Limit Theorem illustrated

Figure 5.15

Page 81: 5 Joint Probability Distributions and Random Samples

81

Example 26

The amount of a particular impurity in a batch of a certain chemical product is a random variable with mean value 4.0 g and standard deviation 1.5 g.

If 50 batches are independently prepared, what is the (approximate) probability that the sample average amount of impurity is between 3.5 and 3.8 g?

According to the rule of thumb to be stated shortly, n = 50 is large enough for the CLT to be applicable.

Page 82: 5 Joint Probability Distributions and Random Samples

82

Example 26

X̄ then has approximately a normal distribution with mean value μX̄ = 4.0 and σX̄ = 1.5/√50 = .2121,

so

cont’d
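A short illustrative sketch of the CLT calculation in Example 26 (not from the slides), again using scipy.stats.norm:

```python
import math
from scipy.stats import norm

mu, sigma, n = 4.0, 1.5, 50

# By the CLT, X-bar is approximately normal with mean mu and sd sigma/sqrt(n), about 0.2121.
sd_xbar = sigma / math.sqrt(n)
p = norm.cdf(3.8, mu, sd_xbar) - norm.cdf(3.5, mu, sd_xbar)
print(round(p, 4))                  # about 0.164
```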

Page 83: 5 Joint Probability Distributions and Random Samples

83

The Central Limit Theorem

The CLT provides insight into why many random variables have probability distributions that are approximately normal.

For example, the measurement error in a scientific experiment can be thought of as the sum of a number of underlying perturbations and errors of small magnitude.

A practical difficulty in applying the CLT is in knowing when n is sufficiently large. The problem is that the accuracy of the approximation for a particular n depends on the shape of the original underlying distribution being sampled.

Page 84: 5 Joint Probability Distributions and Random Samples

84

The Central Limit Theorem

If the underlying distribution is close to a normal density curve, then the approximation will be good even for a small n, whereas if it is far from being normal, then a large n will be required.

Rule of Thumb

If n > 30, the Central Limit Theorem can be used.

There are population distributions for which even an n of 40 or 50 does not suffice, but such distributions are rarely encountered in practice.

Page 85: 5 Joint Probability Distributions and Random Samples

85

The Central Limit Theorem

On the other hand, the rule of thumb is often conservative; for many population distributions, an n much less than 30 would suffice.

For example, in the case of a uniform population distribution, the CLT gives a good approximation for n ≥ 12.

Page 86: 5 Joint Probability Distributions and Random Samples

86

5.5 The Distribution of a Linear Combination

Page 87: 5 Joint Probability Distributions and Random Samples

87

The Distribution of a Linear Combination

The sample mean X̄ and sample total To are special cases of a type of random variable that arises very frequently in statistical applications.

Definition

Given a collection of n random variables X1, . . . , Xn and n numerical constants a1, . . . , an, the rv

Y = a1X1 + a2X2 + . . . + anXn = Σ_{i=1}^{n} aiXi

is called a linear combination of the Xi’s.

(5.7)

Page 88: 5 Joint Probability Distributions and Random Samples

88

The Distribution of a Linear Combination

For example, 4X1 – 5X2 + 8X3 is a linear combination of X1, X2, and X3 with a1 = 4, a2 = –5, and a3 = 8.

Taking a1 = a2 = . . . = an = 1 gives Y = X1 + . . . + Xn = To,

and a1 = a2 = . . . = an = 1/n yields Y = X̄.

Page 89: 5 Joint Probability Distributions and Random Samples

89

The Distribution of a Linear Combination

Proposition

Let X1, X2, . . . , Xn have mean values μ1, . . . , μn, respectively, and variances σ1², . . . , σn², respectively.

1. Whether or not the Xi’s are independent,

E(a1X1 + a2X2 + . . . + anXn) = a1E(X1) + a2E(X2) + . . . + anE(Xn)

= a1μ1 + . . . + anμn

(5.8)

2. If X1, . . . , Xn are independent,

V(a1X1 + a2X2 + . . . + anXn) = a1²V(X1) + a2²V(X2) + . . . + an²V(Xn) = a1²σ1² + . . . + an²σn²

(5.9)

Page 90: 5 Joint Probability Distributions and Random Samples

90

The Distribution of a Linear Combination

And

σ_{a1X1 + . . . + anXn} = √(a1²σ1² + . . . + an²σn²)

(5.10)

3. For any X1, . . . , Xn,

V(a1X1 + . . . + anXn) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj)

(5.11)

Page 91: 5 Joint Probability Distributions and Random Samples

91

Example 29

A gas station sells three grades of gasoline: regular, extra, and super.

These are priced at $3.00, $3.20, and $3.40 per gallon, respectively.

Let X1, X2, and X3 denote the amounts of these grades purchased (gallons) on a particular day.

Suppose the Xi’s are independent with μ1 = 1000, μ2 = 500, μ3 = 300, σ1 = 100, σ2 = 80, and σ3 = 50.

Page 92: 5 Joint Probability Distributions and Random Samples

92

Example 29

The revenue from sales is Y = 3.0X1 + 3.2X2 + 3.4X3, and

E(Y) = 3.0μ1 + 3.2μ2 + 3.4μ3

= $5620

cont’d
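Using equation (5.9), the variance and standard deviation of the revenue follow from the same ingredients; a brief illustrative sketch (not from the slides):

```python
import math

prices = [3.00, 3.20, 3.40]     # a1, a2, a3: price per gallon of each grade
means  = [1000, 500, 300]       # mu1, mu2, mu3: expected gallons sold
sds    = [100, 80, 50]          # sigma1, sigma2, sigma3

# E(Y) = sum of a_i * mu_i (holds whether or not the X_i are independent).
e_rev = sum(a * m for a, m in zip(prices, means))

# V(Y) = sum of a_i^2 * sigma_i^2, valid because the X_i are assumed independent.
v_rev = sum(a**2 * s**2 for a, s in zip(prices, sds))

print(round(e_rev, 2))                              # 5620.0 dollars
print(round(v_rev, 2), round(math.sqrt(v_rev), 2))  # 184436.0 and about 429.46
```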

Page 93: 5 Joint Probability Distributions and Random Samples

93

The Difference Between Two Random Variables

Page 94: 5 Joint Probability Distributions and Random Samples

94

The Difference Between Two Random Variables

An important special case of a linear combination results from taking n = 2, a1 = 1, and a2 = –1:

Y = a1X1 + a2X2 = X1 – X2

We then have the following corollary to the proposition.

Corollary

E(X1 – X2) = E(X1) – E(X2) for any two rv’s X1 and X2.

V(X1 – X2) = V(X1) + V(X2) if X1 and X2 are independent rv’s.

Page 95: 5 Joint Probability Distributions and Random Samples

95

Example 30

A certain automobile manufacturer equips a particular model with either a six-cylinder engine or a four-cylinder engine.

Let X1 and X2 be fuel efficiencies for independently and randomly selected six-cylinder and four-cylinder cars, respectively. With μ1 = 22, μ2 = 26, σ1 = 1.2, and σ2 = 1.5,

E(X1 – X2) = μ1 – μ2

= 22 – 26

= –4

V(X1 – X2) = σ1² + σ2² = (1.2)² + (1.5)² = 3.69

σ_{X1 – X2} = √3.69 = 1.92

Page 96: 5 Joint Probability Distributions and Random Samples

96

Example 30

If we relabel so that X1 refers to the four-cylinder car, then E(X1 – X2) = 4, but the variance of the difference is still 3.69.

cont’d
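A tiny illustrative sketch of the corollary applied to Example 30 (not from the slides):

```python
import math

mu1, mu2 = 22, 26        # mean fuel efficiencies: six-cylinder, four-cylinder
sd1, sd2 = 1.2, 1.5

# Expected value of a difference, and variance of a difference of independent rv's.
e_diff = mu1 - mu2                   # -4
v_diff = sd1**2 + sd2**2             # 1.44 + 2.25 = 3.69
print(e_diff, round(v_diff, 2), round(math.sqrt(v_diff), 2))   # -4 3.69 1.92
```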

Page 97: 5 Joint Probability Distributions and Random Samples

97

The Case of Normal Random Variables

Page 98: 5 Joint Probability Distributions and Random Samples

98

The Case of Normal Random Variables

When the Xi’s form a random sample from a normal distribution, X̄ and To are both normally distributed. Here is a more general result concerning linear combinations.

Proposition

If X1, X2, . . . , Xn are independent, normally distributed rv’s (with possibly different means and/or variances), then any linear combination of the Xi’s also has a normal distribution. In particular, the difference X1 – X2 between two independent, normally distributed variables is itself normally distributed.

Page 99: 5 Joint Probability Distributions and Random Samples

99

The Case of Normal Random Variables

The CLT can also be generalized so it applies to certain linear combinations. Roughly speaking, if n is large and no individual term is likely to contribute too much to the overall value, then Y has approximately a normal distribution.