
Review of Probability

References

• Hogg, R. V., and A. T. Craig, 1995. Introduction to Mathematical Statistics. Prentice-Hall.

• Stock, J. H., and M. W. Watson, 2003. Introduction to Econometrics. Addison-Wesley. (Chapter 2; advanced undergraduate level book)

• Zivot, E., 2002. Lecture Notes on Applied Econometric Modeling in Finance. Available at: http://faculty.washington.edu/ezivot/econ483/483notes.htm

• Greene, W. H., 2000. Econometric Analysis. Prentice-Hall. (Chapters 3, 4; introductory graduate level book)

1 Random Variables and Probability Distributions

We view an observation on some aspect of the economy (or of real life) as the outcome of a random experiment. The probability of an outcome is the proportion of the time that the outcome occurs in the long run. If the probability of your computer not crashing while you are doing a problem set is 90%, then over the course of doing many problem sets, you will complete 90% of them without a crash.

The set of all possible outcomes is called the sample space, denoted S_X. An event is a subset of the sample space, i.e. an event is a set of one or more outcomes.

A random variable is a numerical summary of a random outcome. Some random variables are discrete and some are continuous.

Definition. A random variable X is a variable that can take on a given set of values, called the sample space and denoted S_X, where the likelihood of the values in S_X is determined by X's probability distribution (pdf).

1.1 Probability Distribution of a Discrete Random Variable

The probability distribution of a discrete random variable, denoted f(x), is the list of all possible values of the variable and the probability that each value occurs, i.e. f(x) = Pr(X = x). The pdf must satisfy (i) f(x) ≥ 0 for all x ∈ S_X; (ii) f(x) = 0 for all x ∉ S_X; and (iii) Σ_{x∈S_X} f(x) = 1.

The cumulative probability distribution (cdf), denoted F, is the probability that the random variable is less than or equal to a particular value:

F(x) = Pr(X ≤ x),  −∞ ≤ x ≤ ∞  (1)

The cdf has the following properties:

1. If x₁ < x₂ then F(x₁) ≤ F(x₂)

2. F(−∞) = 0 and F(∞) = 1

3. Pr(X > x) = 1 − F(x)

4. Pr(x₁ < X ≤ x₂) = F(x₂) − F(x₁)

Table 1: Probability of Your Computer Crashing M Times

Outcome (number of crashes)      0     1     2     3     4
Probability distribution       0.80  0.10  0.06  0.03  0.01
Cumulative prob. distribution  0.80  0.90  0.96  0.99  1.00
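Table 1 doubles as a small worked example: the cdf row is just the running sum of the pdf row. The following is a minimal sketch using numpy (the variable names are ours, purely illustrative):

```python
# Sketch: pdf and cdf of the crash count M from Table 1.
import numpy as np

pdf = np.array([0.80, 0.10, 0.06, 0.03, 0.01])  # Pr(M = m) for m = 0,...,4
cdf = np.cumsum(pdf)                            # F(m) = Pr(M <= m)

assert np.isclose(pdf.sum(), 1.0)  # property (iii): probabilities sum to 1
print(cdf)                         # [0.80 0.90 0.96 0.99 1.00]
print(1 - cdf[1])                  # Pr(M > 1) = 1 - F(1) = 0.10
```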

An important special case of a discrete random variable is a binary one, i.e. the outcomes are 0 and 1. A binary random variable is called a Bernoulli random variable and its probability distribution is called the Bernoulli distribution.

1.2 Probability Distribution of a Continuous Random Variable

Unlike a discrete random variable, a continuous random variable takes on a continuum of possible values, and it is not possible to list the probability of each possible value. The probability is summarized by the probability density function (pdf), denoted f(x). The area under the density function between any two points is the probability that the random variable falls between those two points.

Formally, the pdf of a continuous random variable X is a nonnegative function f(x), defined on the real line, such that for any interval A:

Pr(X ∈ A) = ∫_A f(x) dx  (2)

That is, Pr(X ∈ A) is the area under the probability curve over the interval A. The pdf f(x) must satisfy (i) f(x) ≥ 0; and (ii) ∫_{−∞}^{∞} f(x) dx = 1.

Example 1

Let the random variable X of the continuous type have the pdf f(x) = 2/x³, 1 < x < ∞, zero elsewhere. The distribution function of X is

F(x) = ∫_{−∞}^{x} 0 dw = 0,  x < 1

F(x) = ∫_{1}^{x} (2/w³) dw = 1 − 1/x²,  x ≥ 1


Figure 1: The pdf and cdf of the random variable in Example 1.
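The closed form for F can be checked by numerical integration; a minimal sketch using scipy quadrature (our own check, not part of the original example):

```python
# Sketch: numerically verify F(x) = 1 - 1/x**2 for f(x) = 2/x**3, x > 1.
from scipy.integrate import quad

f = lambda w: 2.0 / w**3

for x in [1.5, 2.0, 5.0]:
    numeric, _ = quad(f, 1.0, x)       # integral of f from 1 to x
    closed_form = 1.0 - 1.0 / x**2
    print(x, numeric, closed_form)     # the two columns should agree
```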

Example 2

Let f(x) = 1/2, −1 < x < 1, zero elsewhere, be the pdf of the random variable X. Define the random variable Y by Y = X². We wish to find the pdf of Y. If y ≥ 0, the probability Pr(Y ≤ y) is equivalent to

Pr(X² ≤ y) = Pr(−√y ≤ X ≤ √y)  (3)

Accordingly, the distribution function of Y, G(y) = Pr(Y ≤ y), is given by

G(y) = 0,  y < 0

G(y) = ∫_{−√y}^{√y} (1/2) dx = √y,  0 ≤ y < 1

G(y) = 1,  y ≥ 1

Since Y is a random variable of the continuous type, the pdf of Y is g(y) = G′(y) at all points of continuity of g(y). Therefore,

g(y) = 1/(2√y),  0 < y < 1

g(y) = 0,  elsewhere

2 Expected Values, Mean, and Variance

The expected value of a random variable Y, denoted E(Y), is the long-run average value of the random variable over many repeated trials or occurrences. The expected value of Y is also called the expectation of Y or the mean of Y.

The terminology of expectation or expected value has its origin in games of chance. This can be illustrated as follows: four similar chips, numbered 1, 1, 1, and 2, respectively, are placed in a bowl and are mixed. A player is blindfolded and is to draw a chip from the bowl. If she draws one of the three chips numbered 1, she receives one dollar. If she draws the chip numbered 2, she receives two dollars. It seems reasonable to assume that the player has a "3/4 claim" on the $1 and a "1/4 claim" on the $2. Her "total claim" is 1 × 3/4 + 2 × 1/4 = 1.25. Thus the expectation of X is precisely the player's claim in this game.

Suppose that the random variable Y takes on k possible outcomes, y₁, y₂, ..., y_k, where y₁ denotes the first value, y₂ denotes the second value, and so on, and the probability that Y takes on y₁ is p₁, the probability that Y takes on y₂ is p₂, and so forth. The expected value of Y is:

E(Y) = y₁p₁ + y₂p₂ + ... + y_k p_k = Σ_{i=1}^{k} y_i p_i  (4)

The expected value of a continuous random variable is

E[X] = ∫_a^b x f(x) dx  (5)

where f(x) is the probability density function and X takes on values between points a and b.

As an example, consider the number of computer crashes M with the probability distribution given in Table 1. The expected value of M is the average number of crashes over many problem sets, weighted by the frequency with which a crash of a given size occurs:

E(M) = 0 × 0.80 + 1 × 0.10 + 2 × 0.06 + 3 × 0.03 + 4 × 0.01 = 0.35  (6)

That is, the expected number of computer crashes while doing a particular problem set is 0.35. Obviously, the actual number of crashes is always an integer; the calculation above just means that the average number of crashes over many problem sets is 0.35.
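Equation (6) is a probability-weighted average, i.e. a dot product of outcomes and probabilities; a one-line check (our own sketch):

```python
# Sketch: expected value of M as a probability-weighted average, equation (6).
import numpy as np

m   = np.array([0, 1, 2, 3, 4])
pdf = np.array([0.80, 0.10, 0.06, 0.03, 0.01])

print(m @ pdf)  # 0.35
```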

Example 3

Let the random variable X of the discrete type have the pdf given by the table

x      1     2     3     4
f(x)  4/10  1/10  3/10  2/10

Here f(x) = 0 if x is not equal to one of the first four positive integers. This illustrates the fact that there is no need to have a formula to describe a pdf. We have

E(X) = 1 × 4/10 + 2 × 1/10 + 3 × 3/10 + 4 × 2/10 = 2.3  (7)

Example 4

Let X have the pdf:

f(x) = 4x³,  0 < x < 1

f(x) = 0,  elsewhere

Then

E(X) = ∫_0^1 x(4x³) dx = ∫_0^1 4x⁴ dx = [4x⁵/5]₀¹ = 4/5  (8)
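The integral in (8) is easy to verify by quadrature; a minimal sketch:

```python
# Sketch: check E(X) = 4/5 for f(x) = 4x**3 on (0, 1) by quadrature.
from scipy.integrate import quad

mean, _ = quad(lambda x: x * 4 * x**3, 0.0, 1.0)
print(mean)  # 0.8
```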

2.1 Variance, Standard Deviation, Moments, Skewness and Kurtosis

The variance and standard deviation measure the dispersion or the "spread" of a probability distribution. The variance of a random variable Y is the expected value of the square of the deviation of Y from its mean:

Var(Y) = E[(Y − E(Y))²]

Var(Y) = Σ_y (y − µ_Y)² f(y),  if Y is discrete

Var(Y) = ∫_y (y − µ_Y)² f(y) dy,  if Y is continuous


The standard deviation is the square root of the variance.

The mean of Y, E(Y), is also called the first moment of Y, and the expected value of the square of Y, E(Y²), is also called the second moment of Y. In general, the expected value of Y^r is called the r-th moment of the random variable Y.

Example 5

Let X have the pdf

f(x) = (1/2)(x + 1),  −1 < x < 1

f(x) = 0,  elsewhere

Then the mean value of X is

µ = ∫_{−∞}^{∞} x f(x) dx = ∫_{−1}^{1} x (x + 1)/2 dx = 1/3  (9)

while the variance of X is

σ² = ∫_{−∞}^{∞} x² f(x) dx − µ² = ∫_{−1}^{1} x² (x + 1)/2 dx − (1/3)² = 2/9  (10)
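Both moments can be verified numerically; a minimal sketch using scipy quadrature:

```python
# Sketch: check the mean and variance in Example 5 by quadrature.
from scipy.integrate import quad

f = lambda x: 0.5 * (x + 1)                      # pdf on (-1, 1)

mu, _  = quad(lambda x: x * f(x), -1.0, 1.0)     # first moment
ex2, _ = quad(lambda x: x**2 * f(x), -1.0, 1.0)  # second moment

print(mu, ex2 - mu**2)  # 0.3333... and 0.2222..., i.e. 1/3 and 2/9
```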

Example 6

It is known that the series

1/1² + 1/2² + 1/3² + ...  (11)

converges to π²/6. Then

f(x) = 6/(π²x²),  x = 1, 2, 3, ...,

f(x) = 0,  elsewhere  (12)

is the pdf of a discrete type of random variable.

The skewness of a random variable X, denoted skew(X), measures the symmetry of a distribution about its mean. The skewness is defined as:

skew(X) = E[(X − µ_X)³] / σ_X³

skew(X) = Σ_{x∈S_X} (x − µ_X)³ Pr(X = x) / σ_X³,  for a discrete random variable

skew(X) = ∫_{−∞}^{∞} (x − µ_X)³ f(x) dx / σ_X³,  for a continuous random variable

If the random variable X has a symmetric distribution then skew(X) = 0. If skew(X) > 0 then the distribution of X has a long right tail (positive values are more likely than negative ones), and if skew(X) < 0 the distribution of X has a long left tail.

The kurtosis of a random variable X, denoted kurt(X), measures the thickness of the tails of a distribution. The kurtosis is defined as:

kurt(X) = E[(X − µ_X)⁴] / σ_X⁴

kurt(X) = Σ_{x∈S_X} (x − µ_X)⁴ Pr(X = x) / σ_X⁴,  for a discrete random variable

kurt(X) = ∫_{−∞}^{∞} (x − µ_X)⁴ f(x) dx / σ_X⁴,  for a continuous random variable

The normal distribution has a kurtosis of 3. If a distribution has kurtosis greater than 3, it has thicker tails than the normal distribution; if a distribution has kurtosis less than 3, it has thinner tails than the normal distribution.
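Sample skewness and kurtosis of simulated draws illustrate these definitions: normal draws should give roughly 0 and 3, while right-skewed exponential draws give roughly 2 and 9. A sketch (the sample sizes are arbitrary):

```python
# Sketch: sample skewness and (Pearson) kurtosis of simulated draws.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)   # symmetric, normal tails
x = rng.exponential(size=1_000_000)  # right-skewed, thicker right tail

print(stats.skew(z), stats.kurtosis(z, fisher=False))  # approx. 0 and 3
print(stats.skew(x), stats.kurtosis(x, fisher=False))  # approx. 2 and 9
```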

2.2 Mean and Variance of a Linear Function of a Random Variable

Suppose that after-tax earnings Y are related to pre-tax earnings X by the equation:

Y = 2,000 + 0.8X  (13)

where $2,000 is the amount of a grant. Suppose an individual's pre-tax earnings next year are a random variable with mean µ_X and variance σ_X². Since pre-tax earnings are random, so are after-tax earnings. The mean and variance of Y are as follows:

E(Y) = µ_Y = 2,000 + 0.8µ_X

σ_Y² = 0.8²σ_X²  (14)

In general, if Y depends on X with an intercept a and a slope b, so that:

Y = a + bX  (15)

then the mean and variance of Y are:

µ_Y = a + bµ_X

σ_Y² = b²σ_X²  (16)


3 Two Random Variables

The joint probability distribution of two discrete random variables, X and Y, is the probability that the random variables simultaneously take on certain values, x and y.

Consider the example of a joint distribution of two variables in Table 2. Let Y be a binary random variable that equals one if the commute is short (less than 20 minutes) and zero otherwise, and let X be a binary random variable that equals zero if it is raining and one if not. The joint distribution is the frequency with which each of these four outcomes occurs over many repeated commutes.

Table 2: Joint Distribution of Weather Conditions and Commuting Times

                      Rain (X=0)   No Rain (X=1)   Total
Long Commute (Y=0)       0.15          0.07         0.22
Short Commute (Y=1)      0.15          0.63         0.78
Total                    0.30          0.70         1.00

Formally, the joint density function for two random variables X and Y, denoted f(x, y), is defined so that:

Pr(a ≤ X ≤ b, c ≤ Y ≤ d) = Σ_{a≤x≤b} Σ_{c≤y≤d} f(x, y)  (17)

if X and Y are discrete, and

Pr(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx  (18)

if X and Y are continuous.

The marginal probability distribution of a random variable Y is just another name for its probability distribution. The term is used to distinguish the distribution of Y alone (the marginal distribution) from the joint distribution of Y and another random variable.

The marginal distribution of Y can be computed from the joint distribution of X and Y by adding up the probabilities of all possible outcomes for which Y takes on a specified value. For example, in Table 2, the probability of a long rainy commute is 15% and the probability of a long commute with no rain is 7%, so the probability of a long commute (rainy or not) is 22%.

Formally, to obtain the marginal distribution from the joint density, it is necessary to sum or integrate out the other variable:

f_X(x) = Σ_{y∈S_Y} f(x, y),  in the discrete case

f_X(x) = ∫_y f(x, y) dy,  in the continuous case

and similarly for f_Y(y).

Example 7

Let X₁ and X₂ have the joint pdf

f(x₁, x₂) = x₁ + x₂,  0 < x₁ < 1, 0 < x₂ < 1  (19)

f(x₁, x₂) = 0,  elsewhere  (20)

The marginal pdf of X₁ is

f₁(x₁) = ∫_0^1 (x₁ + x₂) dx₂ = x₁ + 1/2,  0 < x₁ < 1  (21)

zero elsewhere, and the marginal pdf of X₂ is

f₂(x₂) = ∫_0^1 (x₁ + x₂) dx₁ = x₂ + 1/2,  0 < x₂ < 1  (22)

zero elsewhere. A probability like Pr(X₁ ≤ 1/2) can be computed from either f₁(x₁) or f(x₁, x₂) because

∫_0^{1/2} ∫_0^1 f(x₁, x₂) dx₂ dx₁ = ∫_0^{1/2} f₁(x₁) dx₁ = 3/8  (23)

However, to find a probability like Pr(X₁ + X₂ ≤ 1), one must use the joint pdf.
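Both routes to Pr(X₁ ≤ 1/2) in equation (23) can be reproduced numerically; a sketch using scipy's single and double quadrature (our own illustration):

```python
# Sketch: Pr(X1 <= 1/2) two ways for the joint pdf f(x1, x2) = x1 + x2.
from scipy.integrate import quad, dblquad

# Route 1: joint pdf; dblquad integrates the first argument (x2) innermost,
# with x1 running over (0, 0.5) and x2 over (0, 1).
p_joint, _ = dblquad(lambda x2, x1: x1 + x2, 0.0, 0.5, 0.0, 1.0)

# Route 2: marginal pdf f1(x1) = x1 + 1/2.
p_marg, _ = quad(lambda x1: x1 + 0.5, 0.0, 0.5)

print(p_joint, p_marg)  # both 0.375 = 3/8
```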

3.1 Conditional Distributions

The distribution of a random variable Y conditional on another random variable X taking on a specific value is called the conditional distribution of Y given X.

What is the probability of a long commute (Y = 0) if you know it is raining (X = 0)? From Table 2, the joint probability of a rainy short commute is 15% and the joint probability of a rainy long commute is 15%, so if it is raining, a long commute and a short commute are equally likely. Thus the probability of a long commute (Y = 0), conditional on it being rainy (X = 0), is 50%, i.e. Pr(Y = 0|X = 0) = 0.50.

In general, the conditional distribution of Y given X = x is

Pr(Y = y|X = x) = Pr(X = x, Y = y) / Pr(X = x)  (24)

or

f(y|x) = f(x, y) / f_X(x)  (25)

3.2 Conditional Expectation

The conditional expectation of Y given X, also called the conditional mean of Y given X, is the mean of the conditional distribution of Y given X. That is, the conditional expectation is the expected value of Y, computed using the conditional distribution of Y given X. If Y takes on values y₁, ..., y_k, then the conditional mean of Y given X = x is

E(Y|X = x) = Σ_{i=1}^{k} y_i Pr(Y = y_i|X = x)  (26)

or

E(Y|x) = Σ_y y f(y|x),  if Y is discrete  (27)

E(Y|x) = ∫_y y f(y|x) dy,  if Y is continuous  (28)

Consider the example in Table 3. The expected number of computer crashes, given that the computer is old, is E(M|A = 0) = 0 × 0.70 + 1 × 0.13 + 2 × 0.10 + 3 × 0.05 + 4 × 0.02 = 0.56. The expected number of computer crashes, given that the computer is new, is E(M|A = 1) = 0.14.

Table 3: Joint and Conditional Distributions of Computer Crashes (M) and Computer Age (A)

A. Joint Distribution

                     M=0    M=1    M=2    M=3    M=4   Total
Old computer (A=0)   0.35   0.065  0.05   0.025  0.01   0.5
New computer (A=1)   0.45   0.035  0.01   0.005  0.00   0.5
Total                0.80   0.10   0.06   0.03   0.01   1.0

B. Conditional Distributions of M given A

              M=0    M=1    M=2    M=3    M=4   Total
Pr(M|A=0)     0.70   0.13   0.10   0.05   0.02   1.0
Pr(M|A=1)     0.90   0.07   0.02   0.01   0.00   1.0

3.3 The Law of Iterated Expectations

The mean of Y is the weighted average of the conditional expectations of Y given X, weighted by the probability distribution of X. For example, the mean number of crashes M is the weighted average of the conditional expectation of M given that the computer is old and the conditional expectation of M given that it is new: E(M) = E(M|A = 0) × Pr(A = 0) + E(M|A = 1) × Pr(A = 1) = 0.56 × 0.50 + 0.14 × 0.50 = 0.35. This is the mean of the marginal distribution calculated before (Table 1).

Formally, the expectation of Y is the expectation of the conditional expectation of Y given X, that is:

E(Y) = E_X[E(Y|X)]  (29)

where the notation E_X[·] indicates the expectation over the values of X.
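The conditional means and the law of iterated expectations can be verified directly from the joint distribution in Table 3; a minimal sketch:

```python
# Sketch: conditional expectations and the law of iterated expectations, Table 3.
import numpy as np

m = np.array([0, 1, 2, 3, 4])
joint = np.array([[0.35, 0.065, 0.05, 0.025, 0.01],   # old computer (A=0)
                  [0.45, 0.035, 0.01, 0.005, 0.00]])  # new computer (A=1)

p_a = joint.sum(axis=1)        # marginal of A: [0.5, 0.5]
cond = joint / p_a[:, None]    # rows are Pr(M|A=0) and Pr(M|A=1)
e_m_given_a = cond @ m         # [0.56, 0.14]

print(e_m_given_a)
print(e_m_given_a @ p_a)       # E(M) = 0.35, matching Table 1
```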

3.4 Conditional Variance

The variance of Y conditional on X is the variance of the conditional distribution of Y given X:

Var(Y|x) = E[(Y − E[Y|x])²|x]

Var(Y|x) = ∫_y (y − E[Y|x])² f(y|x) dy,  if Y is continuous

and

Var(Y|x) = Σ_y (y − E[Y|x])² f(y|x),  if Y is discrete  (30)

3.5 Decomposition of Variance

In a joint distribution,

Var(Y) = Var_X[E(Y|X)] + E_X[Var(Y|X)]  (31)

where the notation Var_X[·] indicates the variance over the distribution of X.

3.6 Independence

Two random variables X and Y are independently distributed, or independent, if knowing the value of one of the variables provides no information about the other. In other words, X and Y are independent if the conditional distribution of Y given X equals the marginal distribution of Y:

Pr(Y = y|X = x) = Pr(Y = y)  (independence of X and Y)  (32)

or

f(y|x) = f(y)  (33)

Substituting equation (33) into equation (25), one can see that the joint distribution of two independent random variables is the product of their marginal distributions:

f(x, y) = f(x)f(y)  (34)

3.7 Covariance and Correlation

Covariance is a measure of the extent to which two random variables move together. The covariance between X and Y is the expected value E[(X − µ_X)(Y − µ_Y)], where µ_X is the mean of X and µ_Y is the mean of Y:

Cov(X, Y) = Σ_x Σ_y (x − µ_X)(y − µ_Y) f(x, y),  if X and Y are discrete

Cov(X, Y) = ∫_x ∫_y (x − µ_X)(y − µ_Y) f(x, y) dy dx,  if X and Y are continuous

To interpret these formulas, suppose that when X is greater than its mean (so that X − µ_X is positive), then Y tends to be greater than its mean (so that Y − µ_Y is positive), and when X is less than its mean (so that X − µ_X < 0), then Y tends to be less than its mean (so that Y − µ_Y < 0). In both cases, the product (X − µ_X)(Y − µ_Y) > 0, so the covariance is positive and we know that X and Y tend to move in the same direction. When the covariance is negative, X and Y tend to move in opposite directions.

If random variables X and Y are independent then, regardless of their joint distribution, σ_XY = 0. Note that the converse is not always true.

Properties of Covariance:

1. Cov(X, X) = Var(X)

2. Cov(X, Y) = Cov(Y, X)

3. Cov(aX, bY) = ab Cov(X, Y)

4. In any bivariate distribution, Cov(X, Y) = Cov(X, E[Y|X])

5. If X and Y are independent then Cov(X, Y) = 0 (no association ⇒ no linear association). However, if Cov(X, Y) = 0 then X and Y are not necessarily independent.

3.8 Correlation

The correlation is an alternative measure of dependence between X and Y. The correlation between X and Y is the covariance between X and Y, divided by the product of their standard deviations:

Corr(X, Y) = ρ_XY = Cov(X, Y) / √(Var(X)Var(Y)) = σ_XY / (σ_X σ_Y)  (35)

The random variables X and Y are said to be uncorrelated if Corr(X, Y) = 0.

Properties of Corr(X, Y) are:

1. −1 ≤ ρ_XY ≤ 1

2. If ρ_XY = −1 then X and Y are perfectly negatively linearly related. That is, Y = aX + b, where a < 0.

3. If ρ_XY = 1 then X and Y are perfectly positively linearly related. That is, Y = aX + b, where a > 0.

4. If ρ_XY = 0 then X and Y are not linearly related, but they may be nonlinearly related.

5. Corr(aX, bY) = Corr(X, Y) if a > 0 and b > 0; Corr(aX, bY) = −Corr(X, Y) if a > 0, b < 0 or a < 0, b > 0.
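Covariance and correlation are mechanical to compute from a discrete joint distribution such as Table 2; a sketch (the helper arrays are ours):

```python
# Sketch: covariance and correlation of X (rain) and Y (commute) from Table 2.
import numpy as np

y = np.array([0, 1])             # long, short commute
x = np.array([0, 1])             # rain, no rain
joint = np.array([[0.15, 0.07],  # rows: Y=0, Y=1; cols: X=0, X=1
                  [0.15, 0.63]])

mu_x = joint.sum(axis=0) @ x     # E(X) = 0.70
mu_y = joint.sum(axis=1) @ y     # E(Y) = 0.78

# Sum (x - mu_x)(y - mu_y) f(x, y) over the four cells.
dev = np.outer(y - mu_y, x - mu_x)
cov = (dev * joint).sum()

var_x = joint.sum(axis=0) @ (x - mu_x)**2
var_y = joint.sum(axis=1) @ (y - mu_y)**2
print(cov, cov / np.sqrt(var_x * var_y))  # 0.084 and approx. 0.44
```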

3.9 The Means and Variances of Sums of Random Variables

Let X, Y, and V be random variables, let µ_X and σ_X² be the mean and variance of X, let σ_XY be the covariance between X and Y (and so forth for the other variables), and let a, b, and c be constants. The following facts follow from the definitions of the mean, variance and covariance:

E(a + bX + cY) = a + bµ_X + cµ_Y  (36)

var(a + bY) = b²σ_Y²

var(aX + bY) = a²σ_X² + 2abσ_XY + b²σ_Y²  (37)

E(Y²) = σ_Y² + µ_Y²

cov(a + bX + cV, Y) = bσ_XY + cσ_VY

E(XY) = σ_XY + µ_X µ_Y
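Formula (37) is easy to spot-check by simulating correlated normal draws; a sketch (the covariance matrix below is an arbitrary choice of ours):

```python
# Sketch: check var(aX + bY) = a^2 var(X) + 2ab cov(X,Y) + b^2 var(Y).
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.5]])  # covariance matrix of (X, Y)
x, y = rng.multivariate_normal([0, 0], sigma, size=1_000_000).T

a, b = 3.0, -2.0
lhs = np.var(a * x + b * y)
rhs = a**2 * sigma[0, 0] + 2 * a * b * sigma[0, 1] + b**2 * sigma[1, 1]
print(lhs, rhs)  # both approx. 16.8
```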

4 The Normal, Chi-Squared, F_{m,∞}, and Student t Distributions

4.1 The Normal Distribution

A continuous random variable with a normal distribution has the familiar bell-shaped density. The general form of the normal density with mean µ_X and standard deviation σ_X is:

f(x|µ_X, σ_X²) = (1/(σ_X√(2π))) exp(−(x − µ_X)²/(2σ_X²))  (38)

This is usually denoted X ∼ N(µ_X, σ_X²). The normal density is symmetric around its mean µ_X and has 95% of its probability between µ_X − 1.96σ_X and µ_X + 1.96σ_X.

Using numerical approximations, it can be shown that:

Pr(µ_X − σ_X < X < µ_X + σ_X) ≈ 0.67

Pr(µ_X − 2σ_X < X < µ_X + 2σ_X) ≈ 0.95

Pr(µ_X − 3σ_X < X < µ_X + 3σ_X) ≈ 0.99

The standard normal distribution is the normal distribution with mean µ = 0 and variance σ² = 1 and is denoted N(0, 1), with density:

φ(z) = (1/√(2π)) exp(−z²/2)  (39)

The specific notation φ(z) is often used for this density and Φ(z) for its cdf.
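The one-, two-, and three-standard-deviation probabilities above can be reproduced from the standard normal cdf; a minimal sketch using scipy:

```python
# Sketch: Pr(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3 via the normal cdf.
from scipy.stats import norm

for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # approx. 0.683, 0.954, 0.997
```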


Figure 2: The pdf of the normal distribution.

4.2 The Log-Normal Distribution

A random variable Y is said to be log-normally distributed with parameters µ and σ² if

ln Y ∼ N(µ, σ²)  (40)

Alternatively, let X ∼ N(µ_X, σ_X²) and define Y = exp(X). Then Y is log-normally distributed, denoted Y ∼ ln N(µ_Y, σ_Y²), and it can be shown that:

µ_Y = E[Y] = exp(µ_X + σ_X²/2)

σ_Y² = var(Y) = exp(2µ_X + σ_X²)(exp(σ_X²) − 1)
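Both moment formulas can be checked by simulation; a sketch (the parameter values are arbitrary):

```python
# Sketch: check E[Y] and var(Y) for Y = exp(X), X ~ N(mu_x, sigma_x^2).
import numpy as np

rng = np.random.default_rng(0)
mu_x, sigma_x = 0.5, 0.8
y = np.exp(rng.normal(mu_x, sigma_x, size=2_000_000))

print(y.mean(), np.exp(mu_x + sigma_x**2 / 2))                            # approx. 2.27
print(y.var(), np.exp(2*mu_x + sigma_x**2) * (np.exp(sigma_x**2) - 1))    # approx. 4.62
```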

4.3 The Multivariate Normal Distribution

The normal distribution can be generalized to describe the joint distribution of a set of random variables. In this case, the distribution is called the multivariate normal distribution or, if only two variables are being considered, the bivariate normal distribution.

The multivariate normal distribution has three important properties:

1. If X and Y have a bivariate normal distribution with covariance σ_XY, and if a and b are two constants, then

aX + bY ∼ N(aµ_X + bµ_Y, a²σ_X² + b²σ_Y² + 2abσ_XY)  (41)

In general, if n random variables have a multivariate normal distribution, then any linear combination of these variables is itself normally distributed.

2. If a set of variables has a multivariate normal distribution, then the marginal distribution of each of the variables is normal.

3. If variables with a multivariate normal distribution have covariances that equal zero, then the variables are independent. Therefore, if X and Y have a bivariate normal distribution and σ_XY = 0, then X and Y are independent.

4.4 The Chi-Squared and F_{m,∞} Distributions

The chi-squared and F_{m,∞} distributions are used when testing certain types of hypotheses in statistics and econometrics.

The chi-squared distribution is the distribution of the sum of m squared independent standard normal random variables. The distribution depends on m, which is called the degrees of freedom. For example, let Z₁, Z₂, Z₃ and Z₄ be independent standard normal random variables. Then Z₁² + Z₂² + Z₃² + Z₄² has a chi-squared distribution with 4 degrees of freedom, denoted χ²_4.


Figure 3: The pdf of the bivariate standard normal distribution.

The pdf of the chi-squared distribution with r degrees of freedom has the following form:

f(x) = (1/(Γ(r/2) 2^{r/2})) x^{r/2−1} e^{−x/2},  0 < x < ∞  (42)

Figure 4: The pdfs of chi-squared distributions with 2, 4, and 6 degrees of freedom.

The F_{m,∞} distribution is the distribution of a random variable with a chi-squared distribution with m degrees of freedom, divided by m. Continuing the previous example, (Z₁² + Z₂² + Z₃² + Z₄²)/4 has an F_{4,∞} distribution.

4.5 The Student t Distribution

The Student t distribution with m degrees of freedom is defined to be the distribution of the ratio of a standard normal random variable to the square root of an independently distributed chi-squared random variable with m degrees of freedom divided by m.

Let Z be a standard normal random variable, i.e. Z ∼ N(0, 1), let W be a random variable with a chi-squared distribution with m degrees of freedom, i.e. W ∼ χ²_m, and let Z and W be independently distributed. Then the random variable Z/√(W/m) has a Student t distribution with m degrees of freedom, denoted t_m.
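The constructions in Sections 4.4 and 4.5 can be illustrated by simulation: sums of squared standard normals behave like chi-squared draws, and the ratio Z/√(W/m) behaves like a Student t. A sketch comparing a few moments with their theoretical values:

```python
# Sketch: build chi-squared(4) and t(4) variates from standard normals
# and compare simulated moments with the theoretical values.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

w = (rng.standard_normal((n, 4))**2).sum(axis=1)  # chi-squared, 4 d.o.f.
print(w.mean(), w.var())            # approx. 4 and 8 (theory: m and 2m)

z = rng.standard_normal(n)          # fresh draws, independent of w
t = z / np.sqrt(w / 4)              # Student t with 4 d.o.f.
print(t.mean(), t.var())            # approx. 0 and 2 (theory: m/(m-2) = 2)
```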

5 Random Sampling and the Distribution of the Sample Average

Suppose that you have selected 20 UNCC students and measured their heights in order to learn something about the average height of students at UNCC.

Simple random sampling is a situation in which n objects are selected at random from a population (here, the population of all UNCC students) and each member of the population is equally likely to be included.

The n observations in the sample are denoted Y₁, ..., Y_n, where Y₁ is the first observation, Y₂ is the second observation, and so forth.

Because the members of the population included in the sample are selected at random, the values of the observations Y₁, ..., Y_n are themselves random. If different members of the population are chosen, their values of Y will differ.

5.1 i.i.d. Draws

Because Y₁, ..., Y_n are randomly drawn from the same population, the marginal distribution of Y_i is the same for each i = 1, ..., n. When Y_i has the same marginal distribution for i = 1, ..., n, then Y₁, ..., Y_n are said to be identically distributed.

When Y₁, ..., Y_n are drawn from the same distribution and are independently distributed, they are said to be independently and identically distributed, or i.i.d.

5.2 The Sampling Distribution of the Sample Average

The sample average, Ȳ, of the n observations Y₁, ..., Y_n is

Ȳ = (1/n) Σ_{i=1}^{n} Y_i  (43)

An essential concept is that the act of drawing a random sample has the effect of making the sample average Ȳ a random variable. Because the sample Y₁, ..., Y_n is random, its average is random, i.e. the average depends on the sample that is realized.

Because Ȳ is random, it has a probability distribution. The distribution of Ȳ is called the sampling distribution of Ȳ.

Suppose that the observations Y₁, ..., Y_n are i.i.d., and let µ_Y and σ_Y² denote the mean and variance of Y_i, i = 1, ..., n. Apply formula (36) to find that the mean of the sample average is:

E(Ȳ) = E((1/n) Σ_{i=1}^{n} Y_i) = (1/n) E(Y₁ + ... + Y_n) = µ_Y  (44)

Apply formula (37) to find the variance of the sample average:

var(Ȳ) = var((1/n) Σ_{i=1}^{n} Y_i) = (1/n²) var(Y₁ + ... + Y_n) = σ_Y²/n  (45)

In summary, the mean, the variance and the standard deviation of Ȳ are:

E(Ȳ) = µ_Y

Var(Ȳ) = σ_Y²/n

std.dev(Ȳ) = σ_Y/√n
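These formulas can be verified by simulating many samples and examining the spread of the resulting sample averages; a sketch (the population parameters are arbitrary):

```python
# Sketch: the sampling distribution of the sample average Y-bar.
import numpy as np

rng = np.random.default_rng(0)
mu_y, sigma_y, n = 10.0, 3.0, 25

# 100,000 samples of size n; one sample average per row.
ybar = rng.normal(mu_y, sigma_y, size=(100_000, n)).mean(axis=1)

print(ybar.mean())                         # approx. mu_y = 10
print(ybar.std(), sigma_y / np.sqrt(n))    # both approx. 0.6
```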

Suppose that Y₁, ..., Y_n are i.i.d. draws from the N(µ_Y, σ_Y²) distribution. By the property of the multivariate normal distribution in (41), the sum of n normally distributed random variables is itself normally distributed. Therefore, the sample average is normally distributed with mean µ_Y and variance σ_Y²/n, i.e. Ȳ ∼ N(µ_Y, σ_Y²/n).

There are two approaches to characterizing sampling distributions: an "exact" approach and an "approximate" approach.

The "exact" approach entails deriving a formula for the sampling distribution that holds exactly for any value of n. The sampling distribution that describes the distribution of Ȳ for any n is called the exact distribution or finite-sample distribution of Ȳ.

The "approximate" approach uses approximations to the sampling distribution that rely on the sample size being large. The large-sample approximation to the sampling distribution is often called the asymptotic distribution. The term "asymptotic" refers to the fact that the approximations become exact in the limit n → ∞.

There are two key tools used to approximate sampling distributions when the sample size is large:

1. The law of large numbers

2. The central limit theorem

5.3 The Law of Large Numbers and Consistency

The law of large numbers states that Ȳ will be near µ_Y with very high probability when n is large. The property that Ȳ is near µ_Y with increasing probability as n increases is called convergence in probability or consistency, written Ȳ →p µ_Y.

The law of large numbers says that if Y_i, i = 1, ..., n are independently and identically distributed with E(Y_i) = µ_Y and var(Y_i) = σ_Y² < ∞, then Ȳ →p µ_Y.

Formally, the random variable Z_n converges in probability to a constant c if lim_{n→∞} Pr(|Z_n − c| > ε) = 0 for any positive ε.
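The law of large numbers is easy to see in simulation: the fraction of sample averages farther than ε from µ_Y shrinks as n grows. A sketch using uniform draws (so µ_Y = 0.5; the band width ε is our arbitrary choice):

```python
# Sketch: Pr(|Y-bar - mu| > eps) shrinks as the sample size n grows.
import numpy as np

rng = np.random.default_rng(0)
mu, eps = 0.5, 0.05  # draws are Uniform(0, 1), so mu_Y = 0.5

for n in (10, 100, 1000, 10000):
    ybar = rng.uniform(size=(2_000, n)).mean(axis=1)
    print(n, np.mean(np.abs(ybar - mu) > eps))  # fraction outside the band
```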

5.4 The Central Limit Theorem

The central limit theorem says that, under general conditions, the distribution of Ȳ is well approximated by a normal distribution when n is large.

Central Limit Theorem. If Y₁, ..., Y_n are a random sample from a probability distribution with finite mean µ_Y and finite variance σ_Y², and Ȳ = (1/n) Σ_{i=1}^{n} Y_i, then

√n (Ȳ − µ_Y) →d N(0, σ_Y²)  (46)
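A simulation illustrates (46): standardized sample averages of a decidedly non-normal (exponential) distribution behave like a standard normal when n is large. A sketch:

```python
# Sketch: sqrt(n)(Y-bar - mu)/sigma is approximately N(0, 1) for large n,
# even though the underlying exponential draws are strongly skewed.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 500        # Exponential(1): mean 1, std 1

ybar = rng.exponential(size=(20_000, n)).mean(axis=1)
z = np.sqrt(n) * (ybar - mu) / sigma

print(z.mean(), z.std())            # approx. 0 and 1
print(np.mean(np.abs(z) < 1.96))    # approx. 0.95, as for a standard normal
```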
