Special continuous random variables
1. Uniform distribution
2. Normal probability distributions
A RANDOM VARIABLE X WHOSE DISTRIBUTION
HAS THE SHAPE OF A NORMAL CURVE IS CALLED
A NORMAL RANDOM VARIABLE.
This random variable X is said to be normally distributed with
mean μ and standard deviation σ if its probability density function is
given by

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty
PROPERTIES OF A
NORMAL DISTRIBUTION
The normal curve is symmetrical about the mean μ;
The mean is at the middle and divides the area into halves;
The total area under the curve is equal to 1;
It is completely determined by its mean μ and standard
deviation σ (or variance σ²).
Note:
In a normal distribution, only 2 parameters are needed,
namely μ and σ².
AREA UNDER THE NORMAL
CURVE USING INTEGRATION
The probability that a continuous normal variable X falls in a
particular interval [a, b] is the area under the curve bounded
by x = a and x = b and is given by

P(a \le X \le b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx

and the area depends upon the values of μ and σ.
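As a quick check, the integral can be evaluated numerically. Below is a minimal Python sketch; the parameters μ = 5, σ = 2 and the interval [4, 7] are arbitrary values chosen only for illustration:

```python
import math
from scipy import integrate, stats

mu, sigma = 5.0, 2.0   # example parameters, chosen arbitrarily
a, b = 4.0, 7.0        # example interval, chosen arbitrarily

def f(x):
    """Normal probability density function with mean mu and sd sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Numerically integrate the density over [a, b] ...
area, _ = integrate.quad(f, a, b)

# ... and compare with the CDF difference, which computes the same area.
print(area)
print(stats.norm.cdf(b, loc=mu, scale=sigma) - stats.norm.cdf(a, loc=mu, scale=sigma))
```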
THE STANDARD NORMAL
DISTRIBUTION
It makes life a lot easier for us if we standardize our normal curve, with
a mean of zero and a standard deviation of 1 unit.
If we have the standardized situation of μ = 0 and σ = 1, then we
have

f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}
We can transform all the observations of any normal random
variable X with mean μ and variance σ² to a new set of observations of
another normal random variable Z with mean 0 and variance 1 using
the following transformation:

Z = \frac{X - \mu}{\sigma}
EXAMPLE
Say μ=2 and σ=1/3 in a normal distribution.
The graph of the normal distribution is as follows:
[graph: normal density curve with μ = 2 and σ = 1/3]
The following graph (that we also saw earlier) represents the same information, but it has been standardized so that μ = 0 and σ = 1 (with the above graph superimposed for comparison):
[graph: standard normal density with the curve above superimposed]
The two graphs have different μ and σ, but have the same area.
The new distribution of the normal random variable Z with mean 0 and variance 1 (or standard deviation 1) is called a standard normal distribution. Standardizing the distribution like this makes it much easier to calculate probabilities.
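For instance, a short SciPy sketch using the example values μ = 2 and σ = 1/3 from above (the point x = 2.5 is arbitrary) shows that the direct probability matches the standard normal probability of the z-score:

```python
from scipy.stats import norm

mu, sigma = 2.0, 1 / 3   # values from the example above
x = 2.5                  # arbitrary point of interest

# Direct probability under N(mu, sigma^2) ...
p_direct = norm.cdf(x, loc=mu, scale=sigma)

# ... equals the standard normal probability of the z-score.
z = (x - mu) / sigma     # Z = (X - mu) / sigma
p_standardized = norm.cdf(z)

print(p_direct, p_standardized)   # both ~= 0.9332
```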
MEAN
The mean is the central tendency of the distribution. It defines the
location of the peak for normal distributions. Most values cluster
around the mean. On a graph, changing the mean shifts the entire
curve left or right on the X-axis.
STANDARD DEVIATION
The standard deviation is a measure of variability. It defines the
width of the normal distribution. The standard deviation
determines how far away from the mean the values tend to fall. It
represents the typical distance between the observations and the
average.
Unfortunately, population parameters are usually unknown
because it’s generally impossible to measure an entire population.
However, you can use random samples to calculate estimates of
these parameters.
Statisticians represent sample estimates of these parameters
using x̅ for the sample mean and s for the sample standard
deviation.
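For instance, a minimal sketch of computing x̅ and s in Python; the sample values are made up for illustration:

```python
import numpy as np

# Hypothetical random sample; the values are made up for illustration.
sample = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])

x_bar = sample.mean()        # sample mean, the estimate of mu
s = sample.std(ddof=1)       # sample standard deviation, the estimate of sigma
                             # (ddof=1 uses the n - 1 denominator)
print(x_bar, s)
```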
POPULATION
In statistics, a population is the complete set of all objects or
people of interest. Typically, studies define their population of
interest at the outset. Populations can have a finite, though
potentially very large, size.
For example,
All valves produced by a specific manufacturing plant
All adult females in Ukraine
All smokers
Populations can also have an infinite size. For example, the set of
all possible results of a sequence of trials, such as flips of a coin,
is an infinite population.
COMMON PROPERTIES FOR ALL
FORMS OF THE NORMAL
DISTRIBUTION
They’re all symmetric. The normal distribution cannot
model skewed distributions.
The mean, median, and mode are all equal.
Half of the population is less than the mean and half is greater than
the mean.
The Empirical Rule allows you to determine the proportion of
values that fall within certain distances from the mean.
MEDIAN
The median is the middle of the data. Half of the observations are less
than or equal to it and half of the observations are greater than or
equal to it. The median is equivalent to the second quartile or the 50th
percentile.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the
median apple weight is 6 because it is the middle value. If there is an
even number of observations, you take the average of the two middle
values.
MODE
The mode is the value that occurs most frequently in a set of
observations. You can find the mode simply by counting the
number of times each value occurs in a data set.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the
apple weight mode is 5 because it is the most frequent value.
Identifying the mode can help you understand your distribution.
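Both statistics are one-liners in Python's standard library; a sketch using the apple weights from the examples above:

```python
import statistics

weights = [5, 5, 6, 7, 8]             # apple weights from the examples above

print(statistics.median(weights))     # 6: the middle value
print(statistics.multimode(weights))  # [5]: the most frequent value(s)
```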
THE EMPIRICAL RULE FOR THE
NORMAL DISTRIBUTION
When you have normally distributed data, the standard deviation becomes
particularly valuable. You can use it to determine the proportion of the values
that fall within a specified number of standard deviations from the mean. For
example, in a normal distribution, 68% of the observations fall within +/- 1
standard deviation from the mean. This property is part of the Empirical Rule,
which describes the percentage of the data that fall within specific numbers of
standard deviations from the mean for bell-shaped curves.
Mean ± standard deviations    Percentage of data contained
1                             68%
2                             95%
3                             99.7%
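These percentages follow directly from the standard normal CDF, as this quick SciPy check illustrates:

```python
from scipy.stats import norm

# Probability within k standard deviations of the mean, from the
# standard normal CDF.
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within +/- {k} sd: {p:.1%}")   # 68.3%, 95.4%, 99.7%
```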
RANGE
Let’s start with the range because it is the most straightforward
measure of variability to calculate and the simplest to understand.
The range of a dataset is the difference between the largest and
smallest values in that dataset. For example, in the two datasets
below, dataset 1 has a range of 38 − 20 = 18, while dataset 2 has
a range of 52 − 11 = 41. Dataset 2 has a broader range and,
hence, more variability than dataset 1.
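A sketch of the computation in Python; the two datasets are hypothetical stand-ins that match the stated minima and maxima:

```python
# Hypothetical datasets matching the minima and maxima quoted above.
dataset1 = [20, 24, 27, 31, 38]
dataset2 = [11, 19, 30, 44, 52]

for data in (dataset1, dataset2):
    print(max(data) - min(data))   # ranges: 18 and 41
```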
THE INTERQUARTILE RANGE (IQR) . . .
AND OTHER PERCENTILES
The interquartile range is the middle half of the data. To visualize
it, think about the median value that splits the dataset in half.
Similarly, you can divide the data into quarters. Statisticians refer
to these quarters as quartiles and denote them from low to high
as Q1, Q2, and Q3. The lowest quartile (Q1) contains the quarter
of the dataset with the smallest values. The upper quartile (Q3)
contains the quarter of the dataset with the highest values. The
interquartile range is the middle half of the data that is in between
the upper and lower quartiles. In other words, the interquartile
range includes the 50% of data points that fall between Q1 and
Q3.
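A sketch of computing the quartiles and the IQR with NumPy, on made-up data:

```python
import numpy as np

# Hypothetical dataset, made up for illustration.
data = np.array([11, 15, 20, 24, 28, 33, 39, 45, 52])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles are the 25th,
iqr = q3 - q1                                   # 50th, and 75th percentiles
print(q1, q2, q3, iqr)
```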
Suppose that Mr. N is one of the company's clients, and exactly 20% of the
clients are older than Mr. N. How old is Mr. N?
Since 20% of the clients are older than Mr. N, the age of Mr. N is the 80th
percentile of the r.v. X. Therefore, if we use c to denote the age of Mr. N,
we have that F(c) = 0.80. Standardizing, Φ((c − μ)/σ) = 0.80, so
c = μ + σ·z_{0.80}, where z_{0.80} ≈ 0.8416 is the 80th percentile of the
standard normal distribution. Then:
Therefore, Mr. N is 53.42 years old.
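The transcript does not show the parameters of X, but assuming (for illustration only) that X ~ N(45, 10²), which reproduces the stated answer, the percentile can be found with SciPy's inverse CDF:

```python
from scipy.stats import norm

# Assumed parameters: the transcript does not show mu and sigma for X,
# but mu = 45 and sigma = 10 reproduce the stated answer.
mu, sigma = 45.0, 10.0

c = norm.ppf(0.80, loc=mu, scale=sigma)   # inverse CDF: solves F(c) = 0.80
print(round(c, 2))                        # 53.42
```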
WHY THE NORMAL
DISTRIBUTION IS IMPORTANT
Some statistical hypothesis tests assume that the data follow a
normal distribution. However, there’s more to it than only whether
the data are normally distributed.
Linear and nonlinear regression both assume that
the residuals follow a normal distribution.
The central limit theorem states that as the sample size increases,
the sampling distribution of the mean follows a normal distribution
even when the underlying distribution of the original variable is
non-normal.
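A small simulation illustrates the theorem: sample means of a strongly skewed (exponential) variable become progressively less skewed, i.e. closer to normal, as the sample size n grows. All settings below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed source distribution: exponential with mean 1.
for n in (2, 10, 50):
    # 10,000 sample means, each computed from a sample of size n.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness shrinks toward 0 (the normal value) as n grows.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n = {n:>2}: skewness of the sample means = {skew:.2f}")
```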
Parametric tests of means                              Nonparametric tests of medians
1-sample t-test                                        1-sample Sign, 1-sample Wilcoxon
2-sample t-test                                        Mann-Whitney test
One-Way ANOVA                                          Kruskal-Wallis, Mood's median test
Factorial DOE with a factor and a blocking variable    Friedman test
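For reference, each pairing in the table has a counterpart in scipy.stats; a short sketch on made-up data for the second row:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=30)   # hypothetical sample
group_b = rng.normal(11.0, 2.0, size=30)   # hypothetical sample

# A parametric test of means and its nonparametric counterpart.
print(stats.ttest_ind(group_a, group_b))      # 2-sample t-test
print(stats.mannwhitneyu(group_a, group_b))   # Mann-Whitney test
```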
ADVANTAGES OF
PARAMETRIC TESTS
Advantage 1: Parametric tests can provide trustworthy results with
distributions that are skewed and nonnormal
Many people aren’t aware of this fact, but parametric analyses
can produce reliable results even when your continuous data are
nonnormally distributed. You just have to be sure that your sample
size meets the requirements for each analysis in the table below.
Simulation studies have identified these requirements.
Parametric analyses    Sample size requirements for nonnormal data
1-sample t-test        Greater than 20 observations
2-sample t-test        Each group should have more than 15 observations
One-Way ANOVA          For 2-9 groups, each group should have more than 15 observations;
                       for 10-12 groups, each group should have more than 20 observations
ADVANTAGE 2: PARAMETRIC TESTS CAN
PROVIDE TRUSTWORTHY RESULTS WHEN THE
GROUPS HAVE DIFFERENT AMOUNTS OF
VARIABILITY
It’s true that nonparametric tests don’t require data that are
normally distributed. However, nonparametric tests have the
disadvantage of an additional requirement that can be very hard
to satisfy. The groups in a nonparametric analysis typically must
all have the same variability (dispersion). Nonparametric analyses
might not provide accurate results when variability differs
between groups.
Conversely, parametric analyses, like the 2-sample t-test or one-
way ANOVA, allow you to analyze groups that have unequal
variances. In most statistical software, it’s as easy as checking the
correct box! You don’t have to worry about groups having different
amounts of variability when you use a parametric analysis.
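In SciPy, for example, that "checkbox" is the equal_var flag of ttest_ind; setting it to False runs Welch's t-test. The data below are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
low_var = rng.normal(10.0, 1.0, size=25)    # hypothetical group, sd = 1
high_var = rng.normal(10.5, 4.0, size=25)   # hypothetical group, sd = 4

# equal_var=False selects Welch's t-test, which does not assume that the
# two groups share the same variance.
print(stats.ttest_ind(low_var, high_var, equal_var=False))
```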
ADVANTAGE 3: PARAMETRIC TESTS
HAVE GREATER STATISTICAL POWER
In most cases, parametric tests have more power. If
an effect actually exists, a parametric analysis is more likely to
detect it.
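A rough simulation sketch of this claim, under arbitrary settings (normal data with a true shift of 0.8 standard deviations, α = 0.05): count how often each test detects the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, trials, n = 0.05, 2_000, 20
hits_t = hits_mw = 0

for _ in range(trials):
    a = rng.normal(0.0, 1.0, size=n)   # control group
    b = rng.normal(0.8, 1.0, size=n)   # shifted group: a real effect exists
    hits_t += stats.ttest_ind(a, b).pvalue < alpha
    hits_mw += stats.mannwhitneyu(a, b).pvalue < alpha

# The t-test rejects more often, i.e. it has higher power on these data.
print(f"t-test power: {hits_t / trials:.2f}")
print(f"Mann-Whitney power: {hits_mw / trials:.2f}")
```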