Special continuous random variables
1. Uniform distribution
2. Normal probability distributions
A RANDOM VARIABLE X WHOSE DISTRIBUTION
HAS THE SHAPE OF A NORMAL CURVE IS CALLED
A NORMAL RANDOM VARIABLE.
This random variable X is said to be normally distributed with
mean μ and standard deviation σ if its probability density function is
given by

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty
PROPERTIES OF A
NORMAL DISTRIBUTION
The normal curve is symmetrical about the mean μ;
The mean is at the middle and divides the area into halves;
The total area under the curve is equal to 1;
It is completely determined by its mean μ and standard
deviation σ (or variance σ²).
Note:
In a normal distribution, only 2 parameters are needed,
namely μ and σ².
AREA UNDER THE NORMAL
CURVE USING INTEGRATION
The probability that a continuous normal variable X falls in a
particular interval [a, b] is the area under the curve bounded
by x = a and x = b and is given by

P(a \le X \le b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx

and the area depends upon the values of μ and σ.
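As a quick check, the integral can be evaluated numerically. Below is a minimal Python sketch; the parameters μ = 5, σ = 2 and the interval [4, 7] are arbitrary values chosen only for illustration:

```python
import math
from scipy import integrate, stats

mu, sigma = 5.0, 2.0   # example parameters, chosen arbitrarily
a, b = 4.0, 7.0        # example interval, chosen arbitrarily

def f(x):
    """Normal probability density function with mean mu and sd sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Numerically integrate the density over [a, b] ...
area, _ = integrate.quad(f, a, b)

# ... and compare with the CDF difference, which computes the same area.
print(area)
print(stats.norm.cdf(b, loc=mu, scale=sigma) - stats.norm.cdf(a, loc=mu, scale=sigma))
```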
THE STANDARD NORMAL
DISTRIBUTION
It makes life a lot easier for us if we standardize our normal curve, with
a mean of zero and a standard deviation of 1 unit.
If we have the standardized situation of μ = 0 and σ = 1, then we
have

f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}
We can transform all the observations of any normal random
variable X with mean μ and variance σ² to a new set of observations of
another normal random variable Z with mean 0 and variance 1 using
the following transformation:

Z = \frac{X - \mu}{\sigma}
EXAMPLE
Say μ=2 and σ=1/3 in a normal distribution.
The graph of the normal distribution is as follows:
[graph: normal density curve with μ = 2 and σ = 1/3]
The following graph (that we also saw earlier) represents the same information, but it has been standardized so that μ = 0 and σ = 1 (with the above graph superimposed for comparison):
[graph: standard normal density with the curve above superimposed]
The two graphs have different μ and σ, but have the same area.
The new distribution of the normal random variable Z with mean 0 and variance 1 (or standard deviation 1) is called a standard normal distribution. Standardizing the distribution like this makes it much easier to calculate probabilities.
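For instance, a short SciPy sketch using the example values μ = 2 and σ = 1/3 from above (the point x = 2.5 is arbitrary) shows that the direct probability matches the standard normal probability of the z-score:

```python
from scipy.stats import norm

mu, sigma = 2.0, 1 / 3   # values from the example above
x = 2.5                  # arbitrary point of interest

# Direct probability under N(mu, sigma^2) ...
p_direct = norm.cdf(x, loc=mu, scale=sigma)

# ... equals the standard normal probability of the z-score.
z = (x - mu) / sigma     # Z = (X - mu) / sigma
p_standardized = norm.cdf(z)

print(p_direct, p_standardized)   # both ~= 0.9332
```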
MEAN
The mean is the central tendency of the distribution. It defines the
location of the peak for normal distributions. Most values cluster
around the mean. On a graph, changing the mean shifts the entire
curve left or right on the X-axis.
STANDARD DEVIATION
The standard deviation is a measure of variability. It defines the
width of the normal distribution. The standard deviation
determines how far away from the mean the values tend to fall. It
represents the typical distance between the observations and the
average.
Unfortunately, population parameters are usually unknown
because it’s generally impossible to measure an entire population.
However, you can use random samples to calculate estimates of
these parameters.
Statisticians represent sample estimates of these parameters
using x̅ for the sample mean and s for the sample standard
deviation.
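For instance, a minimal sketch of computing x̅ and s in Python; the sample values are made up for illustration:

```python
import numpy as np

# Hypothetical random sample; the values are made up for illustration.
sample = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])

x_bar = sample.mean()        # sample mean, the estimate of mu
s = sample.std(ddof=1)       # sample standard deviation, the estimate of sigma
                             # (ddof=1 uses the n - 1 denominator)
print(x_bar, s)
```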
POPULATION
In statistics, a population is the complete set of all objects or
people of interest. Typically, studies define their population of
interest at the outset. Populations can have a finite, though
potentially very large, size.
For example,
All valves produced by a specific manufacturing plant
All adult females in Ukraine
All smokers
Populations can also have an infinite size. For example, the set of
all possible results of a sequence of trials, such as flips of a coin,
is an infinite population.
COMMON PROPERTIES FOR ALL
FORMS OF THE NORMAL
DISTRIBUTION
They’re all symmetric. The normal distribution cannot
model skewed distributions.
The mean, median, and mode are all equal.
Half of the population is less than the mean and half is greater than
the mean.
The Empirical Rule allows you to determine the proportion of
values that fall within certain distances from the mean.
MEDIAN
The median is the middle of the data. Half of the observations are less
than or equal to it and half of the observations are greater than or
equal to it. The median is equivalent to the second quartile or the 50th
percentile.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the
median apple weight is 6 because it is the middle value. If there is an
even number of observations, you take the average of the two middle
values.
MODE
The mode is the value that occurs most frequently in a set of
observations. You can find the mode simply by counting the
number of times each value occurs in a data set.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the
apple weight mode is 5 because it is the most frequent value.
Identifying the mode can help you understand your distribution.
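Both statistics are one-liners in Python's standard library; a sketch using the apple weights from the examples above:

```python
import statistics

weights = [5, 5, 6, 7, 8]             # apple weights from the examples above

print(statistics.median(weights))     # 6: the middle value
print(statistics.multimode(weights))  # [5]: the most frequent value(s)
```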
THE EMPIRICAL RULE FOR THE
NORMAL DISTRIBUTION
When you have normally distributed data, the standard deviation becomes
particularly valuable. You can use it to determine the proportion of the values
that fall within a specified number of standard deviations from the mean. For
example, in a normal distribution, 68% of the observations fall within +/- 1
standard deviation from the mean. This property is part of the Empirical Rule,
which describes the percentage of the data that fall within specific numbers of
standard deviations from the mean for bell-shaped curves.
Mean ± standard deviations    Percentage of data contained
1                             68%
2                             95%
3                             99.7%
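These percentages follow directly from the standard normal CDF, as this quick SciPy check illustrates:

```python
from scipy.stats import norm

# Probability within k standard deviations of the mean, from the
# standard normal CDF.
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within +/- {k} sd: {p:.1%}")   # 68.3%, 95.4%, 99.7%
```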
RANGE
Let’s start with the range because it is the most straightforward
measure of variability to calculate and the simplest to understand.
The range of a dataset is the difference between the largest and
smallest values in that dataset. For example, in the two datasets
below, dataset 1 has a range of 38 − 20 = 18, while dataset 2 has
a range of 52 − 11 = 41. Dataset 2 has a broader range and,
hence, more variability than dataset 1.
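A sketch of the computation in Python; the two datasets are hypothetical stand-ins that match the stated minima and maxima:

```python
# Hypothetical datasets matching the minima and maxima quoted above.
dataset1 = [20, 24, 27, 31, 38]
dataset2 = [11, 19, 30, 44, 52]

for data in (dataset1, dataset2):
    print(max(data) - min(data))   # ranges: 18 and 41
```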
THE INTERQUARTILE RANGE (IQR) . . .
AND OTHER PERCENTILES
The interquartile range is the middle half of the data. To visualize
it, think about the median value that splits the dataset in half.
Similarly, you can divide the data into quarters. Statisticians refer
to these quarters as quartiles and denote them from low to high
as Q1, Q2, and Q3. The lowest quartile (Q1) contains the quarter
of the dataset with the smallest values. The upper quartile (Q3)
contains the quarter of the dataset with the highest values. The
interquartile range is the middle half of the data that is in between
the upper and lower quartiles. In other words, the interquartile
range includes the 50% of data points that fall between Q1 and
Q3.
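A sketch of computing the quartiles and the IQR with NumPy, on made-up data:

```python
import numpy as np

# Hypothetical dataset, made up for illustration.
data = np.array([11, 15, 20, 24, 28, 33, 39, 45, 52])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles are the 25th,
iqr = q3 - q1                                   # 50th, and 75th percentiles
print(q1, q2, q3, iqr)
```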
Suppose that Mr. N is one of the company's clients, and exactly 20% of the
clients are older than Mr. N. How old is Mr. N?
Since 20% of the clients are older than Mr. N, the age of Mr. N is the 80th
percentile of the r.v. X. Therefore, if we use c to denote the age of Mr. N,
we have that F(c) = 0.80. Standardizing, Φ((c − μ)/σ) = 0.80, so
c = μ + σ·z_{0.80}, where z_{0.80} ≈ 0.8416 is the 80th percentile of the
standard normal distribution. Then:
Therefore, Mr. N is 53.42 years old.
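The transcript does not show the parameters of X, but assuming (for illustration only) that X ~ N(45, 10²), which reproduces the stated answer, the percentile can be found with SciPy's inverse CDF:

```python
from scipy.stats import norm

# Assumed parameters: the transcript does not show mu and sigma for X,
# but mu = 45 and sigma = 10 reproduce the stated answer.
mu, sigma = 45.0, 10.0

c = norm.ppf(0.80, loc=mu, scale=sigma)   # inverse CDF: solves F(c) = 0.80
print(round(c, 2))                        # 53.42
```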
WHY THE NORMAL
DISTRIBUTION IS IMPORTANT
Some statistical hypothesis tests assume that the data follow a
normal distribution. However, there’s more to it than only whether
the data are normally distributed.
Linear and nonlinear regression both assume that
the residuals follow a normal distribution.
The central limit theorem states that as the sample size increases,
the sampling distribution of the mean follows a normal distribution
even when the underlying distribution of the original variable is
non-normal.
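A small simulation illustrates the theorem: sample means of a strongly skewed (exponential) variable become progressively less skewed, i.e. closer to normal, as the sample size n grows. All settings below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed source distribution: exponential with mean 1.
for n in (2, 10, 50):
    # 10,000 sample means, each computed from a sample of size n.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness shrinks toward 0 (the normal value) as n grows.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n = {n:>2}: skewness of the sample means = {skew:.2f}")
```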
Parametric tests of means                              Nonparametric tests of medians
1-sample t-test                                        1-sample Sign, 1-sample Wilcoxon
2-sample t-test                                        Mann-Whitney test
One-Way ANOVA                                          Kruskal-Wallis, Mood's median test
Factorial DOE with a factor and a blocking variable    Friedman test
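For reference, each pairing in the table has a counterpart in scipy.stats; a short sketch on made-up data for the second row:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=30)   # hypothetical sample
group_b = rng.normal(11.0, 2.0, size=30)   # hypothetical sample

# A parametric test of means and its nonparametric counterpart.
print(stats.ttest_ind(group_a, group_b))      # 2-sample t-test
print(stats.mannwhitneyu(group_a, group_b))   # Mann-Whitney test
```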
ADVANTAGES OF
PARAMETRIC TESTS
Advantage 1: Parametric tests can provide trustworthy results with
distributions that are skewed and nonnormal
Many people aren’t aware of this fact, but parametric analyses
can produce reliable results even when your continuous data are
nonnormally distributed. You just have to be sure that your sample
size meets the requirements for each analysis in the table below.
Simulation studies have identified these requirements.
Parametric analyses    Sample size requirements for nonnormal data
1-sample t-test        Greater than 20 observations
2-sample t-test        Each group should have more than 15 observations
One-Way ANOVA          For 2-9 groups, each group should have more than 15 observations;
                       for 10-12 groups, each group should have more than 20 observations
ADVANTAGE 2: PARAMETRIC TESTS CAN
PROVIDE TRUSTWORTHY RESULTS WHEN THE
GROUPS HAVE DIFFERENT AMOUNTS OF
VARIABILITY
It’s true that nonparametric tests don’t require data that are
normally distributed. However, nonparametric tests have the
disadvantage of an additional requirement that can be very hard
to satisfy. The groups in a nonparametric analysis typically must
all have the same variability (dispersion). Nonparametric analyses
might not provide accurate results when variability differs
between groups.
Conversely, parametric analyses, like the 2-sample t-test or one-
way ANOVA, allow you to analyze groups that have unequal
variances. In most statistical software, it’s as easy as checking the
correct box! You don’t have to worry about groups having different
amounts of variability when you use a parametric analysis.
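In SciPy, for example, that "checkbox" is the equal_var flag of ttest_ind; setting it to False runs Welch's t-test. The data below are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
low_var = rng.normal(10.0, 1.0, size=25)    # hypothetical group, sd = 1
high_var = rng.normal(10.5, 4.0, size=25)   # hypothetical group, sd = 4

# equal_var=False selects Welch's t-test, which does not assume that the
# two groups share the same variance.
print(stats.ttest_ind(low_var, high_var, equal_var=False))
```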
ADVANTAGE 3: PARAMETRIC TESTS
HAVE GREATER STATISTICAL POWER
In most cases, parametric tests have more power. If
an effect actually exists, a parametric analysis is more likely to
detect it.
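A rough simulation sketch of this claim, under arbitrary settings (normal data with a true shift of 0.8 standard deviations, α = 0.05): count how often each test detects the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, trials, n = 0.05, 2_000, 20
hits_t = hits_mw = 0

for _ in range(trials):
    a = rng.normal(0.0, 1.0, size=n)   # control group
    b = rng.normal(0.8, 1.0, size=n)   # shifted group: a real effect exists
    hits_t += stats.ttest_ind(a, b).pvalue < alpha
    hits_mw += stats.mannwhitneyu(a, b).pvalue < alpha

# The t-test rejects more often, i.e. it has higher power on these data.
print(f"t-test power: {hits_t / trials:.2f}")
print(f"Mann-Whitney power: {hits_mw / trials:.2f}")
```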