2/22/2009
Chapter 3
Section 3-2: Measures of Center
Section 3-3: Measures of Variation
Section 3-4: Measures of Relative Standing
Section 3-5: Exploratory Data Analysis
Describing Distributions with Numbers
• The overall pattern of the distribution of a quantitative
variable is described by its shape, center, and spread.
• By inspecting the histogram we can describe the shape
of the distribution, but as we saw, we can only get a
rough estimate for the center and spread.
• We need a more precise numerical description of the
center and spread of the distribution.
Center and Spread
About the same center
Wide spread
Narrow spread
Measures of Center and Spread
CENTER
• Mean
• Median
• Mode
SPREAD
• Range
• Inter-quartile range
(IQR)
• Variance/Standard
deviation
Mean: the balance point
• Mean ("x-bar", $\bar{x}$): the sum of the observations divided by the
number of observations.
• If the n observations are $x_1, x_2, x_3, \ldots, x_n$, then

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example: Mean
$x_1 = 5,\ x_2 = 7,\ x_3 = 3,\ x_4 = 38,\ x_5 = 7$
The mean is

$$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{n} = \frac{\sum_{i=1}^{5} x_i}{n} = \frac{5 + 7 + 3 + 38 + 7}{5} = 12 \text{ hours}$$
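The calculation above can be checked with a few lines of Python. This is only a sketch; the function name `mean` and the variable `hours` are illustrative and do not come from the slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


hours = [5, 7, 3, 38, 7]   # the five observations from the example
print(mean(hours))         # 12.0
```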
Median
• The median M is the midpoint of the
distribution (like the median strip in a road)
• It is the number such that
half of the observations
fall above and half fall below.
How to find the median?
1. Order the data from smallest to largest.
2. If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.
3. If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
Example: Median
If the data are $x_1 = 5,\ x_2 = 7,\ x_3 = 3,\ x_4 = 38,\ x_5 = 7$
• Step 1: order the data:
3 5 7 7 38
Median = 7
Example 2: Median
If the data are $x_1 = 5,\ x_2 = 7,\ x_3 = 3,\ x_4 = 38$
• Step 1: order the data:
3 5 7 38

$$\text{Median} = \frac{5 + 7}{2} = 6$$
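The odd/even rule for the median can be sketched in Python as follows. The function name `median` is illustrative; list positions in Python start at 0, so the (n+1)/2 spot of the ordered list is index n // 2.

```python
def median(values):
    """Median by the rule above: order the data, then take the center."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single center observation
        return ordered[mid]
    # n even: the mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 7, 3, 38, 7]))  # 7
print(median([5, 7, 3, 38]))     # 6.0
```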
Example
Data set A:
64 65 66 68 70 71 73
• median is 68
• mean is 68.1
Data set B:
64 65 66 68 70 71 730 (730 is an outlier)
• median is still 68
• mean is 162
The mean is very sensitive to outliers,
while the median is resistant to outliers.
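A quick way to see this sensitivity is to compute both measures for data sets A and B. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

data_a = [64, 65, 66, 68, 70, 71, 73]
data_b = [64, 65, 66, 68, 70, 71, 730]  # same data, last value replaced by an outlier

for name, data in (("A", data_a), ("B", data_b)):
    # The mean uses every value, so the outlier drags it upward;
    # the median depends only on the middle of the ordered list.
    print(name, "mean:", round(mean(data), 1), "median:", median(data))
```

Only the mean changes between the two runs; the median stays at 68.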
Comparing the mean and the median
Mean
• describes the center as
an average value, where
the actual values of the
data points play an
important role
• Sensitive to outliers
Median
• locates the middle value
as the center, and the
order of the data is the
key to finding it.
• Not sensitive to outliers
Symmetric distributions with no outliers
Mean ≈ Median
Left-skewed distributions
Mean < Median
The tail "pulls" the mean.
Right-skewed distributions
Median < Mean
(Example: Median = 68, Mean = 72)
The tail "pulls" the mean.
Which measure of center to use?
• We will therefore use the mean as a measure
of center for symmetric distributions with no
outliers.
• Otherwise, the median will be a more
appropriate measure of the center of our
data.
Another measure of center: Mode
• Mode is the most frequent value in the data set
• Data: 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15
Mode = 4
• Data: 2, 4, 4, 4, 5, 5, 5, 6, 7, 8, 10, 15
Mode = 4, 5
• Data: 2, 4, 5, 6, 7, 8, 10, 15
No mode
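The three cases above (one mode, two modes, no mode) can be sketched in Python. The function name `modes` is illustrative, and returning an empty list when every value occurs only once is a convention chosen here to match the "No mode" case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: "no mode"
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]))     # [4]
print(modes([2, 4, 4, 4, 5, 5, 5, 6, 7, 8, 10, 15]))  # [4, 5]
print(modes([2, 4, 5, 6, 7, 8, 10, 15]))              # []
```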
Summary: Measures of Center
• The three main numerical measures for the center of a
distribution are the mean $\bar{x}$, the median M, and the mode.
• The mean is the average value, while the median is the
middle value. The mode is the most frequent value.
• The mean is very sensitive to outliers, while the median is
resistant to outliers.
• The mean is an appropriate measure of center only for
symmetric distributions with no outliers. In all other cases,
the median should be used to describe the center of the
distribution.
Spread
• Spread: how far from the center the data tend
to range.
• If all the data points are identical, there would
be no spread at all. Numerically, the spread
would be zero.
• Ex.: 5 5 5 5 5 5 5 5 5 5 5
Center: 5
Spread: 0
Measures of Spread
• Range,
• Inter-quartile range (IQR),
• Variance / Standard deviation
These measures provide different ways to
quantify the variability of the distribution.
Range
• Range = max. value – min. value
• Example:
Data: 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15
Range = max. – min. = 15 – 2 = 13
Inter-quartile range (IQR)
• the IQR gives the range covered by the
MIDDLE 50% of the ordered data
How to find the IQR?
• Step 1: Arrange the data in increasing order.
• Step 2: Find the median.
• Step 3: Find the median of the lower 50% of
the data. This is called the first quartile of the
distribution and is denoted by Q1.
• Step 4: Find the median of the top 50% of
the data. This is called the third quartile of the
distribution and is denoted by Q3.
IQR
• The middle 50% of the data falls between Q1
and Q3, and therefore: IQR = Q3 - Q1
Example
• Weights of 10 students:
102, 118, 120, 136, 138, 149, 157, 157, 161, 180
$$M = \frac{138 + 149}{2} = 143.5$$
Q1 = 120, Q3 = 157
IQR = 157 – 120 = 37
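The four steps for finding the IQR can be sketched in Python as below. The function names are illustrative. One assumption is labeled in the code: when n is odd, the median itself is left out of both halves. Other textbooks and software use slightly different quartile conventions and may give slightly different values.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)            # Step 1: increasing order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]              # lower 50% of the data
    upper = ordered[n - half:]          # top 50% (assumption: median excluded if n is odd)
    return median_of_sorted(lower), median_of_sorted(upper)


weights = [102, 118, 120, 136, 138, 149, 157, 157, 161, 180]
q1, q3 = quartiles(weights)
print(q1, q3, q3 - q1)  # 120 157 37
```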
Note
• The IQR should be used as a measure of
spread of a distribution only when the median
is used as a measure of center.
Median ↔ IQR
Using the IQR to detect outliers
• The 1.5(IQR) Criterion for Outliers
• An observation is considered a
suspected outlier if it is
– below Q1 - 1.5(IQR) or
– above Q3 + 1.5(IQR)
The 1.5(IQR) Criterion for Outliers
Example 1
• Weights of 10 students:
102, 118, 120, 136, 138, 149, 157, 157, 161, 215

$$M = \frac{138 + 149}{2} = 143.5$$
Q1 = 120, Q3 = 157
IQR = 157 – 120 = 37
Q3 + 1.5(IQR) = 157 + 1.5(37) = 212.5
Anything above 212.5? YES. 215 IS an outlier.
Example 2
• Data:
-15, 8, 9, 12, 14, 19, 22, 23, 23, 45, 50
M = 19, Q1 = 9, Q3 = 23
IQR = 23 – 9 = 14, 1.5(IQR) = 1.5(14) = 21
Anything below Q1 – 1.5(IQR) = 9 – 21 = -12? YES! -15 is an outlier.
Anything above Q3 + 1.5(IQR) = 23 + 21 = 44? YES! 45 and 50 are outliers.
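The 1.5(IQR) criterion can be sketched in Python as follows. The function names are illustrative, and the quartiles are computed as medians of the lower and upper halves, leaving the median out of both halves when n is odd (an assumption that matches the quartiles used in Example 2).

```python
def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    def med(ordered):
        n, mid = len(ordered), len(ordered) // 2
        return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

    ordered = sorted(values)
    half = len(ordered) // 2
    # Assumption: the overall median is excluded from both halves when n is odd.
    return med(ordered[:half]), med(ordered[len(ordered) - half:])


def suspected_outliers(values):
    """Observations below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return [x for x in values if x < low or x > high]


data = [-15, 8, 9, 12, 14, 19, 22, 23, 23, 45, 50]
print(suspected_outliers(data))     # [-15, 45, 50]

weights = [102, 118, 120, 136, 138, 149, 157, 157, 161, 215]
print(suspected_outliers(weights))  # [215]
```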
Five-number summary
• To get a quick summary of both center and
spread, we consider these five numbers:
– Minimum value
– Q1
– Median
– Q3
– Maximum value
Boxplot
• John Tukey invented
another kind of display
to show off the five-number
summary. It is called
a boxplot.
Example 1
• Weights of 10 students:
102, 118, 120, 136, 138, 149, 157, 157, 161, 180
$$M = \frac{138 + 149}{2} = 143.5$$
Min. = 102, Q1 = 120, Q3 = 157, Max. = 180
[Boxplot of the weights, axis from 100 to 180]
Example 2
• Weights of 10 students:
102, 118, 120, 136, 138, 149, 157, 157, 161, 215
$$M = \frac{138 + 149}{2} = 143.5$$
Min. = 102, Q1 = 120, Q3 = 157, "New" Max. = 161
[Boxplot of the weights; the outlier 215 is marked with *]
Example 3
Disease X: years until death (n = 25)

Rank  Position in quarter  Years
1     1                    0.6
2     2                    1.2
3     3                    1.6
4     4                    1.9
5     5                    1.5
6     6                    2.1
7     1                    2.3
8     2                    2.3
9     3                    2.5
10    4                    2.8
11    5                    2.9
12    6                    3.3
13                         3.4
14    1                    3.6
15    2                    3.7
16    3                    3.8
17    4                    3.9
18    5                    4.1
19    6                    4.2
20    1                    4.5
21    2                    4.7
22    3                    4.9
23    4                    5.3
24    5                    5.6
25    6                    6.1

"Five-number summary"
Smallest = min = 0.6
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
Largest = max = 6.1
[Boxplot of years until death for Disease X, vertical axis from 0 to 7]
Interpretation
Low variability
High variability
Comparing distributions
• Boxplots are best
used for side-by-
side comparison
of more than one
distribution.
Summary
• Measures of the center of distributions:
– Mean
– Median
– Mode
• Measures of spread of distributions:
– Range
– IQR
• Using IQR to detect outliers—the 1.5(IQR) rule
• Boxplots
• Variance/Standard deviation
Variance and Standard Deviation:
The idea
• The standard
deviation gives the
average (or typical)
distance between
a data point and
the mean, $\bar{x}$.
The formulas
• We have n observations: $x_1, x_2, \ldots, x_n$
• Variance:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

• Standard deviation:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
Example: Video Store Customers
• The following are the number of customers
who entered a video store in 8 consecutive
hours: 7, 9, 5, 13, 3, 11, 15, 9
• Find the mean and the standard deviation of the
distribution.
Hour 1st 2nd 3rd 4th 5th 6th 7th 8th
# of customers 7 9 5 13 3 11 15 9
$x_1 = 7,\ x_2 = 9,\ x_3 = 5,\ x_4 = 13,\ x_5 = 3,\ x_6 = 11,\ x_7 = 15,\ x_8 = 9$
Dotplot and the Mean
Mean = 9

$$\bar{x} = \frac{7 + 9 + 5 + 13 + 3 + 11 + 15 + 9}{8} = 9$$
Standard deviation: Steps 1, 2, and 3

Observations $x_i$    Deviations $x_i - \bar{x}$    Squared deviations $(x_i - \bar{x})^2$
7                     7 – 9 = -2                    (-2)² = 4
9                     9 – 9 = 0                     (0)² = 0
5                     5 – 9 = -4                    (-4)² = 16
13                    13 – 9 = 4                    (4)² = 16
3                     3 – 9 = -6                    (-6)² = 36
11                    11 – 9 = 2                    (2)² = 4
15                    15 – 9 = 6                    (6)² = 36
9                     9 – 9 = 0                     (0)² = 0
Mean: 9               Sum = 0 (!)                   Sum = 112
Steps 4 and 5
• Variance:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{112}{8 - 1} = 16$$

• Standard deviation:

$$s = \sqrt{\text{Variance}} = \sqrt{16} = 4$$

The "typical" distance from the mean is 4.
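The five steps can be traced in a short Python sketch for the video store data. The function name `sample_variance` and the variable names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n                           # Step 1: the mean
    deviations = [x - x_bar for x in values]          # Step 2: deviations
    squared = [d ** 2 for d in deviations]            # Step 3: squared deviations
    return sum(squared) / (n - 1)                     # Step 4: divide by n - 1


customers = [7, 9, 5, 13, 3, 11, 15, 9]
variance = sample_variance(customers)
std_dev = math.sqrt(variance)                         # Step 5: square root
print(variance, std_dev)  # 16.0 4.0
```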
FAQ about the Standard Deviation
1. Why do we need to square the deviations?
Because the sum of the deviations from the mean is
ALWAYS 0!
2. Why do we divide by n-1 and not by n?
Because we know (question 1) that the sum of the
deviations is always 0, so that knowing n-1 of
them determines the last one. Only n-1 of the
squared deviations can vary freely. The number
n-1 is called the degrees of freedom.
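The fact used in both answers, that the deviations from the mean always sum to 0, follows directly from the definition of the mean, since $\sum x_i = n\bar{x}$:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

For the video store data: $(-2) + 0 + (-4) + 4 + (-6) + 2 + 6 + 0 = 0$.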
FAQ about the Standard Deviation
3. Why do we take the square root?
$s^2 = 16$ is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in "squared customers", which obviously cannot be interpreted. We therefore take the square root in order to go back to the original units of measurement.
Facts about the standard deviation (s)
• s measures the spread about the mean and
should be used only when the mean is chosen
as the measure of center. That is, when the
distribution of the data is roughly symmetric
with no outliers.
Mean ↔ Standard deviation
Facts about the Standard Deviation (s)
• s is always greater than or equal to 0. s = 0 only
when there is no spread, i.e., the data values
are identical.
• s gets larger as the spread increases.
• s has the same units of measurement as the
original observations.
• Like the mean, s is not resistant. It is very
sensitive to outliers.
Calculator (TI-83, TI-84)
Steps:
• 1. Enter your data into a list:
– STAT → EDIT → 1: Edit…
– Enter your data into L1
• 2. Find the mean, median, standard
deviation, five-number summary…
– STAT → CALC → 1: 1-Var Stats
– You see in your window 1-Var Stats; press 2nd 1 to enter L1, giving 1-Var Stats (L1)
Try it: 23, 18, 19, 25, 27, 27, 20, 7, 24
$\bar{x}$ = 21.11111111 (mean)
Σx = 190
Σx² = 4322
Sx = 6.23386807 (standard deviation)
σx = 5.87734718
n = 9 (number of entries)
minX = 7
Q1 = 18.5
Med = 23
Q3 = 26
maxX = 27
(minX, Q1, Med, Q3, maxX: five-number summary)
Measures of Relative Standing
• We can compare values from different data sets
using z-scores:

$$z = \frac{x - \text{mean}}{\text{s.d.}}$$

• A z-score measures the number of standard
deviations that a data value x is from the mean.
• Ordinary values: -2 ≤ z-score ≤ 2
• Unusual values: z-score < -2 or z-score > 2
Example
• IQ scores have a mean of 100 and a standard
deviation of 16. Albert Einstein reportedly had
an IQ of 160. Is Einstein’s IQ score unusual?
$$z = \frac{x - \text{mean}}{\text{s.d.}} = \frac{160 - 100}{16} = 3.75$$

• Since the z-score is higher than 2, we can
conclude that Einstein's IQ score is unusual.
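The z-score calculation and the "unusual value" rule can be sketched in Python. The function names `z_score` and `is_unusual` are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


def is_unusual(z):
    """Unusual values have a z-score below -2 or above 2."""
    return z < -2 or z > 2


z = z_score(160, mean=100, sd=16)
print(z, is_unusual(z))  # 3.75 True
```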
Median
Find the median of the following 9 numbers:
43 54 55 63 67 68 69 77 85
a) 65
b) 64
c) 67
d) 64.6
Median
For the data in the previous question,
43 54 55 63 67 68 69 77 85
Suppose that the last data point is actually 115 instead of 85. What effect would this new maximum have on our value for the median of the dataset?
a) Increase the value of the median.
b) Decrease the value of the median.
c) Not change the value of the median.
Mean
For the data in the previous question,
43 54 55 63 67 68 69 77 85
Suppose that the last data point is actually 115 instead of 85. What effect would this new maximum have on our value for the mean of the dataset?
a) Increase the value of the mean.
b) Decrease the value of the mean.
c) Not change the value of the mean.
Mean vs. median
For the dataset “volumes of milk dispensed into
2-gallon milk cartons,” should you use the
mean or the median to describe the center?
a) Mean
b) Median
Mean vs. median
For the dataset “sales prices of homes in Los
Angeles,” should you use the mean or the
median to describe the center?
a) Mean
b) Median
Mean vs. median
For the dataset “incomes for people in the
United States,” should you use the mean or
the median to describe the center?
a) Mean
b) Median
Boxplots
You have a boxplot for the tar
content of 25 different
cigarettes. What is a plausible
set of values for the five-number
summary?
a) Min = 13, Q1 = 10, Median = 12.6, Q3 = 14, Max = 15
b) Min = 1, Q1 = 8.5, Median = 12.6, Q3 = 15, Max = 17
c) Min = 1, Q1 = 8.5, Median = 11.5, Q3 = 13, Max = 15
d) Min = 8.5, Q1 = 10, Median = 11.5, Q3 = 15, Max = 17
Boxplots
The shape of the boxplot below can be
described as:
a) Bi-modal
b) Left-skewed
c) Right-skewed
d) Symmetric
e) Uniform
Side-by-side boxplots
Look at the following
side-by-side boxplots
and compare the female
and male shoulder girth.
a) Females have a typically smaller shoulder girth than males.
b) Females have a typically larger shoulder girth than males.
c) Females and males have about the same shoulder girths.
Side-by-side boxplots
Look at the following
side-by-side boxplots and
compare the female and
male thigh girth.
a) Females have a typically smaller thigh girth than males.
b) Females have a typically larger thigh girth than males.
c) Females and males have about the same thigh girth.
Comparing two histograms
Compare the centers of Distr. A (Female Shoulder Girth)
and Distr. B (Male Shoulder Girth) shown below.
a) The center of Distr. A is greater than the center of Distr. B.
b) The center of Distr. A is less than the center of Distr. B.
c) The center of Distr. A is equal to the center of Distr. B.
Comparing two histograms
Compare the spreads of Distr. A (Female Shoulder Girth)
and Distr. B (Male Shoulder Girth) shown below.
a) The spread of Distr. A is greater than the spread of Distr. B.
b) The spread of Distr. A is less than the spread of Distr. B.
c) The spread of Distr. A is equal to the spread of Distr. B.
Boxplots
What is the approximate range of the Male Wrist Girth
dataset shown below?
a) 14.5 to 19.5
b) 16.5 to 17
c) 16.5 to 18
d) 17 to 19.5
e) 14.5 to 16.5 and 18 to 19.5
Boxplots
What is the approximate interquartile range of the
Male Wrist Girth dataset shown below?
a) 14.5 to 19.5
b) 16.5 to 17
c) 16.5 to 18
d) 17 to 19.5
e) 14.5 to 16.5 and 18 to 19.5
Outliers
If a dataset contains outliers, which measure of
spread is resistant?
a) Range
b) Interquartile range
c) Standard deviation
d) Variance
Standard deviation
Which of the following statements is TRUE?
a) Standard deviation has no unit of measurement.
b) Standard deviation is either positive or negative.
c) Standard deviation is inflated by outliers.
d) Standard deviation is used even when the mean
is not an appropriate measure of center.
Center and spread
For the following distribution of major league baseball players’ salaries in 1992, which measures of center and spread are more appropriate?
a) Mean and standard deviation
b) Median and interquartile range