
2/22/2009

Chapter 3

Section 3-2: Measures of Center

Section 3-3: Measures of Variation

Section 3-4: Measures of Relative Standing

Section 3-5: Exploratory Data Analysis

Describing Distributions with Numbers

• The overall pattern of the distribution of a quantitative variable is described by its shape, center, and spread.

• By inspecting the histogram we can describe the shape of the distribution, but as we saw, we can only get a rough estimate for the center and spread.

• We need a more precise numerical description of the center and spread of the distribution.

Center and Spread

About the same center

Wide spread

Narrow spread

Measures of Center and Spread

CENTER

• Mean

• Median

• Mode

SPREAD

• Range

• Inter-quartile range (IQR)

• Variance/Standard deviation

Mean: the balance point

• Mean: “x-bar” ($\bar{x}$)

• The sum of the observations divided by the number of observations.

If the n observations are $x_1, x_2, x_3, \ldots, x_n$, then

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example: Mean

$x_1 = 5,\ x_2 = 7,\ x_3 = 3,\ x_4 = 38,\ x_5 = 7$

The mean is

$$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{n} = \frac{\sum_{i=1}^{5} x_i}{n} = \frac{5 + 7 + 3 + 38 + 7}{5} = 12 \text{ hours}$$
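The same calculation can be sketched in a few lines of Python. This is only an illustration of the definition; the function name `mean` and the variable `hours` are chosen here and do not come from the slides.

```python
def mean(observations):
    """Return the sum of the observations divided by their count."""
    if not observations:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(observations) / len(observations)


# The five observations from the example (in hours).
hours = [5, 7, 3, 38, 7]
print(mean(hours))  # 12.0
```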


Median

• The median M is the midpoint of the distribution (like the median strip in a road).

• It is the number such that half of the observations fall above and half fall below.

How to find the median?

1. Order the data from smallest to largest.

2. If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n+1)/2 spot in the ordered list.

3. If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered list.
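The procedure above can be sketched in Python as follows. The function name `median` is illustrative only; the one detail to watch is that the slides count spots starting from 1, while Python lists are indexed from 0.

```python
def median(observations):
    """Median found by ordering the data and taking the center of the list."""
    ordered = sorted(observations)      # order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2 spot (1-based) is index n // 2 (0-based).
        return ordered[middle]
    # n even: average the n/2 and n/2 + 1 spots (1-based).
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([5, 7, 3, 38, 7]))  # 7
print(median([5, 7, 3, 38]))     # 6.0
```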

Example: Median

$x_1 = 5,\ x_2 = 7,\ x_3 = 3,\ x_4 = 38,\ x_5 = 7$

• Step 1: order the data:

3 5 7 7 38

Median = 7

Example 2: Median

If the data are

$x_1 = 5,\ x_2 = 7,\ x_3 = 3,\ x_4 = 38$

• Step 1: order the data:

3 5 7 38

$$\text{Median} = \frac{5 + 7}{2} = 6$$

Example

Data set A:

64 65 66 68 70 71 73

• median is 68

• mean is 68.1

Data set B:

64 65 66 68 70 71 730 (730 is an outlier)

• median is still 68

• mean is 162

The mean is very sensitive to outliers, while the median is resistant to outliers.
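A quick way to check the numbers for data sets A and B is Python's standard `statistics` module. The helper name `center_summary` is made up for this sketch.

```python
from statistics import mean, median


def center_summary(data):
    """Return (mean rounded to one decimal, median) for a data set."""
    return round(mean(data), 1), median(data)


data_a = [64, 65, 66, 68, 70, 71, 73]
data_b = [64, 65, 66, 68, 70, 71, 730]  # same as A, but the last value is an outlier

print("A:", center_summary(data_a))
print("B:", center_summary(data_b))
```

Replacing 73 by 730 moves the mean from about 68.1 to 162, while the median stays at 68.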


Comparing the mean and the median

Mean

• Describes the center as an average value, where the actual values of the data points play an important role.

• Sensitive to outliers

Median

• Locates the middle value as the center, and the order of the data is the key to finding it.

• Not sensitive to outliers

Symmetric distributions with no outliers

Mean ≈ Median

Left-skewed distributions

Mean < Median

The tail “pulls” the mean.

Right-skewed distributions

Median < Mean

Median = 68, Mean = 72

The tail “pulls” the mean.

Which measure of center to use?

• We will therefore use the mean as a measure of center for symmetric distributions with no outliers.

• Otherwise, the median will be a more appropriate measure of the center of our data.

Another measure of center: Mode

• Mode is the most frequent value in the data set

• Data: 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15

Mode = 4

• Data: 2, 4, 4, 4, 5, 5, 5, 6, 7, 8, 10, 15

Mode = 4, 5

• Data: 2, 4, 5, 6, 7, 8, 10, 15

No mode
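The three cases above (one mode, two modes, no mode) can be sketched in Python. The function name `modes` is illustrative, and returning an empty list when no value repeats is an assumption made here to match the "No mode" case.

```python
from collections import Counter


def modes(observations):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(observations)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once, so no value is "most frequent".
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]))     # [4]
print(modes([2, 4, 4, 4, 5, 5, 5, 6, 7, 8, 10, 15]))  # [4, 5]
print(modes([2, 4, 5, 6, 7, 8, 10, 15]))              # []
```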


Summary: Measures of Center

• The three main numerical measures for the center of a distribution are the mean $\bar{x}$, the median M, and the mode.

• The mean is the average value, while the median is the middle value. The mode is the most frequent value.

• The mean is very sensitive to outliers, while the median is resistant to outliers.

• The mean is an appropriate measure of center only for symmetric distributions with no outliers. In all other cases, the median should be used to describe the center of the distribution.

Spread

• Spread: how far from the center the data tend to range.

• If all the data points were identical, there would be no spread at all. Numerically, the spread would be zero.

• Ex.: 5 5 5 5 5 5 5 5 5 5 5

Center: 5

Spread: 0

Measures of Spread

• Range,

• Inter-quartile range (IQR),

• Variance / Standard deviation

These measures provide different ways to quantify the variability of the distribution.

Range

• Range = max. value – min. value

• Example:

Data: 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15

Range= max.- min. = 15-2 = 13

Inter-quartile range (IQR)

• The IQR gives the range covered by the MIDDLE 50% of the ordered data.

How to find the IQR?

• Step 1: Arrange the data in increasing order.

• Step 2: Find the median.

• Step 3: Find the median of the lower 50% of the data. This is called the first quartile of the distribution and is denoted by Q1.

• Step 4: Find the median of the top 50% of the data. This is called the third quartile of the distribution and is denoted by Q3.

IQR

• The middle 50% of the data falls between Q1 and Q3, and therefore: IQR = Q3 - Q1

Example

• Weights of 10 students:

102, 118, 120, 136, 138, 149, 157, 157, 161, 180

$M = \frac{138 + 149}{2} = 143.5$, Q1 = 120, Q3 = 157

IQR = 157-120 = 37
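The four steps can be sketched in Python as below. One assumption is made explicit here: when n is odd, the median itself is left out of both halves before the quartiles are found. Other textbooks and software use slightly different quartile conventions, so results can differ for some data sets. The function names are illustrative.

```python
from statistics import median


def quartiles(observations):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(observations)          # Step 1: increasing order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2                           # Step 2: locate the middle
    lower = ordered[:half]
    # Assumption: for odd n the middle observation is excluded from both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)     # Steps 3 and 4


def iqr(observations):
    """Inter-quartile range: Q3 - Q1."""
    q1, q3 = quartiles(observations)
    return q3 - q1


weights = [102, 118, 120, 136, 138, 149, 157, 157, 161, 180]
print(quartiles(weights))  # (120, 157)
print(iqr(weights))        # 37
```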


Note

• The IQR should be used as a measure of spread of a distribution only when the median is used as a measure of center.

Median goes with IQR.

Using the IQR to detect outliers

• The 1.5(IQR) Criterion for Outliers

• An observation is considered a suspected outlier if it is

– below Q1 - 1.5(IQR) or

– above Q3 + 1.5(IQR)

The 1.5(IQR) Criterion for Outliers

Example 1

• Weights of 10 students:

102, 118, 120, 136, 138, 149, 157, 157, 161, 215

$M = \frac{138 + 149}{2} = 143.5$, Q1 = 120, Q3 = 157

IQR = 157 - 120 = 37

Q3 + 1.5(IQR) = 157 + 1.5(37) = 212.5

Anything above 212.5? YES. 215 IS an outlier.

Example 2

• Data:

-15, 8, 9, 12, 14, 19, 22, 23, 23, 45, 50

M = 19, Q1 = 9, Q3 = 23

IQR = 23 - 9 = 14, 1.5(IQR) = 1.5(14) = 21

Anything below Q1 - 1.5(IQR) = 9 - 21 = -12? YES! -15 is an outlier.

Anything above Q3 + 1.5(IQR) = 23 + 21 = 44? YES! 45 and 50 are outliers.
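The 1.5(IQR) criterion is easy to sketch in Python once Q1 and Q3 are known. In this illustration the quartiles are passed in directly, using the values from the two examples; the function names `iqr_fences` and `suspected_outliers` are invented for the sketch.

```python
def iqr_fences(q1, q3):
    """Return the lower and upper cut-offs Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(observations, q1, q3):
    """Return the observations that fall outside the 1.5(IQR) cut-offs."""
    low, high = iqr_fences(q1, q3)
    return [x for x in observations if x < low or x > high]


weights = [102, 118, 120, 136, 138, 149, 157, 157, 161, 215]
print(iqr_fences(120, 157))                   # (64.5, 212.5)
print(suspected_outliers(weights, 120, 157))  # [215]

data = [-15, 8, 9, 12, 14, 19, 22, 23, 23, 45, 50]
print(iqr_fences(9, 23))                      # (-12.0, 44.0)
print(suspected_outliers(data, 9, 23))        # [-15, 45, 50]
```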

Five-number summary

• To get a quick summary of both center and

spread, we consider these five numbers:

– Minimum value

– Q1

– Median

– Q3

– Maximum value

Boxplot

• John Tukey invented another kind of display to show off the five-number summary. It is called a boxplot.

Example 1

• Weights of 10 students:

102, 118, 120, 136, 138, 149, 157, 157, 161, 180

Min. = 102, Q1 = 120, $M = \frac{138 + 149}{2} = 143.5$, Q3 = 157, Max. = 180

[Boxplot drawn on an axis running from 100 to 180]

Example 2

• Weights of 10 students:

102, 118, 120, 136, 138, 149, 157, 157, 161, 215

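Min. = 102, Q1 = 120, $M = \frac{138 + 149}{2} = 143.5$, Q3 = 157, “new” max. = 161 (the largest observation that is not an outlier)

[Boxplot drawn on an axis running from 100 to 180; the outlier 215 is marked with *]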

Example 3

“Five-number summary” for Disease X (years until death), 25 ordered observations:

Rank   Position in quarter   Years until death
  1            1                   0.6
  2            2                   1.2
  3            3                   1.6
  4            4                   1.5
  5            5                   1.9
  6            6                   2.1
  7            1                   2.3
  8            2                   2.3
  9            3                   2.5
 10            4                   2.8
 11            5                   2.9
 12            6                   3.3
 13                                3.4
 14            1                   3.6
 15            2                   3.7
 16            3                   3.8
 17            4                   3.9
 18            5                   4.1
 19            6                   4.2
 20            1                   4.5
 21            2                   4.7
 22            3                   4.9
 23            4                   5.3
 24            5                   5.6
 25            6                   6.1

Smallest = min = 0.6

Q1 = first quartile = 2.2

M = median = 3.4

Q3 = third quartile = 4.35

Largest = max = 6.1

[Boxplot: Disease X, vertical axis "Years until death" from 0 to 7]

Interpretation

Low variability

High variability

Comparing distributions

• Boxplots are best used for side-by-side comparison of more than one distribution.

Summary

• Measures of the center of distributions:

– Mean

– Median

– Mode

• Measures of spread of distributions:

– Range

– IQR

• Using IQR to detect outliers—the 1.5(IQR) rule

• Boxplots

• Variance/Standard deviation


Variance and Standard Deviation: The idea

• The standard deviation gives the average (or typical) distance between a data point and the mean, $\bar{x}$.

The formulas

• We have n observations: $x_1, x_2, \ldots, x_n$

• Variance:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

• Standard deviation:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

Example: Video Store Customers

• The following are the numbers of customers who entered a video store in 8 consecutive hours: 7, 9, 5, 13, 3, 11, 15, 9

• Find the mean and the standard deviation of the distribution.

Hour             1st  2nd  3rd  4th  5th  6th  7th  8th
# of customers     7    9    5   13    3   11   15    9

$x_1 = 7,\ x_2 = 9,\ x_3 = 5,\ x_4 = 13,\ x_5 = 3,\ x_6 = 11,\ x_7 = 15,\ x_8 = 9$

Dotplot and the Mean

Mean = 9

$$\bar{x} = \frac{7 + 9 + 5 + 13 + 3 + 11 + 15 + 9}{8} = 9$$

Standard deviation: Steps 1, 2, and 3

Observations      Deviations            Squared deviations
$x_i$             $x_i - \bar{x}$       $(x_i - \bar{x})^2$
 7                 7 – 9 = -2           (-2)^2 = 4
 9                 9 – 9 = 0            (0)^2 = 0
 5                 5 – 9 = -4           (-4)^2 = 16
13                13 – 9 = 4            (4)^2 = 16
 3                 3 – 9 = -6           (-6)^2 = 36
11                11 – 9 = 2            (2)^2 = 4
15                15 – 9 = 6            (6)^2 = 36
 9                 9 – 9 = 0            (0)^2 = 0
Mean: 9           Sum = 0               Sum = 112

Steps 4 and 5

• Variance

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{112}{8 - 1} = 16$$

• Standard deviation

$$s = \sqrt{\text{Variance}} = \sqrt{16} = 4$$

The “typical” distance from the mean is 4.
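The five steps of the video store example can be followed in a short Python sketch. The function name `sample_variance` and the variable names are chosen for illustration; the division by n - 1 follows the variance formula of this chapter.

```python
from math import sqrt


def sample_variance(observations):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(observations) / n                               # Step 1: the mean
    deviations = [x - x_bar for x in observations]              # Step 2: deviations
    squared_deviations = [d ** 2 for d in deviations]           # Step 3: squared deviations
    return sum(squared_deviations) / (n - 1)                    # Step 4: variance


customers = [7, 9, 5, 13, 3, 11, 15, 9]
variance = sample_variance(customers)
standard_deviation = sqrt(variance)                             # Step 5: square root

print(variance)            # 16.0
print(standard_deviation)  # 4.0
```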

FAQ about the Standard Deviation

1. Why do we need to square the deviations?

Because the sum of the deviations from the mean is ALWAYS 0!

2. Why do we divide by n-1 and not by n?

Because we know (question 1) that the sum of the deviations is always 0, so that knowing n-1 of them determines the last one. Only n-1 of the squared deviations can vary freely. The number n-1 is called the degrees of freedom.
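The claim in question 1 follows directly from the definition of the mean. A short derivation, using only $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

For the video store data, the deviations are -2, 0, -4, 4, -6, 2, 6, 0, and they do add up to 0.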

FAQ about the Standard Deviation

3. Why do we take the square root?

$s^2 = 16$ is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in "squared customers", which obviously cannot be interpreted. We therefore take the square root in order to go back to the original units of measurement.

Facts about the standard deviation (s)

• s measures the spread about the mean and should be used only when the mean is chosen as the measure of center, that is, when the distribution of the data is roughly symmetric with no outliers.

Mean goes with standard deviation.

Facts about the Standard Deviation (s)

• s is always greater than or equal to 0. s = 0 only when there is no spread, i.e., the data values are identical.

• s gets larger as the spread increases.

• s has the same units of measurement as the original observations.

• Like the mean, s is not resistant. It is very sensitive to outliers.

Calculator (TI-83, TI-84)

Steps:

• 1. Enter your data into a list:

– STAT → EDIT → 1: Edit…

– Enter your data into L1

• 2. Find the mean, median, standard deviation, five-number summary…

– STAT → CALC → 1: 1-Var Stats

– You see 1-Var Stats in your window; enter L1 by pressing 2nd 1

Try it: 23, 18, 19, 25, 27, 27, 20, 7, 24

x̄ = 21.11111111 (Mean)

Σx = 190

Σx² = 4322

Sx = 6.23386807 (Standard deviation)

σx = 5.87734718

n = 9 (Number of entries)

minX = 7

Q1 = 18.5

Med = 23

Q3 = 26

maxX = 27

The last five values (minX, Q1, Med, Q3, maxX) are the five-number summary.

Measures of Relative Standing

• We can compare values from different data sets using z-scores:

$$z = \frac{x - \text{mean}}{\text{s.d.}}$$

• A z-score measures the number of standard deviations that a data value x is from the mean.

• Ordinary values: -2 ≤ z-score ≤ 2

• Unusual values: z-score < -2 or z-score > 2

Example

• IQ scores have a mean of 100 and a standard deviation of 16. Albert Einstein reportedly had an IQ of 160. Is Einstein’s IQ score unusual?

$$z = \frac{x - \text{mean}}{\text{s.d.}} = \frac{160 - 100}{16} = 3.75$$

• Since the z-score is higher than 2, we can conclude that Einstein’s IQ score is unusual.
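The z-score calculation and the "unusual" rule can be sketched in Python. The function names `z_score` and `is_unusual` are illustrative; the cut-off of 2 standard deviations is the one stated on the slide.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


def is_unusual(z):
    """Unusual values have a z-score below -2 or above 2."""
    return z < -2 or z > 2


z = z_score(160, mean=100, sd=16)
print(z)              # 3.75
print(is_unusual(z))  # True
```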

Median

Find the median of the following 9 numbers:

43 54 55 63 67 68 69 77 85

a) 65

b) 64

c) 67

d) 64.6

Median

For the data in the previous question,

43 54 55 63 67 68 69 77 85

Suppose that the last data point is actually 115 instead of 85. What effect would this new maximum have on our value for the median of the dataset?

a) Increase the value of the median.

b) Decrease the value of the median.

c) Not change the value of the median.


Mean

For the data in the previous question,

43 54 55 63 67 68 69 77 85

Suppose that the last data point is actually 115 instead of 85. What effect would this new maximum have on our value for the mean of the dataset?

a) Increase the value of the mean.

b) Decrease the value of the mean.

c) Not change the value of the mean.


Mean vs. median

For the dataset “volumes of milk dispensed into 2-gallon milk cartons,” should you use the mean or the median to describe the center?

a) Mean

b) Median

Mean vs. median

For the dataset “sales prices of homes in Los Angeles,” should you use the mean or the median to describe the center?

a) Mean

b) Median

Mean vs. median

For the dataset “incomes for people in the United States,” should you use the mean or the median to describe the center?

a) Mean

b) Median

Boxplots

You have a boxplot for the tar content of 25 different cigarettes. What is a plausible set of values for the five-number summary?

a) Min = 13, Q1 = 10, Median = 12.6, Q3 = 14, Max = 15

b) Min = 1, Q1 = 8.5, Median = 12.6, Q3 = 15, Max = 17

c) Min = 1, Q1 = 8.5, Median = 11.5, Q3 = 13, Max = 15

d) Min = 8.5, Q1 = 10, Median = 11.5, Q3 = 15, Max = 17


Boxplots

The shape of the boxplot below can be described as:

a) Bi-modal

b) Left-skewed

c) Right-skewed

d) Symmetric

e) Uniform

Side-by-side boxplots

Look at the following side-by-side boxplots and compare the female and male shoulder girth.

a) Females have a typically smaller shoulder girth than males.

b) Females have a typically larger shoulder girth than males.

c) Females and males have about the same shoulder girths.


Side-by-side boxplots

Look at the following side-by-side boxplots and compare the female and male thigh girth.

a) Females have a typically smaller thigh girth than males.

b) Females have a typically larger thigh girth than males.

c) Females and males have about the same thigh girth.


Comparing two histograms

Compare the centers of Distr. A (Female Shoulder Girth) and Distr. B (Male Shoulder Girth) shown below.

a) The center of Distr. A is greater than the center of Distr. B.

b) The center of Distr. A is less than the center of Distr. B.

c) The center of Distr. A is equal to the center of Distr. B.

Comparing two histograms

Compare the spreads of Distr. A (Female Shoulder Girth) and Distr. B (Male Shoulder Girth) shown below.

a) The spread of Distr. A is greater than the spread of Distr. B.

b) The spread of Distr. A is less than the spread of Distr. B.

c) The spread of Distr. A is equal to the spread of Distr. B.


Boxplots

What is the approximate range of the Male Wrist Girth dataset shown below?

a) 14.5 to 19.5

b) 16.5 to 17

c) 16.5 to 18

d) 17 to 19.5

e) 14.5 to 16.5 and 18 to 19.5


Boxplots

What is the approximate interquartile range of the Male Wrist Girth dataset shown below?

a) 14.5 to 19.5

b) 16.5 to 17

c) 16.5 to 18

d) 17 to 19.5

e) 14.5 to 16.5 and 18 to 19.5


Outliers

If a dataset contains outliers, which measure of

spread is resistant?

a) Range

b) Interquartile range

c) Standard deviation

d) Variance

Standard deviation

Which of the following statements is TRUE?

a) Standard deviation has no unit of measurement.

b) Standard deviation is either positive or negative.

c) Standard deviation is inflated by outliers.

d) Standard deviation is used even when the mean is not an appropriate measure of center.

Center and spread

For the following distribution of major league baseball players’ salaries in 1992, which measures of center and spread are more appropriate?

a) Mean and standard deviation

b) Median and interquartile range