chapter 5 describing distributions numerically. describing the distribution center median (.5...

Post on 05-Jan-2016


Chapter 5

Describing Distributions Numerically

Describing the Distribution

Center
• Median (.5 quantile, 2nd quartile, 50th percentile)
• Mean

Spread
• Range
• Interquartile Range
• Standard Deviation

Median

Literally = middle number (data value)

Has the same units as the data

When n (number of observations) is odd:

• Order the data from smallest to largest

• Median is the middle number on the list: the (n+1)/2-th number from the smallest value

• Ex: If n=11, median is the (11+1)/2 = 6th number from the smallest value

• Ex: If n=37, median is the (37+1)/2 = 19th number from the smallest value

Example – Frank Thomas

Career Home Runs 4 7 15 18 24 28 29 32 35 38 40 40 41 42 43

Remember to order the values, if they aren’t already in order!

• 15 observations: (15+1)/2 = 8th observation from the bottom

• Median = 32 HRs

Median

When n is even:

• Order the data from smallest to largest

• Median is the average of the two middle numbers

• (n+1)/2 will be halfway between these two numbers

• Ex: If n=10, (10+1)/2 = 5.5, so the median is the average of the 5th and 6th numbers from the smallest value

Example – Ryne Sandberg

Career Home Runs 0 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40

Remember to order the values if they aren’t already in order!

• 16 observations: (16+1)/2 = 8.5, so average the 8th and 9th observations from the bottom

• Median = average of 16 and 19

• Median = 17.5 HRs
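The (n+1)/2 rule for odd and even n can be sketched in a few lines of Python. This is only an illustration of the steps on the slides; the function name `median_by_position` and the variable names are assumptions made for this example.

```python
def median_by_position(values):
    """Median using the (n + 1) / 2 position rule from the slides."""
    ordered = sorted(values)  # always order the data first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1) / 2-th value, which is 0-based index n // 2
        return ordered[mid]
    # n even: (n + 1) / 2 falls halfway between two positions,
    # so average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

print(median_by_position(thomas))    # 32   (8th of 15 values)
print(median_by_position(sandberg))  # 17.5 (average of 16 and 19)
```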

Mean

Ordinary average

• Add up all observations

• Divide by the number of observations

Has the same units as the data

Formula

• n observations

• y1, y2, y3, …, yn are the values

Mean

$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n} = \frac{\sum y}{n} = \frac{1}{n}\sum y$$

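Examples

Thomas

$$\bar{y} = \frac{4 + 7 + 15 + 18 + \cdots + 43}{15} = \frac{436}{15} \approx 29.07 \text{ HRs}$$

Sandberg

$$\bar{y} = \frac{0 + 5 + 7 + 8 + \cdots + 40}{16} = \frac{282}{16} = 17.625 \text{ HRs}$$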
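As a quick check of the mean formula, the two home run lists can be averaged directly. This is a minimal sketch; the helper name `mean_of` is an assumption made for this example. For the 15 Thomas values as listed, the sum is 436.

```python
def mean_of(values):
    """Ordinary average: add up all observations, divide by how many there are."""
    return sum(values) / len(values)


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

print(sum(thomas), len(thomas))      # 436 15
print(round(mean_of(thomas), 2))     # 29.07
print(sum(sandberg), len(sandberg))  # 282 16
print(mean_of(sandberg))             # 17.625
```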

Mean vs. Median

Median = middle number

Mean = value where histogram balances

Mean and median are similar when

• Data are symmetric

Mean and median are different when

• Data are skewed

• There are outliers

Mean vs. Median

Mean is influenced by unusually high or unusually low values

Example: Income in a small town of 6 people

$25,000 $27,000 $29,000 $35,000 $37,000 $38,000

• The mean income is about $31,830

• The median income is $32,000

Mean vs. Median

Bill Gates moves to town

$25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000

• The mean income is $5,741,571

• The median income is $35,000

Mean is pulled by the outlier; median is not

Mean is not a good center of these data
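The income example can be reproduced with Python's standard `statistics` module. This is a sketch for illustration; the variable names are assumptions made for this example.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]  # one very large income is added

# Six incomes: mean and median are close together.
print(round(mean(incomes)), median(incomes))            # about 31833 and 32000

# Seven incomes: the mean jumps into the millions, the median barely moves.
print(round(mean(with_outlier)), median(with_outlier))  # about 5741571 and 35000
```

One extra observation moves the mean by more than five million dollars, while the median shifts by only $3,000.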

Mean vs. Median

Skewness pulls the mean in the direction of the tail

• Skewed to the right = mean > median

• Skewed to the left = mean < median

Outliers pull the mean in their direction

• Large outlier = mean > median

• Small outlier = mean < median

Spread

Range = maximum – minimum

Thomas

• Min = 4, Max = 43, Range = 43 - 4 = 39 HRs

Sandberg

• Min = 0, Max = 40, Range = 40 - 0 = 40 HRs

Spread

Range is a very basic measure of spread

• It is highly affected by outliers

• Makes spread appear larger than reality

Ex. The annual numbers of deaths from tornadoes in the U.S. from 1990 to 2000:

53 39 39 33 69 30 25 67 130 94 40

• Range with outlier: 130 – 25 = 105 deaths

• Range without outlier: 94 – 25 = 69 deaths
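The effect of a single outlier on the range can be checked with a short sketch. The helper name `value_range` is an assumption made for this example, and the 130 value is dropped by hand here only to mirror the slide.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


# Annual tornado deaths, 1990 to 2000, as listed above
deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
without_outlier = [d for d in deaths if d != 130]

print(value_range(deaths))           # 105
print(value_range(without_outlier))  # 69
```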

Spread

Interquartile Range (IQR)

First Quartile (Q1)

• Larger than about 25% of the data

Third Quartile (Q3)

• Larger than about 75% of the data

IQR = Q3 – Q1

• Spans the center (middle) 50% of the values

Finding Quartiles

Order the data

Split into two halves at the median

• When n is odd, include the median in both halves

• When n is even, do not include the median in either half

Q1 = median of the lower half

Q3 = median of the upper half

Example – Frank Thomas

Order the values (15 values)

4 7 15 18 24 28 29 32 35 38 40 40 41 42 43

Lower Half = 4 7 15 18 24 28 29 32

• Q1 = Median of lower half = 21 HRs

Upper Half = 32 35 38 40 40 41 42 43

• Q3 = Median of upper half = 40 HRs

IQR = Q3 – Q1 = 40 – 21 = 19 HRs

Example – Ryne Sandberg

Order the values (16 values)

0 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40

Lower Half = 0 5 7 8 9 12 14 16

• Q1 = Median of lower half = 8.5 HRs

Upper Half = 19 19 25 26 26 26 30 40

• Q3 = Median of upper half = 26 HRs

IQR = Q3 – Q1 = 26 – 8.5 = 17.5 HRs

Five Number Summary

Minimum, Q1, Median, Q3, Maximum

Examples

Thomas

• Min = 4 HRs, Q1 = 21 HRs, Median = 32 HRs, Q3 = 40 HRs, Max = 43 HRs

Sandberg

• Min = 0 HRs, Q1 = 8.5 HRs, Median = 17.5 HRs, Q3 = 26 HRs, Max = 40 HRs
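The quartile method from the slides (split at the median, keeping the median in both halves when n is odd) and the five number summary can be sketched as follows. The function names are assumptions made for this example. Library routines such as `statistics.quantiles` or `numpy.percentile` use other interpolation conventions and may report slightly different quartiles for the same data.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the median falls between two values, so neither half includes it
        lower, upper = ordered[:half], ordered[half:]
    return {
        "min": ordered[0],
        "q1": median_of_sorted(lower),
        "median": median_of_sorted(ordered),
        "q3": median_of_sorted(upper),
        "max": ordered[-1],
    }


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

for name, data in [("Thomas", thomas), ("Sandberg", sandberg)]:
    summary = five_number_summary(data)
    iqr = summary["q3"] - summary["q1"]
    print(name, summary, "IQR =", iqr)
```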

Graph of Five Number Summary

Boxplot

• Box between Q1 and Q3

• Line in the box marks the median

• Lines extend out to minimum and maximum

Best used for comparisons

Use this simpler method

Example – Thomas & Sandberg

Boxplot of Thomas Home Runs

• Box from 21 to 40

• Line in box at 32

• Lines extend out from box to 4 and 43

Boxplot of Sandberg Home Runs

• Box from 8.5 to 26

• Line in box at 17.5

• Lines extend out from box to 0 and 40

Side by Side Boxplots of Thomas & Sandberg Home Runs

Spread

Standard deviation

• “Average” spread from mean

• Most common measure of spread (although it is influenced by skewness and outliers)

• Denoted by letter s

• Make a table when calculating by hand

Standard Deviation

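$$s = \sqrt{\frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}} = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} = \sqrt{\frac{1}{n - 1}\sum (y - \bar{y})^2}$$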

Example – Deaths from Tornadoes

$y$      $y - \bar{y}$             $(y - \bar{y})^2$
53       53 - 56.27 = -3.27        10.69
39       39 - 56.27 = -17.27       298.25
39       39 - 56.27 = -17.27       298.25
33       33 - 56.27 = -23.27       541.49
69       69 - 56.27 = 12.73        162.05
30       30 - 56.27 = -26.27       690.11
25       25 - 56.27 = -31.27       977.81
67       67 - 56.27 = 10.73        115.13
130      130 - 56.27 = 73.73       5436.11
94       94 - 56.27 = 37.73        1423.55
40       40 - 56.27 = -16.27       264.71

$$s = \sqrt{\frac{10.69 + 298.25 + \cdots + 264.71}{11 - 1}} = 31.97 \text{ deaths}$$
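The hand calculation for the tornado data (deviation from the mean, squared deviation, divide by n - 1, square root) can be sketched in Python. The function name `sample_std` is an assumption made for this example. The code uses the unrounded mean, so the printed rows may differ from the table in the second decimal place.

```python
import math


def sample_std(values):
    """Standard deviation s with the n - 1 divisor, as in the formula above."""
    n = len(values)
    y_bar = sum(values) / n
    squared_devs = [(y - y_bar) ** 2 for y in values]
    return math.sqrt(sum(squared_devs) / (n - 1))


deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
y_bar = sum(deaths) / len(deaths)

# One row per observation: y, y - mean, (y - mean)^2
for y in deaths:
    dev = y - y_bar
    print(f"{y:>4} {dev:>8.2f} {dev ** 2:>9.2f}")

print(round(y_bar, 2))               # 56.27
print(round(sample_std(deaths), 2))  # 31.97
```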

Example – Frank Thomas

Find the standard deviation of the number of home runs given the following statistic:

$$\sum (y - \bar{y})^2 = 2329.74$$

$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} = \sqrt{\frac{2329.74}{15 - 1}} = 12.9 \text{ HRs}$$

Properties of s

s = 0 only when all observations are equal; otherwise, s > 0

s has the same units as the data

s is not resistant

• Skewness and outliers affect s, just like the mean

Tornado Example:

• s with outlier: 31.97 deaths

• s without outlier: 21.70 deaths

Which summaries should you use?

What numbers are affected by outliers?

• Mean

• Standard deviation

• Range

What numbers are not affected by outliers?

• Median

• IQR
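To see which summaries react to an outlier, the tornado deaths can be summarized with and without the 130 value. This is a sketch under stated assumptions: the names `iqr` and `summarize` are made up for this example, and the IQR uses the median-of-halves rule from the quartile slides rather than a library default.

```python
from statistics import mean, median, stdev


def iqr(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + (n % 2)]  # keeps the median when n is odd
    upper = ordered[half:]            # keeps the median when n is odd
    return median(upper) - median(lower)


def summarize(values):
    return {
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),
        "range": max(values) - min(values),
        "median": median(values),
        "iqr": iqr(values),
    }


deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
without_outlier = [d for d in deaths if d != 130]

print(summarize(deaths))
print(summarize(without_outlier))
```

On these data the mean, standard deviation, and range each change substantially when 130 is dropped, while the median and IQR shift only slightly.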

Which summaries should you use?

Five Number Summary

• Skewed data

• Data with outliers

Mean and Standard Deviation

• Symmetric data

ALWAYS PLOT YOUR DATA!!
