lecture notes 3 - colorado state universityvollmer/stat307pdfs/ln3_2017.pdf · lecture notes 3:...

Lecture Notes 3: Data Summarization

Highlights:

• Average
• Median
• Quartiles
• 5-number summary (and relation to boxplots)
• Outliers
• Range & IQR
• Variance and standard deviation
• Determining shape using mean & median


Some important characteristics of a data set

Location: Where is the data set “located” along a number line? Where is its center?

Spread: How dispersed (i.e. spread out) is the data?

Outliers: Are there any unusual values in the data set?

Shape: What is the shape of the distribution of values in the data set?


Location Statistics: Mean, Median & Quartiles


• In these notes, we will look at some common descriptive statistics that are useful for summarizing a data set.

• Recall that a statistic is any number calculated from a set of data.

• The most succinct way to describe the location of a data set is to identify its center.

• Two statistics are commonly used to describe the center: the mean and the median.

Sample average

• The sample average (a.k.a. mean) is the sum of the data divided by the sample size.

• We denote the mean using $\bar{x}$, or “x bar”.

• The sample size is the number of observations in the sample, and is denoted “n”.

• The sum of all the observations in a sample is denoted by $\sum x_i$.

• So, our formula for the sample mean is

$$\bar{x} = \frac{\sum x_i}{n}$$

Sample Average Example

• Suppose we are interested in the average undulation rate (in Hz) of a paradise tree snake, which undulates after jumping from a tree in order to glide away.

• We take a sample of n = 8 snakes and somehow measure the rates at which they undulate as they propel themselves from a source.

• The eight observed rates are 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6


Sample Average Example

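So, for this sample, we can compute:

$$\bar{x} = \frac{\sum x_i}{n} = \frac{0.9 + 1.4 + 1.2 + 1.2 + 1.3 + 2.0 + 1.4 + 1.6}{8} = \frac{11.0}{8} = 1.375 \text{ Hz}$$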
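The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula for the sample mean; the function name `sample_mean` is made up for this example and is not part of the notes.

```python
# Undulation rates (Hz) for the n = 8 snakes in the example.
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]


def sample_mean(data):
    """Sum of the observations divided by the sample size n."""
    return sum(data) / len(data)


x_bar = sample_mean(rates)
print(round(x_bar, 3))  # 1.375
```

Rounding before printing hides tiny floating-point differences that can appear when decimals are summed.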


Median

• If you put data in order from the smallest to the largest values, the number in the middle is called the median.

• The median separates the bottom 50% of the data from the top 50% of the data.

• If the sample size is odd, the median will be a value in your sample. If the sample size is even, the median will be “between” the middle two numbers in your sample.


Computing the median

1) Order the data set, smallest to largest.

2) Compute the rank of the median using Rank = (n + 1)/2. The rank tells you which ordered observation will be the median.

3) If the rank is an integer, the median is the observation at that position in the sorted data set. Otherwise, the median is the average of the two surrounding observations.

For instance, if rank = 5, then the median is the 5th ordered observation. If rank = 5.5, then the median is the average of the 5th and 6th ordered observations.
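The three steps above can be sketched in Python as follows. The function name `median_by_rank` and the two small input lists are hypothetical, chosen only to show one integer rank and one fractional rank; the function assumes a non-empty data set.

```python
def median_by_rank(data):
    """Median via the rank rule: Rank = (n + 1) / 2 on the sorted data."""
    ordered = sorted(data)            # step 1: order smallest to largest
    n = len(ordered)
    rank = (n + 1) / 2                # step 2: 1-based rank of the median
    if rank.is_integer():             # step 3a: integer rank, take that observation
        return ordered[int(rank) - 1]
    lower = ordered[int(rank) - 1]    # step 3b: observation just below the rank
    upper = ordered[int(rank)]        #          observation just above the rank
    return (lower + upper) / 2


print(median_by_rank([3, 1, 2]))      # rank 2   -> 2
print(median_by_rank([4, 1, 3, 2]))   # rank 2.5 -> 2.5
```

Ranks count from 1 while Python lists index from 0, which is why the code subtracts 1 when looking up an observation.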


Computing the Median

• The data set below is already ordered. There are 19 observations.

49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151

• Find the rank of the median using (n+1)/2: (19 + 1)/2 = 10.

• Now go to this observation by counting from the start of the data set to the rank of the median. The 10th ordered observation is 96, so the median is 96.

• You can verify that this is the median by making sure that there are the same number of observations above it as there are below it.


Computing the Median

• The data set below is already ordered. There are 20 observations.

49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175

• Find the rank of the median using (n+1)/2: (20 + 1)/2 = 10.5.

• In this case, the rank is between two integers, so the median will be the average of the 10th and 11th ordered observations: (96 + 105)/2 = 100.5.


Location Statistics: Quartiles

• The median breaks the data set into two halves

• Quartiles break the data set into 4 quarters

• The lower quartile, Q1, is the “median” of all the data below the overall median.

• The upper quartile, Q3, is the “median” of all the data above the overall median.


Computing Quartiles

49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175

Here, there are 10 observations below the median. We can find their “median”, Q1, in the usual manner: the rank is (10 + 1)/2 = 5.5, so Q1 is the average of the 5th and 6th ordered observations, (73 + 78)/2 = 75.5.

Q1 separates the lower 25% from the upper 75% of the data.


Computing Quartiles

49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175

Likewise, there are 10 observations above the median. We can use the same rank we used to find Q1 (5.5), but start counting from the first observation above the overall median: Q3 is the average of the 5th and 6th of these observations, (117 + 121)/2 = 119.

Q3 separates the lower 75% from the top 25% of the data.
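A minimal Python sketch of this "median of each half" approach is shown below. The function name `quartiles_by_halves` is an assumption for illustration, and the function expects at least two observations. Other software often uses different quartile definitions, so its answers may differ slightly from this method.

```python
from statistics import median


def quartiles_by_halves(data):
    """Q1 and Q3 as the medians of the lower and upper halves.

    When n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]     # observations below the overall median
    upper = ordered[-half:]    # observations above the overall median
    return median(lower), median(upper)


values = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
q1, q3 = quartiles_by_halves(values)
print(q1, q3)  # 75.5 119.0
```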


Computing Quartiles

• A brief aside: when the sample size is odd, it will not be the case that *exactly* 50% of the data is below the median or that *exactly* 50% is above it.

• This is because the median itself is not counted as being in either the upper or lower half of the data set.

• For reasonably large data sets, we may say things like “50% of the data is above the median” and “25% of the data is below Q1”, even though in some cases these are approximations.


Computing Quartiles

• Note that for relatively small datasets, you may be able to “eyeball” the data to find the median, Q1, and Q3, rather than using rank.

• For instance, it is not challenging to find the median and quartiles for the snake undulation rate data set of size n=8 from before.

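• Simply order the numbers 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6 from smallest to largest, and you can quickly see where the median and quartiles lie:

0.9 1.2 1.2 1.3 1.4 1.4 1.6 2.0

Median = (1.3 + 1.4)/2 = 1.35, Q1 = (1.2 + 1.2)/2 = 1.2, Q3 = (1.4 + 1.6)/2 = 1.5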


Location Statistics: Extremes

• We are also often interested in the extremes of a data set.

• These extreme values are referred to as the minimum and the maximum. “Extreme” in this context doesn’t necessarily mean “really big” or “really small”. It just means “the biggest” or “the smallest”.


The 5-number summary

• The 5-number summary can be used to summarize a data set.

• It consists of the minimum, Q1, median, Q3, and maximum.

• These are all measures of location
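As a rough illustration, the five values can be collected in one step. The sketch below assumes the "median of each half" definition of Q1 and Q3 used in these notes and uses the 20-observation data set from the quartile slides; the function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return the minimum, Q1, median, Q3, and maximum of a data set."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return {
        "min": ordered[0],
        "Q1": median(ordered[:half]),    # median of the lower half
        "median": median(ordered),
        "Q3": median(ordered[-half:]),   # median of the upper half
        "max": ordered[-1],
    }


values = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
summary = five_number_summary(values)
print(summary)
# {'min': 49, 'Q1': 75.5, 'median': 100.5, 'Q3': 119.0, 'max': 175}
```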


Boxplots and the 5-number summary

• Boxplots graphically illustrate the 5 values in a 5-number summary

• Sometimes boxplots are called “box and whisker plots.”


[Figure: boxplot of height (female); axis ticks at 60, 65, 70, 75]


• Boxplots can be displayed horizontally or vertically.
• The dark line inside the box is the median.
• The edges of the box are Q1 and Q3.
• The whiskers extend to either the min and max, or to the furthest non-outliers.


• Outliers are represented as dots on a boxplot.

• Note: 50% of the data is inside the box, 25% is below the box, and 25% is above the box.


Outliers

• Outliers are data points that are located far away from where the majority of the data lie.

• There is not universal agreement on what the standard should be for classifying an observation as an outlier. It is to some extent subjective.

• Data analysis software packages will have internal standards by which they decide which values should be considered outlying.
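These notes do not commit to a particular standard. As one illustration only, a convention used by many boxplot implementations flags values more than 1.5 × IQR below Q1 or above Q3. The sketch below assumes that convention and the "median of each half" quartiles from earlier; the function names, the multiplier `k`, and the added value 400 are all hypothetical.

```python
from statistics import median


def iqr_fences(data, k=1.5):
    """Lower and upper cutoffs: Q1 - k*IQR and Q3 + k*IQR (k = 1.5 is an assumption)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(data, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = iqr_fences(data, k)
    return [x for x in data if x < low or x > high]


values = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
print(flag_outliers(values))          # []
print(flag_outliers(values + [400]))  # [400] (400 is a made-up extra value)
```

A flagged value is only a candidate for closer inspection; whether it is a mistake or a legitimate observation still has to be checked.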


Outliers

• It’s usually a good idea to look more closely at an outlier to see if it is real or if it is a mistake.

• The outlier might be an improperly entered data value. Data entry is a tedious process and sometimes people make mistakes.

• The outlier might be in different units than the rest of the data. For instance, in the questionnaires from the first day of class, a few students gave their heights in centimeters rather than inches. If these heights had not been converted, then our class dataset would have shown students over 12 feet tall.


Outliers

• Outliers are often real, accurate pieces of data that are simply unusual.

• For instance, most people work 35-40 hours per week. However, a very small number work 70-80 hours per week.

• It is sometimes tempting to remove outliers from a data set, but we must find out first whether or not the outlier is a legitimate observation or a mistake.


Dispersion (Spread)

Here is a good piece of advice: “Do not cross a river if it is, on average, 4 feet deep”

-Nassim Taleb, The Black Swan

Why is this good advice? What additional information would we need before we decide if crossing the river is a good idea?


Dispersion (Spread)

• Information about location (average or median) is not enough to adequately summarize a data set.

• Sometimes the average does not describe any actual member of the data set. For example, the average human being has one ovary and one testicle.

• Information about how your data is dispersed is also useful, and is essential in inferential statistics.

• We don’t just want to know where the center of our data lies; we also want to know how spread out the data is!


The Range

• The range is the easiest measure of dispersion to compute.

• It is the difference between the maximum value and the minimum value.

• One problem with using the range is that it doesn’t tell you whether most of the data is spread out through the whole range, or if the maximum and minimum values are outliers.


The IQR

• The inter-quartile range (Q3 – Q1) is not affected by extreme values since it is calculated using values that lie close to the center of the data set

• We will not use either the range or the IQR when we move on to inferential statistics. But they are still useful as descriptive statistics.
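To make the contrast between the two measures concrete, here is a small Python sketch using the 20-observation data set from the quartile slides. It assumes the "median of each half" quartiles used in these notes; the function name `spread` and the inflated maximum of 1750 are made up for illustration.

```python
from statistics import median


def spread(data):
    """Return (range, IQR) for a data set."""
    ordered = sorted(data)
    half = len(ordered) // 2
    data_range = ordered[-1] - ordered[0]                    # max - min
    iqr = median(ordered[-half:]) - median(ordered[:half])   # Q3 - Q1
    return data_range, iqr


values = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
print(spread(values))                  # (126, 43.5)

# Replace the maximum with a made-up extreme value: the range changes, the IQR does not.
print(spread(values[:-1] + [1750]))    # (1701, 43.5)
```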


Variance

• The variance is another measure of dispersion. It is closely related to the standard deviation, which we will consider shortly.

• Unlike the range or IQR, the variance statistic is computed using all of the data values in a data set.

• It is sensitive to outliers, but the effects of extreme values are “diluted” if there are a large number of observations.


Sum of Squared Deviations

• To compute the variance of a data set we first need a statistic called the sum of squared deviations.

• This is often abbreviated as SS, for “sum of squares”

• To get the squared deviation for a single observation, subtract the mean from this observation, and then square the result.

• Do this for all observations and sum the results. This gives us the sum of squared deviations.

• Mathematically,

$$SS = \sum (x_i - \bar{x})^2$$

Sum of Squared Deviations

• Example: find the sum of squared deviations (SS) for our snake undulation rate data set:

0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6

$$SS = \sum (x_i - \bar{x})^2 = (0.9 - 1.375)^2 + (1.4 - 1.375)^2 + \cdots + (1.6 - 1.375)^2 = 0.735$$
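The arithmetic is tedious by hand, so a short Python sketch of the same steps may help: subtract the mean from each observation, square the result, and sum. The function name `sum_of_squared_deviations` is an assumption for this example.

```python
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]


def sum_of_squared_deviations(data):
    """SS: the sum over all observations of (x_i - mean)^2."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


ss = sum_of_squared_deviations(rates)
print(round(ss, 3))  # 0.735
```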

Sample Variance

• The sample variance is denoted by the symbol $s^2$.

• Mathematically,

$$s^2 = \frac{SS}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

• The English interpretation of a variance is: “The average squared distance that a group of ‘n’ points lies from the mean of the group.”

• This is not a very intuitive concept, though it is very often used in mathematical computations.


Sample Standard Deviation

• The sample standard deviation is simply the square root of the sample variance.

• It is denoted by the letter s

• Continuing with our example, we have:

$$s = \sqrt{s^2} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{0.735}{7}} = \sqrt{0.105} \approx 0.324$$
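For the snake undulation data, the variance and standard deviation can be checked with a short Python sketch. It follows the formulas in these notes step by step, then compares the result with `variance` and `stdev` from Python's standard `statistics` module, which also divide by n − 1. The variable names are chosen for this example.

```python
import math
from statistics import stdev, variance

rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

n = len(rates)
mean = sum(rates) / n
ss = sum((x - mean) ** 2 for x in rates)   # sum of squared deviations
s2 = ss / (n - 1)                          # sample variance
s = math.sqrt(s2)                          # sample standard deviation

print(round(s2, 3), round(s, 3))           # 0.105 0.324

# Cross-check against the standard library.
print(math.isclose(s2, variance(rates)), math.isclose(s, stdev(rates)))  # True True
```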


Interpret the Standard Deviation

• The standard deviation can be thought of roughly as an average distance that a group of points lies from the group mean.

• A large standard deviation tells you that your data is highly dispersed, or spread out.

• In inferential statistics, a large standard deviation signifies high levels of uncertainty regarding statistical inferences.

• Note that what counts as “large” or “small” depends on the magnitude of the data itself.


Shapes of Distributions

• You don’t always need a histogram to get a rough idea of the shape of a distribution. Comparing the mean and the median of your data set is often enough.

[Figure: histogram of Grades (x-axis: Grades, 30 to 110; y-axis: Frequency, 0 to 9). Median = 92, Mean = 86.]



• What is the shape of this distribution?

• Note that the mean is 86, and the median is 92




• Note that the mean is 2.6, and the median is 0.6

[Figure: histogram (x-axis 0 to 14; frequency axis 0 to 10). Mean = 2.6, Median = 0.6.]




• Note that the mean is 102, and the median is 102

[Figure: histogram (x-axis 0 to 180; frequency axis 0 to 30). Mean = 102, Median = 102.]


Mean, Median, & Shape

• If the mean is greater than the median, then the distribution is skewed to the right (a long right tail pulls the mean up).

• If the mean is less than the median, then the distribution is skewed to the left (a long left tail pulls the mean down).

• If the mean and median are (approximately) equal, then the distribution is (approximately) symmetric.
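These three rules translate directly into a small Python sketch, applied here to the mean and median pairs from the three histograms. The function name `describe_shape` and the `tolerance` parameter (how close counts as "approximately equal") are assumptions for illustration; the notes do not specify a cutoff.

```python
def describe_shape(mean, median, tolerance=0.0):
    """Rule-of-thumb shape from comparing the mean with the median."""
    if abs(mean - median) <= tolerance:
        return "approximately symmetric"
    if mean > median:
        return "skewed right"
    return "skewed left"


print(describe_shape(86, 92))     # skewed left
print(describe_shape(2.6, 0.6))   # skewed right
print(describe_shape(102, 102))   # approximately symmetric
```

In practice a sensible tolerance depends on the scale and spread of the data, so the default of 0.0 is only a placeholder.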


Conclusion

• A statistic is any number calculated from a set of data. Descriptive statistics are numbers that are used to describe important features of a data set.

• The mean and median are very commonly used statistics which refer to location

• The standard deviation is a very commonly used statistic which refers to dispersion.

• In the next set of notes, we will look at probability and the normal distribution, which will lay the groundwork for understanding inferential statistics.
