Download - Descriptive Statistics

3. DESCRIPTIVE STATISTICS

Statistics Torsten Jochem

Content

• 1. (Example) Research Question
• 2. Mean
• 3. Median
• 4. Mode
• 5. Standard Deviation and Variance
• 6. Percentiles and Quartiles
• 7. IQR and Outlier Rule
• 8. The “Five-Number Summary”
• 9. Which Descriptive Stats to use?

3. Descriptive Statistics


• Descriptive statistics deals with summarizing an existing data set to give the reader a “quick feel” for the data.

• Specifically, we will look in this section at the following statistics:

   Measures of central tendency:
   • Mean
   • Median
   • Mode

   Measures of spread/variability:
   • Variance & Standard Deviation
   • Percentiles & Quartiles


1. Research Question

For this section, we investigate the following “research question”:

• What is the popularity of David Hasselhoff in the U.S.?
(scale 0-10, where 0 = hate him and 10 = love him)

Hence, the population of the study is the complete U.S. population (~300 million). We decide to survey a sample, as we cannot conduct 300 million interviews.


1. Research Question (cont’d)

Assume we had done a survey with random sampling and obtained the following sample:

ratings: x = {4, 2, 6, 5, 3, 4, 7, 5, 3, 4}

Note that a sample size of N = 10 is too small given a population size of some 300 million. More on this later.


2. Mean (μ, x̄)

The mean is simply the average:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

• Note: the letter μ is used for the population mean, while x̄ is used for the sample mean.

• In our sample the mean is…

x̄ = (4+2+6+5+3+4+7+5+3+4)/10 = 4.3

Note: The mean can be misleading as it is influenced by outliers!
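As a quick check of the arithmetic, the sample mean can be computed in a few lines of Python. This is only a sketch: the variable names are illustrative, and the second data set (with one rating replaced by 100) is made up purely to show how a single outlier pulls the mean; such a value could not occur on the 0-10 scale.

```python
# Sample of ratings from the survey example.
ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

# Mean = sum of all values divided by the number of values.
sample_mean = sum(ratings) / len(ratings)
print(sample_mean)  # 4.3

# Hypothetical data: replace the last rating (4) with an extreme value
# to illustrate the effect of an outlier on the mean.
ratings_with_outlier = ratings[:-1] + [100]
mean_with_outlier = sum(ratings_with_outlier) / len(ratings_with_outlier)
print(mean_with_outlier)  # 13.9
```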


3. Median

The median is the value at which half of the sample lies above and half of the sample lies below. (If there is an even number of observations, add the two values in the middle and divide the sum by 2.)

Step 1: rank data set → x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}

Step 2: even number of observations → find avg. of the 2 middle values: (4+4)/2 = 4

Note: the median is not influenced by outliers!
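The two steps can be sketched in Python as follows. The function name `median_by_hand` is illustrative, and the data set with a rating of 100 is hypothetical; it is only there to show that the median stays put when one value becomes extreme.

```python
def median_by_hand(values):
    """Rank the data, then take the middle value (or the average of the two middle values)."""
    ordered = sorted(values)          # Step 1: rank the data set
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd number of observations
        return ordered[mid]
    # Step 2: even number of observations -> average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]
sample_median = median_by_hand(ratings)
print(sample_median)  # 4.0

# Hypothetical data: the largest rating (7) replaced by an extreme value.
ratings_with_outlier = [100 if r == 7 else r for r in ratings]
median_with_outlier = median_by_hand(ratings_with_outlier)
print(median_with_outlier)  # 4.0
```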


4. Mode

• The mode is the value that appears most often.

• It is not uncommon to have bimodal or multimodal distributions.

• In our example

Step 1: rank data set → x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}

Step 2: mode = 4 (appears 3 times)

Note: the mode is not influenced by outliers!
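A minimal Python sketch of finding the mode, using the standard library's `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest count. The second list, `two_peaks`, is made-up data to illustrate a bimodal case.

```python
from collections import Counter
from statistics import multimode

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

# Count how often each rating appears, then pick the most frequent one.
counts = Counter(ratings)
sample_mode, occurrences = counts.most_common(1)[0]
print(sample_mode, occurrences)  # 4 3

# Hypothetical bimodal data: two values share the highest count.
two_peaks = [1, 1, 2, 3, 3]
modes = multimode(two_peaks)
print(modes)  # [1, 3]
```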


5. Standard Deviation (σ, s) and Variance (σ², s²)

Measures the spread/variability of the data. The standard deviation can be understood, roughly, as the typical distance between a value and the mean.

Std dev of the population:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Std dev of the sample:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$

Note the “N-1” instead of “N”. The reason is complex, but essentially it is there so that the variance estimate s² is (on average) not biased. (You will likely hear more on this in intermediate stats.)



In our example, we don’t have data on the whole population, only on a sample of it. Hence, we can only compute the sample std. dev.:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$

std dev (x) ≈ 1.5
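The sample standard deviation of the ratings can be reproduced step by step in Python. This is a sketch with illustrative variable names; it follows the sample formula (dividing by N-1) and uses the sample mean in place of the unknown population mean.

```python
import math

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]
n = len(ratings)

# Sample mean (4.3 for this data set).
x_bar = sum(ratings) / n

# Squared distance of every rating from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in ratings]

# Sample std dev: divide by N-1 (not N), then take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))
print(round(s, 2))  # 1.49
```

Rounded to one decimal place this is the 1.5 quoted above.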



Note that the standard deviation is essential to know. The two samples {49, 50, 51} and {0, 50, 100} have the same mean and the same median, but the samples are very different.

• Without some measure of spread, a data summary is incomplete.
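The two samples can be compared directly with Python's standard `statistics` module. A short sketch (the names `narrow` and `wide` are just labels for the two samples):

```python
from statistics import mean, median, stdev

narrow = [49, 50, 51]
wide = [0, 50, 100]

# Same center ...
print(mean(narrow), mean(wide))      # 50 50
print(median(narrow), median(wide))  # 50 50

# ... but very different spread (sample std dev, N-1 in the denominator).
narrow_sd = stdev(narrow)
wide_sd = stdev(wide)
print(narrow_sd, wide_sd)            # 1.0 50.0
```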



The variance is simply the squared standard deviation.

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample variance:

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$
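As a worked example, here is the sample variance for the ratings x = {4, 2, 6, 5, 3, 4, 7, 5, 3, 4} with sample mean 4.3 (written x̄ below), grouping equal ratings together:

$$\sum_{i=1}^{10} (x_i - \bar{x})^2 = 3(-0.3)^2 + (-2.3)^2 + (1.7)^2 + 2(0.7)^2 + 2(-1.3)^2 + (2.7)^2 = 20.1$$

$$s^2 = \frac{20.1}{10 - 1} \approx 2.23, \qquad s = \sqrt{s^2} \approx 1.49$$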


6. Percentiles & Quartiles

Measures of relative standing. A percentile is the percentage of observations that fall below a specific observation.

• Example: if your SAT score is in the 90th percentile, then you did better than 90% of the other students.

• The median is the 50th percentile (center value of the ordered list).

• To find the kth percentile (say, k = 90):

   • Step 1: rank the sample. x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}

   • Step 2: multiply k/100 by N. (Here: 0.9*10 = 9)

   • Step 3: count to the “k/100 times N”th value (9th value) in the ranked sample → 6 is the 90th percentile in our data set.
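The three steps can be sketched in Python as below. The function name `percentile_by_rank` is illustrative. The steps do not say what to do when k/100 times N is not a whole number; rounding up to the next position is an assumption made here. Statistical libraries typically use other conventions (often interpolating between neighboring values), so their results can differ slightly from this counting rule.

```python
def percentile_by_rank(values, k):
    """kth percentile (k an integer from 1 to 100) by counting into the ranked sample."""
    ordered = sorted(values)             # Step 1: rank the sample
    n = len(ordered)
    # Step 2: position = k/100 * N, rounded up to a whole position (assumption).
    # Integer arithmetic avoids floating-point surprises such as 0.9 * 10 != 9.
    position = (k * n + 99) // 100
    position = max(position, 1)          # guard for very small k
    return ordered[position - 1]         # Step 3: count to that value (1-based)

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]
p90 = percentile_by_rank(ratings, 90)
print(p90)  # 6
```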


6. Percentiles and Quartiles (cont’d)

• The 1st quartile (Q1) is the 25th percentile.

• The 3rd quartile (Q3) is the 75th percentile.

• You can get Q1 and Q3 by taking the median of the first half and of the second half of the ranked sample:

{2, 3, 3, 4, 4 | 4, 5, 5, 6, 7}

Median: (4+4)/2 = 4

Q1: 3    Q3: 5
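A minimal Python sketch of the “median of each half” approach. The function name is illustrative, and for an odd number of observations the middle value is left out of both halves; that is one common convention and an assumption here, since the example only covers an even sample size.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper half of the ranked sample."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle value so both halves have equal size (assumption).
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(upper_half)

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]
q1, q3 = quartiles_by_halves(ratings)
print(q1, q3)  # 3 5
```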


7. The IQR and the Outlier-Rule

• Interquartile Range (IQR) = Q3 - Q1

• Outlier Rule

   • Any values above the upper bound (Q3 + 1.5*IQR) or below the lower bound (Q1 - 1.5*IQR) are called “suspected outliers.”

• In our previous example (last slide):

   • Upper bound = 5 + 1.5*(5-3) = 8

   • Lower bound = 3 - 1.5*(5-3) = 0

   → no suspected outliers in our sample.
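The outlier rule translates into a short Python sketch. The function name `suspected_outliers` is illustrative; Q1 = 3 and Q3 = 5 are taken from the example, and the extra rating of 100 in the last lines is hypothetical, added only to show a value being flagged.

```python
def suspected_outliers(values, q1, q3):
    """Return (lower bound, upper bound, values outside the bounds) per the 1.5*IQR rule."""
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    flagged = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, flagged

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]
lower, upper, flagged = suspected_outliers(ratings, q1=3, q3=5)
print(lower, upper, flagged)  # 0.0 8.0 []

# Hypothetical: one extreme value added, same quartiles reused for illustration.
_, _, flagged_with_extreme = suspected_outliers(ratings + [100], q1=3, q3=5)
print(flagged_with_extreme)  # [100]
```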


8. Five-Number Summary

• Minimum, Q1, Median, Q3, Maximum

• Preferred to the mean and std. dev./variance if the distribution is skewed or contains strong outliers.
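Putting the pieces together, a five-number summary for the ratings can be sketched in Python. The function name is illustrative, and the quartiles are computed as medians of the lower and upper half of the ranked sample (for odd sample sizes the middle value is excluded from both halves, which is an assumption).

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the sample."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, the middle value belongs to neither half (assumption).
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]
summary = five_number_summary(ratings)
print(summary)  # (2, 3, 4.0, 5, 7)
```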


9. Which Descriptive Stats to use?

• If the distribution is symmetric with no strong outliers
   → Mean, Std.Dev/Variance

• If the distribution is asymmetric and/or has strong outliers
   → 5-Number Summary

Any Questions?
