3. DESCRIPTIVE STATISTICS
Statistics Torsten Jochem
Content
• 1. (Example) Research Question
• 2. Mean
• 3. Median
• 4. Mode
• 5. Standard Deviation and Variance
• 6. Percentiles and Quartiles
• 7. IQR and Outlier Rule
• 8. The “Five-Number Summary”
• 9. Which Descriptive Stats to use?
3. Descriptive Statistics
• Descriptive statistics summarize an existing data set to give the reader a “quick feel” for the data.
• Specifically, we will look in this section at the following statistics:
  • Measures of central tendency: Mean, Median, Mode
  • Measures of spread/variability: Variance & Standard Deviation, Percentiles & Quartiles
1. Research Question
For this section, we investigate the following “research question”:
• What is the popularity of David Hasselhoff in the U.S.?
(scale 0-10, where 0 = hate him and 10 = love him)
Hence, the population of the study is the complete U.S. population (~300 million). Because we cannot conduct 300 million interviews, we decide to do a survey.
1. Research Question (cont’d)
Assume we ran a survey with random sampling and obtained the following sample:
ratings: x = {4, 2, 6, 5, 3, 4, 7, 5, 3, 4}
Note that a sample size of N = 10 is too small to draw reliable conclusions about a population of some 300 million. More on this later.
2. Mean (μ, x̄)
The mean is simply the average:
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
• Note: the letter μ is used for the population mean, while x̄ is used for the sample mean.
• In our sample the mean is:
x̄ = (4+2+6+5+3+4+7+5+3+4)/10 = 4.3
Note: The mean can be misleading because it is influenced by outliers!
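As a minimal sketch, the mean of the survey ratings can be computed in a few lines of Python. The helper name `mean` and the extra rating of 100 are illustrative assumptions, not part of the original sample; the extra value only shows how one extreme observation shifts the result.

```python
# Sample ratings from the survey example (scale 0-10).
ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


sample_mean = mean(ratings)
print(sample_mean)  # 4.3

# Hypothetical illustration only: a single extreme value (outside the
# 0-10 scale, not in the original data) pulls the mean far upward.
with_outlier = ratings + [100]
print(mean(with_outlier))  # 13.0
```

One added value moves the mean from 4.3 to 13.0, which is why the mean alone can mislead.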
3. Median
The median is the value at which half of the sample lies above and half of the sample lies below. (If there is an even number of observations, add the two middle values and divide by 2.)
Step 1: rank the data set → x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}
Step 2: even number of observations → find the average of the two middle values: (4+4)/2 = 4
Note: the median is not influenced by outliers!
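The two steps above can be sketched in Python as follows. The function name `median` and the extra rating of 100 are assumptions made for illustration; the extra value is not part of the survey data.

```python
ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def median(values):
    """Middle value of the ranked data (average of the two middle
    values when the number of observations is even)."""
    ordered = sorted(values)          # Step 1: rank the data set
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    # Step 2 (even count): average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(ratings))          # 4.0
# Hypothetical extreme value: the median does not move.
print(median(ratings + [100]))  # 4
```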
4. Mode
• The mode is the value that appears most often.
• It is not uncommon to have bimodal or multimodal distributions.
• In our example:
Step 1: rank the data set → x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}
Step 2: mode = 4 (appears 3 times)
Note: the mode is not influenced by outliers!
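A short sketch of finding the mode by counting occurrences. The function name `modes` is an assumption; it returns a list so that bimodal or multimodal data can be reported, and the second data set is made up purely to show that case.

```python
from collections import Counter

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(ratings))          # [4]
# Made-up bimodal data: 1 and 3 both appear twice.
print(modes([1, 1, 2, 3, 3]))  # [1, 3]
```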
5. Standard Deviation (σ, s) and Variance (σ², s²)
Measures the spread/variability of the data. The standard deviation can be understood as roughly the typical distance between a value and the mean.
Std dev of the population:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Std dev of the sample (using the sample mean x̄ in place of μ):
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
Note the “N-1” instead of “N”. The reason is complex, but essentially it is so that the sample variance s² is (on average) an unbiased estimate of the population variance σ². (You will likely hear more on this in intermediate stats.)
In our example, we do not have data on the whole population, only on a sample of it. Hence, we can only compute the sample std. dev., here with x̄ = 4.3 and N = 10:
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{20.1}{9}} \approx 1.5$
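The worked value can be reproduced with a small sketch. The helper names are assumptions chosen for this example; the population version is shown only to contrast dividing by N with dividing by N-1, since the ratings are a sample and not a population.

```python
import math
import statistics

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def sum_squared_deviations(values):
    """Sum of squared distances between each value and the mean."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values)


def population_std_dev(values):
    """Divide by N: appropriate when the data are the whole population."""
    return math.sqrt(sum_squared_deviations(values) / len(values))


def sample_std_dev(values):
    """Divide by N - 1: appropriate when the data are a sample."""
    return math.sqrt(sum_squared_deviations(values) / (len(values) - 1))


s = sample_std_dev(ratings)
print(round(s, 2))                            # 1.49
print(round(population_std_dev(ratings), 2))  # 1.42

# Cross-check against the standard library, which also divides by N - 1.
print(math.isclose(s, statistics.stdev(ratings)))  # True
```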
Note that the standard deviation is essential to know. The two samples {49, 50, 51} and {0, 50, 100} have the same mean and the same median, but the samples are very different.
→ Without some measure of spread, a data summary is incomplete.
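A quick check of the two samples, sketched with Python's standard `statistics` module; the variable names are arbitrary.

```python
import statistics

narrow = [49, 50, 51]
wide = [0, 50, 100]

for sample in (narrow, wide):
    # Mean and median are 50 for both samples.
    print(statistics.mean(sample), statistics.median(sample))

# The sample standard deviations differ by a factor of 50.
narrow_sd = statistics.stdev(narrow)
wide_sd = statistics.stdev(wide)
print(round(narrow_sd, 1), round(wide_sd, 1))  # 1.0 50.0
```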
The variance is simply the squared standard deviation.
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance (using the sample mean x̄ in place of μ):
$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$
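For the survey ratings, both variances can be obtained from Python's standard `statistics` module, as in this brief sketch. Treating the ratings as a population in the second call is for comparison only.

```python
import math
import statistics

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

sample_variance = statistics.variance(ratings)       # divides by N - 1
population_variance = statistics.pvariance(ratings)  # divides by N

print(round(sample_variance, 2))      # 2.23
print(round(population_variance, 2))  # 2.01

# The variance equals the squared standard deviation.
print(math.isclose(statistics.stdev(ratings) ** 2, sample_variance))  # True
```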
6. Percentiles & Quartiles
Percentiles are measures of relative standing: the percentile of an observation is the percentage of observations that fall below it.
• Example: if your SAT score is in the 90th percentile, then you did better than 90% of the other students.
• The median is the 50th percentile (center value of the ordered list).
• To find the kth percentile (say, k = 90):
  • Step 1: rank the sample. x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}
  • Step 2: multiply k/100 by N. (Here: 0.9*10 = 9)
  • Step 3: count to that position (here the 9th value) in the ranked sample → 6 is the 90th percentile in our data set.
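The three steps can be sketched as a small function. Two choices here are assumptions not stated in the slides: the name `percentile`, and rounding the position up when (k/100)*N is not a whole number. Statistical packages often interpolate between neighboring values instead, so their results can differ from this counting rule.

```python
import math

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def percentile(values, k):
    """k-th percentile by counting to position (k/100)*N in the ranked sample."""
    ordered = sorted(values)                   # Step 1: rank the sample
    n = len(ordered)
    position = math.ceil(k * n / 100)          # Step 2 (rounding up is an assumption)
    position = max(position, 1)                # guard for k = 0
    return ordered[position - 1]               # Step 3: count to that position


print(percentile(ratings, 90))  # 6
```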
6. Percentiles and Quartiles (cont’d)
• The 1st quartile (Q1) is the 25th percentile.
• The 3rd quartile (Q3) is the 75th percentile.
• You can get Q1 and Q3 by taking the median of the first half and of the second half of the ranked sample:
{2, 3, 3, 4, 4, 4, 5, 5, 6, 7}
Median: (4+4)/2 = 4
First half {2, 3, 3, 4, 4} → Q1: 3
Second half {4, 5, 5, 6, 7} → Q3: 5
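A sketch of the "median of each half" approach. The function name is an assumption, and so is the handling of an odd number of observations (the middle value is left out of both halves); the slides only show the even case.

```python
import statistics

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of each half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, skip the middle value in both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


q1, med, q3 = quartiles_by_halves(ratings)
print(q1, med, q3)  # 3 4.0 5
```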
7. The IQR and the Outlier-Rule
• Interquartile Range (IQR) = Q3 – Q1
• Outlier Rule
  • Any values above the upper bound (Q3 + 1.5*IQR) or below the lower bound (Q1 – 1.5*IQR) are called “suspected outliers.”
• In our previous example (last slide):
  • Upper bound = 5 + 1.5*(5-3) = 8
  • Lower bound = 3 – 1.5*(5-3) = 0
  → No suspected outliers in our sample (all ratings lie between 2 and 7).
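The outlier rule can be written as a short function. This is a minimal sketch: the function name is an assumption, and Q1 and Q3 are passed in directly (here the values 3 and 5 from the example) so the code does not depend on any particular quartile method.

```python
ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def suspected_outliers(values, q1, q3):
    """Apply the 1.5*IQR rule; return (lower bound, upper bound, flagged values)."""
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    flagged = [x for x in values if x < lower_bound or x > upper_bound]
    return lower_bound, upper_bound, flagged


lower, upper, flagged = suspected_outliers(ratings, q1=3, q3=5)
print(lower, upper, flagged)  # 0.0 8.0 []
```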
8. Five-Number Summary
• Minimum, Q1, Median, Q3, Maximum
• Preferred to Mean & Std. Dev./Variance if the distribution is skewed by outliers.
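For the survey ratings, the five numbers can be collected as in the sketch below. The function name is an assumption, and the quartiles use the "median of each half" approach, with the middle value excluded from both halves when the count is odd (also an assumption).

```python
import statistics

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, skip the middle value in both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


summary = five_number_summary(ratings)
print(summary)  # (2, 3, 4.0, 5, 7)
```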
9. Which Descriptive Stats to use?
• If the distribution is symmetric with no strong outliers
  → Mean, Std. Dev./Variance
• If the distribution is asymmetric and/or has strong outliers
  → 5-Number Summary
Any Questions?