descriptive statistics


Upload: phenomenon

Post on 07-Dec-2015


Page 1: Descriptive Statistics

3. DESCRIPTIVE STATISTICS

Statistics Torsten Jochem

Content

• 1. (Example) Research Question

• 2. Mean

• 3. Median

• 4. Mode

• 5. Standard Deviation and Variance

• 6. Percentiles and Quartiles

• 7. IQR and Outlier Rule

• 8. The “Five-Number Summary”

• 9. Which Descriptive Stats to use?


3. Descriptive Statistics

• Descriptive statistics summarize an existing data set to give the reader a “quick feel” for the data.

• Specifically, this section looks at the following statistics:

Measures of central tendency:

• Mean

• Median

• Mode

Measures of spread/variability:

• Variance & Standard Deviation

• Percentiles & Quartiles


1. Research Question

For this section, we investigate the following “research question”:

→ What is the popularity of David Hasselhoff in the U.S.?

(scale 0-10, where 0 = hate him and 10 = love him)

Hence, the population of the study is the complete U.S. population (~300 million). We decide to run a survey, as we cannot conduct 300 million interviews.

Page 2: Descriptive Statistics

1. Research Question (cont’d)

Assume we ran a survey with random sampling and obtained the following sample:

ratings: x = {4, 2, 6, 5, 3, 4, 7, 5, 3, 4}

Note that a sample size of N=10 is too small to draw reliable conclusions about a population of some 300 million. More on this later.


2. Mean (μ, x̄)

The mean is simply the average:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

• Note: the letter μ is used for the population mean, while x̄ is used for the sample mean.

• In our sample the mean is:

x̄ = (4+2+6+5+3+4+7+5+3+4)/10 = 43/10 = 4.3

Note: The mean can be misleading, as it is influenced by outliers!
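As a quick check of the arithmetic for the sample mean, here is a minimal Python sketch. The variable names and the "70 instead of 7" data-entry error are illustrative assumptions, not part of the slides; the error is only there to show how one extreme value pulls the mean.

```python
from statistics import mean

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

# Sample mean: sum of all ratings divided by the sample size N.
sample_mean = sum(ratings) / len(ratings)
print(sample_mean)  # 4.3

# Hypothetical data-entry error (assumption): the rating 7 is recorded as 70.
with_outlier = [70 if x == 7 else x for x in ratings]
outlier_mean = mean(with_outlier)
print(outlier_mean)  # 10.6
```

A single bad value moves the mean from 4.3 to 10.6, which is above every genuine rating in the sample.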


3. Median

The median is the value at which half of the sample lies above and half lies below. (If there is an even number of observations, add the two middle values and divide by 2.)

Step 1: rank the data set → x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}

Step 2: even number of observations → find the average of the 2 middle values: (4+4)/2 = 4

Note: the median is not influenced by outliers!
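The two median steps (rank, then take the middle value or the average of the two middle values) can be sketched in Python as follows. The function name `median_of` and the outlier value 70 are assumptions made for illustration.

```python
ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

def median_of(values):
    """Median via the two steps: rank, then pick or average the middle."""
    ranked = sorted(values)          # Step 1: rank the data set
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd N: the single middle value
        return ranked[mid]
    # Step 2 (even N): average of the two middle values
    return (ranked[mid - 1] + ranked[mid]) / 2

sample_median = median_of(ratings)
print(sample_median)  # 4.0

# Hypothetical outlier (assumption): 7 recorded as 70. The median does not move.
outlier_median = median_of([70 if x == 7 else x for x in ratings])
print(outlier_median)  # 4.0
```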


4. Mode

• The mode is the value that appears most often.

• It is not uncommon to have bimodal or multimodal distributions.

• In our example:

Step 1: rank the data set → x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}

Step 2: mode = 4 (appears 3 times)

Note: the mode is not influenced by outliers!
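A short Python sketch of the mode, using `statistics.multimode` from the standard library (Python 3.8 or later) so that bimodal samples are handled too. The `bimodal` list is a made-up example, not data from the slides.

```python
from statistics import multimode

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

# multimode returns every value that ties for the highest count.
modes = multimode(ratings)
print(modes)  # [4]

# Hypothetical bimodal sample (assumption): 1 and 2 both appear twice.
bimodal = [1, 1, 2, 2, 3]
bimodal_modes = multimode(bimodal)
print(bimodal_modes)  # [1, 2]
```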

Page 3: Descriptive Statistics


5. Standard Deviation (σ, s) and Variance (σ², s²)

The standard deviation measures the spread/variability of the data. It can be understood, roughly, as the typical distance between a value and the mean.

Std dev of the population:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Std dev of the sample:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}}$$

Note the “N-1” instead of “N” in the sample formula. The reason is complex, but essentially it makes the sample variance s² an unbiased estimate (on average) of the population variance. (You will likely hear more on this in intermediate stats.)


In our example, we don’t have data on the whole population, only on a sample of it. Hence, we can only compute the sample std. dev.:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}} = \sqrt{\frac{20.1}{9}} \approx 1.5$$

std dev (x) ≈ 1.5
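To check the value of about 1.5, the sketch below computes the sample standard deviation by hand (dividing by N-1) and compares it with `statistics.stdev`. The population version `statistics.pstdev` (dividing by N) is shown only for contrast; treating our 10 ratings as a whole population is an assumption made purely for illustration.

```python
import math
from statistics import pstdev, stdev

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

n = len(ratings)
x_bar = sum(ratings) / n                                  # sample mean, 4.3
sum_sq_dev = sum((x - x_bar) ** 2 for x in ratings)       # about 20.1

# Sample standard deviation: divide by N-1, then take the square root.
s_manual = math.sqrt(sum_sq_dev / (n - 1))
print(round(s_manual, 2))         # 1.49

# Library equivalents: stdev divides by N-1, pstdev divides by N.
print(round(stdev(ratings), 2))   # 1.49
print(round(pstdev(ratings), 2))  # 1.42
```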


Note that the standard deviation is essential to know. The 2 samples {49, 50, 51} and {0, 50, 100} have the same mean and the same median, but the samples are very different.

→ Without some measure of spread, a data summary is incomplete.
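The comparison of {49, 50, 51} and {0, 50, 100} can be verified with a few lines of Python. The names `narrow` and `wide` are illustrative labels, not terms from the slides.

```python
from statistics import mean, median, stdev

narrow = [49, 50, 51]
wide = [0, 50, 100]

# Same centre, very different spread.
for name, sample in (("narrow", narrow), ("wide", wide)):
    print(name, mean(sample), median(sample), stdev(sample))
```

Both samples report a mean and median of 50, while the sample standard deviations are 1 and 50.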


The variance is simply the squared standard deviation.

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample variance:

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}$$
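A minimal Python sketch, assuming the ratings sample from the research question, confirms that the variance is the square of the standard deviation. `statistics.variance` divides by N-1 and `statistics.pvariance` divides by N.

```python
import math
from statistics import pvariance, stdev, variance

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

sample_var = variance(ratings)       # sum of squared deviations / (N-1)
population_var = pvariance(ratings)  # sum of squared deviations / N

print(round(sample_var, 2))       # 2.23
print(round(population_var, 2))   # 2.01

# The variance equals the squared standard deviation (up to rounding error).
print(math.isclose(sample_var, stdev(ratings) ** 2))  # True
```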

Page 4: Descriptive Statistics


6. Percentiles & Quartiles

A percentile is a measure of relative standing: the percentage of observations that fall below a specific observation.

• Example: if your SAT score is in the 90th percentile, then you did better than 90% of the other students.

• The median is the 50th percentile (center value of the ordered list).

• To find the kth percentile (say, k = 90):

• Step 1: rank the sample. x = {2, 3, 3, 4, 4, 4, 5, 5, 6, 7}

• Step 2: multiply k (as a fraction) by N. (Here: 0.9*10 = 9)

• Step 3: count to the “k times N”th value (9th value) in the ranked sample → 6 is the 90th percentile in our data set.
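The three percentile steps can be sketched in Python as below. The slides do not say what to do when k times N is not a whole number; rounding the position up is an assumption of this sketch, and the function name `percentile` is hypothetical. Statistical libraries often interpolate between values instead, so their results may differ slightly.

```python
import math

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

def percentile(values, k):
    """kth percentile by the rank-and-count rule (k between 0 and 100)."""
    ranked = sorted(values)                    # Step 1: rank the sample
    n = len(ranked)
    # Step 2: k (as a fraction) times N; rounding up is an assumption.
    position = max(1, math.ceil(k * n / 100))
    return ranked[position - 1]                # Step 3: count to that value

p90 = percentile(ratings, 90)
print(p90)  # 6
```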


6. Percentiles and Quartiles (cont’d)

• The 1st quartile (Q1) is the 25th percentile.

• The 3rd quartile (Q3) is the 75th percentile.

• You can get Q1 and Q3 by taking the median of the first half and the second half of the sample:

{2, 3, 3, 4, 4 | 4, 5, 5, 6, 7}

Median: (4+4)/2 = 4

Q1: 3 (median of the first half)

Q3: 5 (median of the second half)
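The "median of each half" method for Q1 and Q3 might look like this in Python. The function name `quartiles` is hypothetical, and leaving the middle value out of both halves when N is odd is an assumption, since the slides only cover an even N.

```python
from statistics import median

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    ranked = sorted(values)
    n = len(ranked)
    lower_half = ranked[: n // 2]
    # For odd N the middle value is excluded from both halves (assumption).
    upper_half = ranked[(n + 1) // 2 :]
    return median(lower_half), median(upper_half)

q1, q3 = quartiles(ratings)
print(q1, q3)  # 3 5
```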


7. The IQR and the Outlier-Rule

• Interquartile Range (IQR) = Q3 – Q1

• Outlier Rule: any values above the upper bound (Q3 + 1.5*IQR) or below the lower bound (Q1 – 1.5*IQR) are called “suspected outliers.”

• In our previous example (last slide):

• Upper bound = 5 + 1.5 (5-3) = 8

• Lower bound = 3 – 1.5 (5-3) = 0

→ no suspected outliers in our sample.
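The outlier rule is easy to check in Python. This sketch takes Q1 = 3 and Q3 = 5 from the quartile example as given; the variable names are illustrative.

```python
ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

q1, q3 = 3, 5                      # quartiles from the previous example
iqr = q3 - q1                      # interquartile range

upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr
print(lower_bound, upper_bound)    # 0.0 8.0

# Values strictly outside the bounds are "suspected outliers".
suspected = [x for x in ratings if x < lower_bound or x > upper_bound]
print(suspected)                   # []
```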


8. Five-Number Summary

• Minimum, Q1, Median, Q3, Maximum

• Preferred to Mean & Std. Dev/Variance if the distribution is skewed or contains outliers.
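Putting the pieces together, a five-number summary for the ratings sample could be computed as sketched below. The function name `five_number_summary` is hypothetical, and the quartiles use the "median of each half" method, with the middle value excluded from both halves for odd N (an assumption).

```python
from statistics import median

ratings = [4, 2, 6, 5, 3, 4, 7, 5, 3, 4]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ranked = sorted(values)
    n = len(ranked)
    q1 = median(ranked[: n // 2])          # median of the lower half
    q3 = median(ranked[(n + 1) // 2 :])    # median of the upper half
    return ranked[0], q1, median(ranked), q3, ranked[-1]

summary = five_number_summary(ratings)
print(summary)  # (2, 3, 4.0, 5, 7)
```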

Page 5: Descriptive Statistics


9. Which Descriptive Stats to use?

• If the distribution is symmetric with no strong outliers

→ Mean, Std. Dev/Variance

• If the distribution is asymmetric and/or has strong outliers

→ 5-Number Summary

Any Questions?
