Descriptive Statistics for Numeric Variables Types of Measures: measures of location measures of spread measures of relative spreading measures of shape

Upload: johnathan-dennis

Post on 13-Dec-2015

TRANSCRIPT

Page 1: Descriptive Statistics for Numeric Variables Types of Measures: measures of location measures of spread measures of relative spreading measures of shape

Descriptive Statistics for Numeric Variables

Types of Measures:

• measures of location

• measures of spread

• measures of relative spreading

• measures of shape

Page 2:

What to describe?

• What is the “location” or “center” of the data? (“measures of location”)

• How do the data vary? (“measures of variability”)

Page 3:

Measures of Location

• Mean

• Median

• Mode

Page 4:

Mean

• Another name for average.

• If describing a population, denoted as $\mu$, the Greek letter “mu”. (PARAMETER)

• If describing a sample, denoted as $\bar{x}$, called “x-bar”. (STATISTIC)

• Appropriate for describing measurement data.

• Seriously affected by unusual values called “outliers”.


Page 5:

Calculating Sample Mean

Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

That is, add up all of the data points and divide by the number of data points.

Data (# ER arrivals in 1 hr): 2 8 3 4 1

Sample Mean = (2+8+3+4+1)/5 = 3.6 arrivals

Page 6:

Median

• Another name for 50th percentile.

• Appropriate for describing measurement data.

• “Robust to outliers,” that is, not affected much by unusual values.

Page 7:

Calculating Sample Median

Order data from smallest to largest.

If odd number of data points, the median is the middle value.

Data (# ER arrivals in 1 hr.): 2 8 3 4 1

Ordered Data: 1 2 3 4 8

Median = 3

Page 8:

Calculating Sample Median

Order data from smallest to largest.

If even number of data points, the median is the average of the two middle values.

Data (# ER arrivals in 1 hr.): 2 8 3 4 1 8

Ordered Data: 1 2 3 4 8 8

Median = (3+4)/2 = 3.5
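The two hand calculations above can be checked with a short sketch. It uses Python's standard `statistics` module (Python is an assumption here; the slides themselves use JMP), and the variable names are illustrative only.

```python
from statistics import mean, median

# ER arrivals per hour, taken from the slides
arrivals_odd = [2, 8, 3, 4, 1]
arrivals_even = [2, 8, 3, 4, 1, 8]

sample_mean = mean(arrivals_odd)      # (2+8+3+4+1)/5
median_odd = median(arrivals_odd)     # middle value of 1 2 3 4 8
median_even = median(arrivals_even)   # average of the two middle values, 3 and 4

print(sample_mean, median_odd, median_even)  # 3.6 3 3.5
```

`median` sorts the data internally, so the values do not need to be ordered first.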

Page 9:

Mode

• The value that occurs most frequently.

• One data set can have many modes.

• Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
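As a rough illustration of the points above, the sketch below finds the mode(s) of the ER arrivals data and of a small categorical list. `statistics.multimode` (Python 3.8+) returns every value tied for most frequent; the `blood_types` list is hypothetical and is not from the slides.

```python
from statistics import multimode

arrivals = [2, 8, 3, 4, 1, 8]             # from the slides: 8 occurs twice
blood_types = ["A", "O", "B", "O", "A"]   # hypothetical categorical data

arrival_modes = multimode(arrivals)
blood_type_modes = multimode(blood_types)

print(arrival_modes)      # [8]
print(blood_type_modes)   # ['A', 'O'] -> one data set, two modes
```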

Page 10:

In JMP: Heart Attack Data

• Select Analyze > Distribution (JMP Demo)

Page 11:

In JMP: Heart Attack Data

Sample size n = 45 (don’t use N)

Page 12:

The most appropriate measure of location depends on …

the shape of the data’s distribution. e.g.

Page 13:

Most appropriate measure of location

• Depends on whether or not data are “symmetric” or “skewed”.

• Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

Page 14:

Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

Page 15:

Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

The mean and the median are approximately the same as this distribution is nearly symmetric.

Slight right skewness – see measures of shape.

Page 16:

Heights of College Students - Symmetric and Bimodal

Page 17:

Heights of College Students - Symmetric and Bimodal

Variable   n     Mean     Median   StdDev
Males      84    70.048   70.000   3.030
Females    89    64.798   65.000   2.877
All        176   67.313   67.000   4.017

Variable   SE Mean   Min    Max    Q1     Q3
Males      0.331     63.0   76.0   68.0   72.0
Females    0.305     56.0   77.0   63.0   67.0
All        0.303     56.0   77.0   64.0   70.0

Page 18:

Heights of College Students - Symmetric and Bimodal

[Figure: distribution of heights with the mean height for females, the mean height for males, and the overall mean marked.]

Page 19:

Systolic Volume for Heart Attack Patients - Skewed Right

• Sample mean (79.42) is substantially larger than the sample median (67.00), so the median is the “better” measure of average.

• Skewness statistic is > 1, suggesting pronounced right skewness (see measures of shape).

Page 20:

Time Until Outcome for Heart Attack Patients - Skewed Left

• Sample mean (112.4) is substantially smaller than the sample median (138.00), so the median is the “better” measure of average.

• Skewness statistic is < -1, suggesting pronounced left skewness (see measures of shape).

Page 21:

Choosing Appropriate Measure of Location

• If data are symmetric and unimodal, the mean, median, and mode will be approximately the same.

• If data are multimodal, report the mean, median and/or mode for each subgroup.

• If data are skewed, report the median.
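A small sketch of why the median is preferred for skewed data: a single extreme value pulls the mean a long way but leaves the median where it was. The `with_outlier` list is a hypothetical variation of the ER arrivals data (the 8 replaced by 80), not data from the slides.

```python
from statistics import mean, median

arrivals = [2, 8, 3, 4, 1]         # from the slides
with_outlier = [2, 80, 3, 4, 1]    # hypothetical: the 8 replaced by 80

mean_before, median_before = mean(arrivals), median(arrivals)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # 3.6 3
print(mean_after, median_after)    # 18 3
```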

Page 22:

Measures of Variability

• Range

• Interquartile range (IQR)

• Variance and standard deviation

• Coefficient of variation (CV)

Page 23:

Range

• The difference between largest and smallest data point.

• Highly affected by outliers.

• Best for symmetric data with no outliers.

Page 24:

Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

Page 25:

Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

Max. = 93 (mmoles/l)

Min. = 38 (mmoles/l)

Range = 93 – 38 = 55 (mmoles/l)

Page 26:

Interquartile range

• The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile), i.e. the spread of the “middle half” of the values.

• IQR = Q3-Q1

• Robust to outliers or extreme observations.

• Works well for skewed data.

Page 27:

Systolic Volume for Heart Attack Patients - Skewed Right

• Q3 = 92.50, Q1 = 52.50, IQR = 92.50 – 52.50 = 40.0

• The range of the middle 50% of systolic volumes is 40 mmoles/l.
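The range and IQR can be computed along the following lines. This sketch reuses the six ER arrival counts from the median example, since the raw systolic volume data are not in the slides. Quartile conventions differ between packages, so the `"inclusive"` method below is an assumption and need not match what JMP reports.

```python
from statistics import quantiles

arrivals = [2, 8, 3, 4, 1, 8]   # ER arrivals from the earlier slides

data_range = max(arrivals) - min(arrivals)

# Quartile definitions vary by software; "inclusive" is one common choice
# (an assumption here, not necessarily JMP's convention).
q1, q2, q3 = quantiles(arrivals, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, q1, q3, iqr)  # 7 2.25 7.0 4.75
```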


Page 28:

Variance

• If measuring variance of population, denoted by $\sigma^2$ (“sigma-squared”).

• If measuring variance of sample, denoted by $s^2$ (“s-squared”).

• Measures average squared deviation of data points from their mean.

• Highly affected by outliers. Best for symmetric data.

• A drawback is that the units are squared.

Page 29:

Formula for the Sample Variance ($s^2$)

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

This is nearly (if not for the n-1 in the denominator) the average squared deviation from the sample mean for our observed data.

Page 30:

Standard deviation

• Sample standard deviation is square root of sample variance, and so is denoted by s.

• Units are the original units.

• Measures “average” deviation of data points from their mean.

• Also, highly affected by outliers.
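To make the variance and standard deviation definitions concrete, here is a minimal sketch that applies them step by step to the five ER arrival counts from the earlier slides and then compares the results with Python's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
from math import sqrt
from statistics import mean, stdev, variance

arrivals = [2, 8, 3, 4, 1]   # ER arrivals from the earlier slides

x_bar = mean(arrivals)                               # 3.6
squared_devs = [(x - x_bar) ** 2 for x in arrivals]  # squared deviations from the mean
s2 = sum(squared_devs) / (len(arrivals) - 1)         # n - 1 in the denominator
s = sqrt(s2)                                         # back in the original units

print(round(s2, 2), round(s, 2))  # 7.3 2.7
```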

Page 31:

Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers

What differences in distribution of time to fall asleep do we see when comparing the smokers to non-smokers in this study?

1) Typical time to fall asleep is 20-21 minutes for both populations.

2) IQR for smokers is more than twice that for non-smokers.

3) Distribution for non-smokers is approx. normal, not so for smokers.

Page 32:

Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers

Smokers Non-smokers

s = 3.69 minutes > s = 2.28 minutes

IQR = 7.05 minutes > IQR = 3.00 minutes

Page 33:

Empirical Rule – The standard deviation and the normal distribution

For unimodal, moderately symmetric sets of data, approximately:

• 68% of observations lie within 1 standard deviation of the mean.

• 95% of observations lie within 2 standard deviations of the mean.

i.e. Normally Distributed Data
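The percentages in the empirical rule come from the normal distribution. As a rough check, the share of a normal distribution within k standard deviations of its mean equals erf(k/√2), which the sketch below evaluates with Python's `math.erf`. The function name `fraction_within` is illustrative only.

```python
from math import erf, sqrt

def fraction_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    # Prints roughly 68.3, 95.4 and 99.7 percent for k = 1, 2, 3
    print(k, round(100 * fraction_within(k), 1))
```

The rule quotes these values rounded to 68%, 95% and 99.7%. Real data that are only roughly normal will match them only approximately.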

Page 34:

The Empirical Rule

[Figure: normal curve centered at the sample mean $\bar{x}$.]

Page 35:

The Empirical Rule

[Figure: normal curve with the axis marked at $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$.]

68% within 1 standard deviation (34% on each side of the mean)

Page 36:

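The Empirical Rule

[Figure: normal curve with the axis marked at $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$.]

68% within 1 standard deviation (34% on each side of the mean)

95% within 2 standard deviations (a further 13.5% on each side)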

Page 37:

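The Empirical Rule

[Figure: normal curve with the axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$.]

68% within 1 standard deviation (34% on each side of the mean)

95% within 2 standard deviations (a further 13.5% on each side)

99.7% of data are within 3 standard deviations of the mean (a further 2.4% on each side, leaving 0.1% in each tail)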

Page 38:

Application of Empirical Rule – Medical Lab Tests

When you have blood drawn and it is screened for different chemical levels, any result more than two standard deviations below or above the mean for healthy individuals will get flagged as being abnormal.

Example: For potassium, healthy individuals have a mean level of 4.4 meq/l with an SD of 0.45 meq/l.

Individuals with levels outside the range $\bar{x} \pm 2s$, i.e. 4.4 – 2(0.45) to 4.4 + 2(0.45), or 3.5 meq/l to 5.3 meq/l, would be flagged as having abnormal potassium.
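The flagging rule in this example can be sketched as follows. The mean (4.4 meq/l) and SD (0.45 meq/l) come from the slide. The function names and the two test values (5.6 and 4.0) are hypothetical and only illustrate the idea.

```python
def reference_range(mean, sd, k=2):
    """Return the interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

def is_flagged(value, mean, sd, k=2):
    """True if value lies more than k standard deviations from the mean."""
    low, high = reference_range(mean, sd, k)
    return value < low or value > high

low, high = reference_range(4.4, 0.45)
print(round(low, 1), round(high, 1))  # 3.5 5.3

print(is_flagged(5.6, 4.4, 0.45))  # True  (hypothetical result above the range)
print(is_flagged(4.0, 4.4, 0.45))  # False (hypothetical result inside the range)
```

By the empirical rule, about 5% of healthy individuals would still fall outside this interval and be flagged.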

Page 39:

Coefficient of Variation (CV)

• Ratio of sample standard deviation to sample mean multiplied by 100.

• Measures relative variability, that is, variability relative to the magnitude of the data.

• Unitless, so good for comparing variation between two groups and for comparing variability of measurements in completely different scales and/or units.

$CV = \frac{s}{\bar{x}} \times 100\%$

Page 40:

Heart Attack Data: Which volume measure has more variation, systolic or diastolic?

SYSVOL: CV = (39.95/79.42) × 100% = 50.3%

DIAVOL: CV = (48.79/158.93) × 100% = 30.7%

Thus systolic volume has the greater variation in our sample on the basis of the CV.
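The comparison above can be reproduced with a few lines. The standard deviations and means are the summary values quoted on the slide, and the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_sysvol = coefficient_of_variation(39.95, 79.42)
cv_diavol = coefficient_of_variation(48.79, 158.93)

print(round(cv_sysvol, 1))  # 50.3
print(round(cv_diavol, 1))  # 30.7
```

Diastolic volume has the larger standard deviation (48.79 vs. 39.95), yet the smaller CV, because its mean is about twice as large.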

Page 41:

The most appropriate measure of variability depends on …

the shape of the data’s distribution.

Page 42:

Choosing Appropriate Measure of Variability

• If data are symmetric, with no serious outliers, use range and standard deviation.

• If data are skewed, and/or have serious outliers, use IQR.

• If comparing variation across two variables, use coefficient of variation if the variables are in different units and/or scales. If the scales and units are roughly the same, direct comparison of the standard deviations is fine.

Page 43:

Measures of Shape – Skewness and Kurtosis

Statistical software packages will give some measure of skewness and kurtosis for a given numeric variable.

Skewness measures departure from symmetry and is usually characterized as being left or right skewed as seen previously.

Kurtosis measures “peakedness” of a distribution and comes in two forms, platykurtosis and leptokurtosis.

Page 44:

Skewness

Pearson’s Skewness Coefficient

$\text{Skewness} = \frac{\bar{x} - \text{median}}{s}$

If skewness < -0.20, severe left skewness

If skewness > +0.20, severe right skewness

Fisher’s Measure of Skewness has a complicated formula, but most software packages compute it.

Fisher’s Skewness > 1.00: moderate right skewness; > 2.00: severe right skewness

Fisher’s Skewness < -1.00: moderate left skewness; < -2.00: severe left skewness
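Pearson’s coefficient as used on this slide, (mean − median)/s, is simple enough to compute by hand or in code. The sketch below applies it to the systolic volume summary values quoted on earlier slides (mean 79.42, median 67.00, standard deviation 39.95). The function name is illustrative.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's skewness coefficient as given on the slide: (mean - median) / sd."""
    return (mean - median) / sd

sysvol_skew = pearson_skewness(79.42, 67.00, 39.95)
print(round(sysvol_skew, 2))  # 0.31
```

The result exceeds the +0.20 cutoff, which agrees with the earlier description of systolic volume as skewed right.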

Page 45:

Skewness

Skewness = -0.5786, suggesting slight left skewness.

Skewness = 1.944, suggesting strong right skewness.

Page 46:

Kurtosis

Measures peakedness of a distribution.

Normal distribution has Kurtosis = 0.

Leptokurtotic distributions are more peaked than normal with fatter tails, Kurtosis > 0.

Platykurtotic distributions are less peaked (squashed normal) than normal, Kurtosis < 0.
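One simple way to get a kurtosis value on this scale (normal = 0) is the moment-based excess kurtosis, m4 / m2² − 3. The sketch below implements that version. Statistical packages often apply small-sample corrections, so their output may differ somewhat from this, and both example data sets are hypothetical.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no small-sample correction)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n   # second central moment
    m4 = sum((x - m) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                        # hypothetical, evenly spread values
peaked = [0, 0, 0, 0, 10, -10, 0, 0, 0, 0]    # hypothetical, sharp peak with heavy tails

print(round(excess_kurtosis(flat), 2))    # -1.3 (platykurtotic)
print(round(excess_kurtosis(peaked), 2))  # 2.0  (leptokurtotic)
```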

Page 47:

Kurtosis

Example 1: Blood pH levels for subjects in right heart catheter study. Here we see a slightly left-skewed (skewness = -1.22) but markedly leptokurtotic (kurtosis = 3.49) distribution. A reference normal curve has been added, and the blue curve is the density estimate from the data.

Page 48:

Example 2: Kurtosis

Times to fall asleep for non-smokers are approx. normal as both skewness and kurtosis are close to 0.

Times to fall asleep for smokers are fairly platykurtotic (Kurtosis = -1.50).

Page 49:

Transformations to Improve Normality (removing skewness)

Many statistical methods require that the numeric variables you are working with have an approximately normal distribution.

In reality, this is often not the case. One of the most common departures from normality is skewness, in particular, right skewness.

Page 50:

Tukey’s Ladder of Powers

Here V represents our variable of interest. We are going to consider this variable raised to a power $\lambda$, i.e. $V^\lambda$.

UP (bigger impact): for left-skewed data
. . .
$V^4$
$V^3$
$V^2$
$V$ (middle rung: no transformation, $\lambda = 1$)
$\sqrt{V}$
$\sqrt[3]{V}$
$\log_{10}(V)$ (think of this as $V^0$)
$1/V$
$1/V^2$
. . .
DOWN (bigger impact): for right-skewed data

We go up the ladder to remove left skewness and down the ladder to remove right skewness.

Page 51:

Tukey’s Ladder of Powers

• To remove right skewness we typically take the square root, cube root, logarithm, or reciprocal of the variable, i.e. $V^{0.5}$, $V^{0.333}$, $V^{0}$, $V^{-1}$, etc.

• To remove left skewness we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e. $V^2$, $V^3$, etc.
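A minimal sketch of the ladder as a function: a power of 0 is treated as the base-10 logarithm, as on the slides, and any other power is applied directly. The function name `ladder` and the `right_skewed` values are hypothetical, and the variable is assumed to be strictly positive so that roots, logs and reciprocals are defined.

```python
from math import log10

def ladder(v, power):
    """Tukey ladder transformation of a positive value v.

    power = 0 stands for log10(v); any other power returns v ** power.
    """
    if power == 0:
        return log10(v)
    return v ** power

right_skewed = [1, 10, 100, 1000]   # hypothetical, strongly right-skewed values

# Moving down the ladder pulls in the long right tail more and more strongly.
print([round(ladder(v, 0.5), 2) for v in right_skewed])  # square root
print([ladder(v, 0) for v in right_skewed])              # [0.0, 1.0, 2.0, 3.0]
```

On the log scale these values are evenly spaced, which shows how a rung further down the ladder compresses large values more than a rung closer to the middle.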

Page 52:

Removing Right Skewness

Example: PDP-LI levels for cancer patients

[Figure: distributions of PDP-LI, $\sqrt{\text{PDP-LI}}$, $\sqrt[3]{\text{PDP-LI}}$, and $\log_{10}(\text{PDP-LI})$.]

In the log base 10 scale the PDP-LI values are approximately normally distributed.