Making Inferences, AKA Hypothesis Testing

Assignment 2 and 3

• You should have received feedback on assignment 2.

– Great job everyone.

• Please send everything to both me and Lamiya.

• Mofiz Haque, please stop by; I have a question about your email address.

• Assignment 3 is assigned today.

So Far

• You know how to describe variables:

– Conceptually with a taxonomy

– Graphically

– Numerically

• You know how to describe some distributions:

– Empirically

– Theoretically

• You have been exposed to two statistical packages to help you do these tasks:

– R with Rcmdr

– SAS Enterprise Guide

So Far

• Probability is scored between 0 and 1.

• Area under a curve or heights of bars represent probability.

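[Figure: probability scale running from 0 (impossible) to 1 (certain). 0.5 is as likely as not; values below 0.5 are unlikely to occur and values above 0.5 are likely to occur.]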

From Last Time

• I talked about how, when a variable (really its distribution) is theoretically normally distributed, it is described by only two parameters: the mean and the standard deviation (these come from the first two moments of the distribution).

• When you take many samples (each with more than one observation), calculate the mean of each, and plot the means, the density looks approximately normal. The fact that the sampling distribution of the means approaches a normal shape as the sample size grows (almost irrespective of the original distribution) is called the Central Limit Theorem.

Moving On

• The next steps are to describe other types of distributions and figure out how to quantify just how unusual a weird statistic from your sample actually is.

• You are not always going to be making generalizations by comparing means.

– Comparing variability (variance) is hugely important.

Variability of Sample Means

• Recall that the number of people (observations) in each sample mattered a lot in determining whether the sampling distribution looked normal.

– If you have a decent size sample (the number of people in each sample), it is hard to get very extreme values out of a normal sampling distribution because the extremely big values tend to cancel out the extremely small values.

[Figure: three frequency histograms on a common x-axis from 300 to 700: "Actual Scores" (scores), "Bunch of Means sample N = 5" (bunchOfMeans), and "Bunch of Means sample N = 20" (bunchOfMeans20).]

The distribution of the means from samples of size 5 is narrower than the distribution of the original values (and bell-shaped).

The distribution of the means from samples of size 20 is narrower still (and bell-shaped).

Variability Between Samples

• The width of the sampling distribution of the means got narrower and narrower as the size of each of the samples increased.

• The variability of the individual observations (think of them as samples of size 1) is called the standard deviation.

• The variability across the means when you have samples bigger than size 1 is called the standard error.
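The narrowing described above can be checked with a small simulation. This is only a sketch in Python (not the SAS or R used in class), and the population is an assumption made for illustration: scores spread uniformly between 300 and 700. The function and variable names are hypothetical.

```python
import random
import statistics


def sd_of_sample_means(draw, sample_size, n_samples=5000, seed=1):
    """Draw many samples, take the mean of each, and return the SD of those means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)


def draw_score(rng):
    """Hypothetical population: scores uniform between 300 and 700 (an assumption)."""
    return rng.uniform(300, 700)


# Standard deviation of a uniform(300, 700) variable.
POPULATION_SD = (700 - 300) / 12 ** 0.5

for n in (1, 5, 20):
    simulated = sd_of_sample_means(draw_score, n)
    predicted = POPULATION_SD / n ** 0.5
    print(f"n = {n:2d}: SD of the means = {simulated:6.1f}, SD / sqrt(n) = {predicted:6.1f}")
```

With samples of size 1 the spread of the "means" is just the standard deviation; with bigger samples it shrinks toward the standard deviation divided by the square root of the sample size.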

Standard Error

• The formula for the standard error of the means is just the sample standard deviation formula with a tweak to indicate the impact of the sample size.

• The SE plays a huge role in all inferences. You need it to determine what is an odd sample.

$$SE_{\text{mean}} = \frac{SD}{\sqrt{\text{sample size}}}$$

Standard Error Formula

• As you move through the year, you will meet many formulas for standard errors.

– If you are testing to see if there is a difference between two groups, you use a slightly different formula.

– If you are working with the distribution of counts of events happening or not happening in many trials (yes/no getting pregnant on many attempts), the SE formula is different but it plays the same role. It helps you determine what is an unusual value.

Probability Functions

• Some people are entertained while others are horrified at the prospect of having to do calculus to figure out the area under the curve that corresponds to an unusual sample. Happily, you don’t have to. You can use the probability functions in a language like SAS or R.

Quantiles

• Say you want to know what value (quantile) of a standard normal distribution corresponds to a given probability.

• The standard normal is where you have rescaled your values so they are measured with a mean of 0 and standard deviation of 1.

thingy ~ N (0,1)

• For example, you may want to know what value cuts off the most extremely large 5% of a standard normal curve.
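As a rough illustration of the quantile lookup, here is a sketch using Python's standard library rather than SAS or R (that choice is an assumption; the class software has equivalent functions). `NormalDist.inv_cdf` takes a cumulative probability and returns the matching value.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# The value that cuts off the largest 5% has 95% of the area below it.
upper_5_percent_cut = standard_normal.inv_cdf(0.95)
print(round(upper_5_percent_cut, 3))  # 1.645
```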

Z-scores for Percentages

What percentiles?

• You are far more likely to want to know what percentile your actual scores correspond to. To get those values, you will use the CDF (Cumulative Distribution Function).
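Going the other way, from a score to a percentile, might look like the sketch below. It is written in Python rather than SAS or R, and the scale (mean 500, standard deviation 100) is a hypothetical example, not a number from the lecture.

```python
from statistics import NormalDist


def percentile_of(score, mean, sd):
    """Rescale a score to a z-score, then return the percent of the curve below it."""
    z = (score - mean) / sd
    return 100 * NormalDist().cdf(z)


# Hypothetical scale: mean 500, SD 100. A score of 600 is one SD above the mean.
print(round(percentile_of(600, 500, 100), 1))  # 84.1
```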

[Figure: four panels for a standard normal variable z: a histogram of frequency against z, the probability density against z, the cumulative probability against z, and the quantile (Z) against the probability p.]

Null Hypothesis

• When you design an experiment, you typically propose a hypothesis indicating that nothing interesting is going on.

– For example, if you expect a drug and a placebo to act the same way, your null hypothesis is that the average difference is 0.

– You reject the null hypothesis if your sample is too far out in the tails of the null distribution.

– You typically set up this target (dummy hypothesis) and hope your data does not look like this.

Hoe Hoe Hoe

• The null hypothesis is typically written H0. That is pronounced H-zero or H-naught. Don’t call it “hoe”.

• The alternative hypothesis is typically written H1 or HA.

What could possibly go wrong?

• When you do an experiment, you come up with a hypothetical population mean and SD and have a computer calculate the sampling distribution of the means (for your sample size). You can then test to see if your data is compatible or weird given the population mean and standard error.

• Call this distribution “the null distribution” because it is what you expect when nothing interesting is going on.

• What could possibly go wrong?


• Your guess at the population mean was right, but by chance (poor luck) you could get a sample from way out in the tails of the distribution.

– This first thing that could go wrong is called the Type I (one) Error.

• Things could be really bad: your guess about the population mean was wrong, but you get a sample that is compatible with your original hypothesis, which is not in agreement with reality. This second thing that could go wrong is called the Type II (two) Error.

– You won’t notice that the distribution is actually centered around an alternative mean and has an alternative distribution.
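Both kinds of mistake can be seen by running the same experiment many times on a computer. The sketch below is in Python (not SAS or R), and every number in it is a hypothetical choice for illustration: a null mean of 100, an SD of 15, samples of 25, and a true mean of 106 for the case where the null is wrong. It uses a two-tailed z-test at the .05 level.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(true_mean, null_mean, sd, n, alpha=0.05,
                   n_experiments=4000, seed=2):
    """Fraction of simulated experiments whose sample mean lands in the tails
    of the null distribution (two-tailed z-test, SD assumed known)."""
    rng = random.Random(seed)
    se = sd / n ** 0.5
    z_cut = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_experiments):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if abs(sample_mean - null_mean) / se > z_cut:
            rejections += 1
    return rejections / n_experiments


# Null is true: every rejection is a Type I error.
type_1_rate = rejection_rate(true_mean=100, null_mean=100, sd=15, n=25)

# Null is false (hypothetical true mean of 106): every non-rejection is a Type II error.
type_2_rate = 1 - rejection_rate(true_mean=106, null_mean=100, sd=15, n=25)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I rate should sit close to the .05 you chose, while the Type II rate depends on how far the true mean is from the null mean and on the sample size.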

Think of…

Pascal’s Wager

Your Decision | The TRUTH: God Exists | The TRUTH: God Doesn’t Exist
Reject God | BIG MISTAKE | Correct
Accept God | Correct (Big Pay Off) | MINOR MISTAKE

Type I and Type II Error in a Box

Your Statistical Decision | True State of Null Hypothesis: H0 True | True State of Null Hypothesis: H0 False
Reject H0 | Type I error (α) | Correct
Do not reject H0 | Correct | Type II error (β)

Analogy to Quality Control

• In my humble opinion, people typically worry too much about the Type I error. The probability of making this error, which you choose in advance, is called the α (alpha) level; the p-value you calculate from your data is compared against it.

• Failing to realize that the data should be described by an alternative distribution is called the β (beta) error.

Hypothesis Testing Analogies

Decision | Is a real difference | Is no real difference
Reject Null | No error (true positive) | Type 1 error
Fail to reject | Type 2 error | No error (true negative)

Test result | Is really cancer | No cancer
High PSA | No error (true positive) | False positive
Normal PSA | False negative | No error (true negative)

Test result | Is really cancer | No cancer
Positive image | No error (true positive) | False positive
Negative image | False negative | No error (true negative)

Low metastasis potential

Highly aggressive breast cancer

Power (1 - β) corresponds to sensitivity.

Specificity corresponds to 1 - α.

A Tale About Two Tails

• If you want to test to see if your data is incompatible with a null hypothesis, you specify just how weird it needs to be to be called weird. That is, you specify the alpha level. Typically you say a sample statistic that would happen only 1 in 20 times is too uncommon to say it happened by chance alone.

• For example, you have a hypothetical mean and if your sample mean is very high or very low relative to it, you say it is too odd and you reject the null hypothesis.

• Using the code from earlier in the lecture, you could figure out the probability of a value.
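One way to figure out that probability is sketched below in Python (an assumption; the lecture used SAS and R). The function name and the example numbers (null mean 100, standard error 3, sample mean 106) are hypothetical. It counts both tails, so it answers "how often would a sample mean be at least this far from the null mean, in either direction?"

```python
from statistics import NormalDist


def two_tailed_p_value(sample_mean, null_mean, standard_error):
    """Probability of a sample mean at least this far from the null mean,
    in either direction, if the null distribution is normal."""
    z = (sample_mean - null_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the sample mean is 2 standard errors above the null mean.
p = two_tailed_p_value(sample_mean=106, null_mean=100, standard_error=3)
print(round(p, 4))  # 0.0455
```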

One-Tailed

• Typically you want to know if your value differs from the population value. In other cases (very rarely), you may be interested if and only if the value is greater than the population value. In yet other cases (very rarely), you may be interested if and only if the value is less than the population value.

• The test of a difference is a two-tailed test because the value could be unusually high or low. The test of “more than” (as opposed to “different”) is a one-tailed test. The test of “less than” is also a one-tailed test.

Splitting Tails

• If you do a two-sided test and you say a sample is odd if it occurs only 1/20 times, you need to split that .05 (5 percent) of the weirdness across both tails. So you cut the distribution such that a sample which is in the upper .025 or lower .025 of the distribution is grounds for rejecting the null hypothesis. But if you say that you are only interested in whether this sample is greater than the hypothetical mean, you can shove all .05 into one tail and it is relatively easy to find a weird sample.
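The effect of splitting the tails shows up directly in the cut points. A minimal sketch in Python (not SAS or R; the helper name `critical_z` is made up for this example):

```python
from statistics import NormalDist


def critical_z(alpha, tails):
    """z value beyond which a sample counts as weird.
    tails=2 splits alpha across both tails; tails=1 puts it all in one tail."""
    return NormalDist().inv_cdf(1 - alpha / tails)


two_tailed_cut = critical_z(0.05, tails=2)
one_tailed_cut = critical_z(0.05, tails=1)
print(f"two-tailed: reject beyond +/- {two_tailed_cut:.3f}")  # about 1.960
print(f"one-tailed: reject beyond {one_tailed_cut:.3f}")      # about 1.645
```

The one-tailed cut point is closer to the mean, which is why a one-tailed test makes it easier to call a sample weird.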

Some Moron Tails…

• The inexact use of Fisher's Exact Test in six major medical journals by McKinney et al., JAMA Vol. 261 No. 23, June 16, 1989

– We reviewed the use of Fisher's Exact Test in 71 articles published between 1983 and 1987 in six medical journals. Thirty-three of 56 selected articles did not specify use of a one- or two-tailed test, and 12 (36%) of these actually used the one-tailed test. Five (42%) of these 12 articles contained at least one table in which the standard significance level of P less than .05 was no longer met when a two-tailed analysis was run instead.

Extreme Caution

• If an outcome could biologically be either above or below a population mean, do the two-sided test. There are terrifying scenarios that begin with a standard of care that is thought to be so good that a new (less invasive) treatment could only be worse. So a researcher does a one-sided test to see if the new treatment is worse. In reality, the gold standard is harmful (pure oxygen to neonates). Because the test only looks in one direction, the researcher does not see a statistically significant difference. In other words, the researcher fails to detect the harmful effect of the standard treatment.


• Recall that in addition to the Type I error caused by having an unusual sample that really came from the null distribution, you could get a value from the alternate distribution that was compatible with the null hypothesis.

Alpha and Beta

• Alpha and Beta errors are intimately connected in testing hypotheses and van Belle does not make this clear enough. An alternate presentation can be found in Norman and Streiner’s Biostatistics: The Bare Essentials. If you are math-phobic, I highly recommend the book.

Blood Sodium Example

• The story begins with a measure of blood sodium with a known population mean of 140 mmol/L and a standard deviation of 2. In the study, blood measures are taken on 25 people and the mean is 137.5. The question is “does it look unlikely that the sample mean came from a population with a mean of 140 or do you want to conclude that the true population mean is different?”

• What do you do?

Steps to the Comparison

• What is the standard error?

• How many standard errors away from the mean is this sample?

• If testing for a difference between the groups at the alpha .05 level, what is the cut point in z-units?

• What is the cut point in the original units?

• What is the power?

• What is the beta error?

• What happens if you use a smaller sample?

The SE

• The Standard Error of the mean:

$$SE_{\text{mean}} = \frac{SD}{\sqrt{\text{sample size}}}$$

• The Z score:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Calculating a Z Score

It is a darn unusual sample if the population mean is 140.
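The arithmetic behind that statement, using the blood sodium numbers from the example (population mean 140, SD 2, 25 people, sample mean 137.5), can be sketched in Python. The language and variable names are choices made for this illustration, not the lecture's.

```python
from math import sqrt

POPULATION_MEAN = 140.0  # mmol/L, from the example
POPULATION_SD = 2.0
SAMPLE_SIZE = 25
SAMPLE_MEAN = 137.5

# Standard error of the mean: SD divided by the square root of the sample size.
standard_error = POPULATION_SD / sqrt(SAMPLE_SIZE)

# How many standard errors the sample mean sits from the population mean.
z_score = (SAMPLE_MEAN - POPULATION_MEAN) / standard_error

print(f"SE = {standard_error:.2f}")  # SE = 0.40
print(f"z  = {z_score:.2f}")         # z  = -6.25
```

A sample mean more than 6 standard errors below 140 is far beyond the usual two-tailed cut point of about 1.96.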

The Actual Cut Point

[Figure: density of the sampling distribution of the mean, x-axis from 136 to 142, for sample size = 25.]

• What happens when your sample size is smaller?

Sample size = 4
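To compare the two sample sizes, the sketch below computes the cut points in the original units and the power for the blood sodium example. It is written in Python rather than SAS, and it needs one thing the slides do not give: an alternative population mean. The value 139 used here is purely an assumption for illustration, as are the function names.

```python
from math import sqrt
from statistics import NormalDist


def cut_points(null_mean, sd, n, alpha=0.05):
    """Lower and upper two-tailed cut points in the original units."""
    se = sd / sqrt(n)
    z_cut = NormalDist().inv_cdf(1 - alpha / 2)
    return null_mean - z_cut * se, null_mean + z_cut * se


def power(null_mean, alternative_mean, sd, n, alpha=0.05):
    """Chance of rejecting the null when the true mean is alternative_mean."""
    lower, upper = cut_points(null_mean, sd, n, alpha)
    alternative = NormalDist(mu=alternative_mean, sigma=sd / sqrt(n))
    return alternative.cdf(lower) + (1 - alternative.cdf(upper))


ALTERNATIVE_MEAN = 139.0  # hypothetical; not given in the lecture

for n in (25, 4):
    lower, upper = cut_points(140, 2, n)
    pw = power(140, ALTERNATIVE_MEAN, 2, n)
    print(f"n = {n:2d}: reject below {lower:.2f} or above {upper:.2f}; "
          f"power = {pw:.3f}, beta = {1 - pw:.3f}")
```

With the smaller sample the cut points move further from 140, so the same true difference is much harder to detect: power drops and the beta error grows.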

Running the Analyst

Pick Your Study Design

Fill in the Blanks

Get Results as a Table

…or as a picture

Other Software Packages

• S-Plus can easily produce information on power:

Best Guesses

• So far I talked about making judgments regarding when a sample is compatible with a distribution. Another very important task is making a guess about a population value and specifying the precision of your guess.

• This is the process of building confidence intervals.

100% Confidence Intervals

• Say I do a sample of ages of Stanford undergraduates. My mean from the sample is 20 years old. That is my point estimate of the population mean. I know that the true mean is not exactly 20. So I give myself some wiggle room by saying 20 plus or minus something. That range is called the confidence interval.

• I want to be 100% certain that my guesstimated range includes the true population mean, so I say age 20 +/- 90 years.

Can I do better than that?

• The true population mean is going to be within the range of 0 and 110 years old. So, I have built a 100% confidence interval.

• Say I get a sample of 25 undergrads, calculate their mean age, add +/- 10 years and call that the confidence interval. The population mean will or will not be inside of the range. So reality is either yes or no. How do I specify a probability here?

• You want to specify a range such that, when you do the sampling experiment many times, you will usually capture the true value within the guesstimated range. That is the typical definition of a confidence interval.

• You use the sampling distribution we have been talking about to calculate those values.
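The "do the sampling experiment many times" idea can be tried out directly. The following is a sketch in Python (not SAS or R) under loudly stated assumptions: a hypothetical population of ages that is normal with true mean 20 and a known SD of 2, samples of 25, and a 95% interval built as the sample mean plus or minus 1.96 standard errors.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean, sd, n, confidence=0.95, n_experiments=4000, seed=3):
    """Fraction of repeated samples whose interval (mean +/- z * SE) contains
    the true mean. Assumes a normal population with known SD."""
    rng = random.Random(seed)
    z_cut = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_cut * sd / n ** 0.5
    hits = 0
    for _ in range(n_experiments):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / n_experiments


# Hypothetical population: true mean age 20, SD 2, samples of 25 undergrads.
captured = coverage(true_mean=20, sd=2, n=25)
print(f"Intervals that captured the true mean: {captured:.1%}")
```

Any single interval either contains the true mean or it does not; the 95% describes how often the procedure works over many repetitions.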
