Pitfalls of Hypothesis Testing + Sample Size Calculations

TRANSCRIPT

Page 1: Pitfalls of Hypothesis Testing + Sample Size Calculations

Pitfalls of Hypothesis Testing + Sample Size Calculations

Page 2: Pitfalls of Hypothesis Testing + Sample Size Calculations

Hypothesis Testing

The Steps:
1. Define your hypotheses (null, alternative)
2. Specify your null distribution
3. Do an experiment
4. Calculate the p-value of what you observed
5. Reject or fail to reject (~accept) the null hypothesis

Follows the logic: If A then B; not B; therefore, not A.
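The five steps above can be sketched with an exact binomial test on a hypothetical coin-flip experiment (the coin example is illustrative, not from the slides):

```python
from math import comb

# 1. Hypotheses: H0: the coin is fair (p = 0.5); HA: it is biased
# 2. Null distribution: heads in n flips ~ Binomial(n, 0.5)
n, heads = 100, 60                      # 3. The experiment: 60 heads in 100 flips

def binom_pmf(k):
    return comb(n, k) * 0.5**n          # P(exactly k heads) under H0

# 4. Two-sided p-value: double the upper-tail area (the null is symmetric)
p_value = 2 * sum(binom_pmf(k) for k in range(heads, n + 1))

# 5. Reject H0 if p_value < 0.05; here p is about 0.057, so we fail to reject
print(p_value)
```

Note that failing to reject is not the same as proving the coin fair, which is exactly the distinction several later slides turn on.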

Page 3: Pitfalls of Hypothesis Testing + Sample Size Calculations

Summary: The Underlying Logic of hypothesis tests…

Follows this logic:

Assume A.

If A, then B.

Not B.

Therefore, Not A.

But throw in a bit of uncertainty…If A, then probably B…

Page 4: Pitfalls of Hypothesis Testing + Sample Size Calculations

Error and Power

Type I Error (also known as α): rejecting the null when the effect isn't real.

Type II Error (also known as β): failing to reject the null when the effect is real.

POWER (the flip side of type II error: 1 − β): the probability of seeing a true effect if one exists.

Note the sneaky conditionals…

Page 5: Pitfalls of Hypothesis Testing + Sample Size Calculations

Think of…

Pascal’s Wager

Your Decision | The TRUTH: God Exists | The TRUTH: God Doesn't Exist
Reject God    | BIG MISTAKE           | Correct
Accept God    | Correct (big payoff)  | MINOR MISTAKE

Page 6: Pitfalls of Hypothesis Testing + Sample Size Calculations

Type I and Type II Error in a box

Your Statistical Decision | True state of the null hypothesis
                          | H0 True (example: the drug doesn't work) | H0 False (example: the drug works)
Reject H0 (ex: you conclude that the drug works) | Type I error (α) | Correct
Do not reject H0 (ex: you conclude that there is insufficient evidence that the drug works) | Correct | Type II error (β)

Page 7: Pitfalls of Hypothesis Testing + Sample Size Calculations

Error and Power

Type I error rate (or significance level): the probability of finding an effect that isn’t real (false positive).

If we require p-value<.05 for statistical significance, this means that 1/20 times we will find a positive result just by chance.

Type II error rate: the probability of missing an effect (false negative).

Statistical power: the probability of finding an effect if it is there (the probability of not making a type II error).

When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).

Page 8: Pitfalls of Hypothesis Testing + Sample Size Calculations

Pitfall 1: over-emphasis on p-values

Clinically unimportant effects may be statistically significant if a study is large (and therefore, has a small standard error and extreme precision).

Pay attention to effect size and confidence intervals.
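This pitfall is easy to demonstrate numerically: the same clinically trivial difference (here a hypothetical 0.05-SD effect) is non-significant in a small study and highly significant in a huge one, because the standard error shrinks with n.

```python
from math import sqrt, erf

def z_test_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test with equal group sizes."""
    se = sd * sqrt(2.0 / n_per_group)              # standard error of the difference
    z = abs(mean_diff) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * upper-tail area

# The same tiny effect (0.05 SD, clinically trivial) at two sample sizes:
print(z_test_p(0.05, 1.0, 100))     # small study: not significant
print(z_test_p(0.05, 1.0, 10000))   # huge study: highly significant
```

The effect size never changed; only the precision did. That is why effect sizes and confidence intervals matter alongside p-values.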

Page 9: Pitfalls of Hypothesis Testing + Sample Size Calculations

Example: effect size

A prospective cohort study of 34,079 women found that women who exercised >21 MET-hours per week gained significantly less weight than women who exercised <7.5 MET-hours (p<.001).

Headlines: “To Stay Trim, Women Need an Hour of Exercise Daily.”

Physical Activity and Weight Gain Prevention. JAMA 2010;303:1173-1179.

Page 10: Pitfalls of Hypothesis Testing + Sample Size Calculations


Lee, I. M. et al. JAMA 2010;303:1173-1179.

Mean (SD) Differences in Weight Over Any 3-Year Period by Physical Activity Level, Women's Health Study, 1992-2007a

Page 11: Pitfalls of Hypothesis Testing + Sample Size Calculations

•What was the effect size? Those who exercised the least gained just 0.15 kg (0.33 pounds) more than those who exercised the most over 3 years.

•Extrapolated over the 13 years of the study, the high exercisers gained only 1.4 pounds less than the low exercisers!

•A classic example of a statistically significant effect that is not clinically significant.

Page 12: Pitfalls of Hypothesis Testing + Sample Size Calculations

A picture is worth…

Page 13: Pitfalls of Hypothesis Testing + Sample Size Calculations

A picture is worth…

But baseline physical activity should predict weight gain in the first three years…do those slopes look different to you?

Authors explain: “Figure 2 shows the trajectory of weight gain over time by baseline physical activity levels. When classified by this single measure of physical activity, all 3 groups showed similar weight gain patterns over time.”

Page 14: Pitfalls of Hypothesis Testing + Sample Size Calculations

Another recent headline: "Drinkers May Exercise More Than Teetotalers: Activity levels rise along with alcohol use, survey shows"

“MONDAY, Aug. 31 (HealthDay News) -- Here's something to toast: Drinkers are often exercisers”…

“In reaching their conclusions, the researchers examined data from participants in the 2005 Behavioral Risk Factor Surveillance System, a yearly telephone survey of about 230,000 Americans.”…

For women, those who imbibed exercised 7.2 minutes more per week than teetotalers. The results applied equally to men…

Page 15: Pitfalls of Hypothesis Testing + Sample Size Calculations

Pitfall 2: association does not equal causation

Statistical significance does not imply a cause-effect relationship.

Interpret results in the context of the study design.

Page 16: Pitfalls of Hypothesis Testing + Sample Size Calculations

Pitfall 3: multiple comparisons

A significance level of 0.05 means that your false positive rate for one test is 5%.

If you run more than one test, your false positive rate will be higher than 5%.

Page 17: Pitfalls of Hypothesis Testing + Sample Size Calculations

Data dredging/multiple comparisons

In 1980, researchers at Duke randomized 1073 heart disease patients into two groups, but treated the groups equally.

Not surprisingly, there was no difference in survival. Then they divided the patients into 18 subgroups based on prognostic factors. In a subgroup of 397 patients (with three-vessel disease and an abnormal left ventricular contraction), survival of those in "group 1" was significantly different from survival of those in "group 2" (p<.025).

How could this be, since there was no treatment?

(Lee et al. "Clinical judgment and statistics: lessons from a simulated randomized trial in coronary artery disease," Circulation, 61: 508-515, 1980.)

Page 18: Pitfalls of Hypothesis Testing + Sample Size Calculations

Multiple comparisons

The difference resulted from the combined effect of small imbalances in the subgroups.

Page 19: Pitfalls of Hypothesis Testing + Sample Size Calculations

Multiple comparisons

By using a p-value of 0.05 as the criterion for significance, we're accepting a 5% chance of a false positive (of calling a difference significant when it really isn't).

If we compare survival of "treatment" and "control" within each of 18 subgroups, that's 18 comparisons.

If these comparisons were independent, the chance of at least one false positive would be:

1 − (0.95)^18 ≈ 0.60

Page 20: Pitfalls of Hypothesis Testing + Sample Size Calculations

Multiple comparisons

With 18 independent comparisons, we have 60% chance of at least 1 false positive.

Page 21: Pitfalls of Hypothesis Testing + Sample Size Calculations

Multiple comparisons

With 18 independent comparisons, we expect about 1 false positive.
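Both numbers on these slides fall straight out of the formula 1 − (1 − α)^k for k independent tests:

```python
# Probability of at least one false positive across k independent tests,
# each run at significance level alpha (the family-wise error rate)
def family_wise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(round(family_wise_error(18), 2))  # 1 - 0.95**18, about 0.60
print(18 * 0.05)                        # expected number of false positives: 0.9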

Page 22: Pitfalls of Hypothesis Testing + Sample Size Calculations

Results from a previous class survey…

My research question was whether being born on an odd or even day predicted anything about people's futures.

I discovered that people who were born on odd days got up later and drank more alcohol than people born on even days; they also showed a trend toward doing more homework (p=.04, p<.01, p=.09).

Those born on odd days woke up 42 minutes later (7:48 vs. 7:06 am); drank 2.6 more drinks per week (3.7 vs. 1.1); and did 8 more hours of homework (22 hrs/week vs. 14).

Page 23: Pitfalls of Hypothesis Testing + Sample Size Calculations

Results from Class survey…

I can see the NEJM article title now…

“Being born on odd days predisposes you to alcoholism and laziness, but makes you a better med student.”

Page 24: Pitfalls of Hypothesis Testing + Sample Size Calculations

Results from Class survey…

Assuming that this difference can’t be explained by astrology, it’s obviously an artifact!

What’s going on?…

Page 25: Pitfalls of Hypothesis Testing + Sample Size Calculations

Results from Class survey…

After the odd/even day question, I asked 25 other questions…

I ran 25 statistical tests (comparing each outcome variable between odd-day-born people and even-day-born people).

So, there was a high chance of finding at least one false positive!

Page 26: Pitfalls of Hypothesis Testing + Sample Size Calculations

P-value distribution for the 25 tests…

Recall: Under the null hypothesis of no associations (which we’ll assume is true here!), p-values follow a uniform distribution…

My significant p-values!

Page 27: Pitfalls of Hypothesis Testing + Sample Size Calculations

Compare with…

Next, I generated 25 “p-values” from a random number generator (uniform distribution). These were the results from three runs…
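The same experiment is easy to repeat (a hypothetical re-run, not the instructor's code): draw 25 "p-values" from Uniform(0,1), the null distribution of p-values, and count how many fall below .05.

```python
import random

random.seed(1)   # fixed seed so the run is reproducible
runs = 1000
hits = []
for _ in range(runs):
    # 25 p-values under the null: each is Uniform(0, 1)
    p_values = [random.random() for _ in range(25)]
    hits.append(sum(p < 0.05 for p in p_values))

# With 25 null tests we expect 25 * 0.05 = 1.25 false positives per run
avg = sum(hits) / runs
print(avg)
```

Pure noise reliably delivers a "significant" result or two per batch of 25 tests, which is exactly what the class survey found.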

Page 28: Pitfalls of Hypothesis Testing + Sample Size Calculations

In the medical literature…

Researchers examined the relationship between intakes of caffeine/coffee/tea and breast cancer, overall and in multiple subgroups (50 tests).

Overall, there was no association. Risk ratios were close to 1.0 (ranging from 0.67 to 1.79), indicated protection (<1.0) about as often as harm (>1.0), and showed no consistent dose-response pattern.

But they found 4 "significant" p-values in subgroups:
- coffee intake was linked to increased risk in those with benign breast disease (p=.08)
- caffeine intake was linked to increased risk of estrogen/progesterone-negative tumors and tumors larger than 2 cm (p=.02)
- decaf coffee was linked to reduced risk of breast cancer in postmenopausal hormone users (p=.02)

Ishitani K, Lin J, PhD, Manson JE, Buring JE, Zhang SM. Caffeine consumption and the risk of breast cancer in a large prospective cohort of women. Arch Intern Med. 2008;168:2022-2031.

Page 29: Pitfalls of Hypothesis Testing + Sample Size Calculations

Distribution of the p-values from the 50 tests

Likely chance findings!

Also, effect sizes showed no consistent pattern.

The risk ratios:

-were close to 1.0 (ranging from 0.67 to 1.79)

-indicated protection (<1.0) about as often harm (>1.0)

-showed no consistent dose-response pattern.

Page 30: Pitfalls of Hypothesis Testing + Sample Size Calculations

Hallmarks of a chance finding:
- Analyses are exploratory
- Many tests have been performed but only a few are significant
- The significant p-values are modest in size (between p=0.01 and p=0.05)
- The pattern of effect sizes is inconsistent
- The p-values are not adjusted for multiple comparisons

Page 31: Pitfalls of Hypothesis Testing + Sample Size Calculations

Conclusions

Look at the totality of the evidence.

Expect about one marginally significant p-value (.01<p<.05) for every 20 tests run.

Be wary of unplanned comparisons (e.g., subgroup analyses).

Page 32: Pitfalls of Hypothesis Testing + Sample Size Calculations

Pitfall 4: high type II error (low statistical power)

Results that are not statistically significant should not be interpreted as "evidence of no effect," but as "no evidence of effect."

Studies may miss effects if they are insufficiently powered (lack precision).

Example: A study of 36 postmenopausal women failed to find a significant relationship between hormone replacement therapy and prevention of vertebral fracture. The odds ratio and 95% CI were: 0.38 (0.12, 1.19), indicating a potentially meaningful clinical effect. Failure to find an effect may have been due to insufficient statistical power for this endpoint.

Ref: Wimalawansa et al. Am J Med 1998, 104:219-226.

Page 33: Pitfalls of Hypothesis Testing + Sample Size Calculations

Example

"There was no significant effect of treatment (p = 0.058), nor treatment by velocity interaction (p = 0.19), indicating that the treatment and control groups did not differ in their ability to perform the task."

P-values >.05 indicate that we have insufficient evidence of an effect; they do not constitute proof of no effect.

Page 34: Pitfalls of Hypothesis Testing + Sample Size Calculations

Smoking cessation trial

Weight-concerned women smokers were randomly assigned to one of four groups: weight-focused or standard counseling, plus bupropion or placebo.

Outcome: biochemically confirmed smoking abstinence.

Levine MD, Perkins KS, Kalarchian MA, et al. Bupropion and Cognitive Behavioral Therapy for Weight-Concerned Women Smokers. Arch Intern Med 2010;170:543-550.

Page 35: Pitfalls of Hypothesis Testing + Sample Size Calculations

Rates of biochemically verified prolonged abstinence at 3, 6, and 12 months from a four-arm randomized trial of smoking cessation*

Months after quit target date | Weight-focused: Bupropion (n=106) | Weight-focused: Placebo (n=87) | P-value | Standard: Bupropion (n=89) | Standard: Placebo (n=67) | P-value
3                             | 41%                               | 18%                            | .001    | 33%                        | 19%                        | .07
6                             | 34%                               | 11%                            | .001    | 21%                        | 10%                        | .08
12                            | 24%                               | 8%                             | .006    | 19%                        | 7%                         | .05

The Results…

Page 36: Pitfalls of Hypothesis Testing + Sample Size Calculations

(Table from Page 35 repeated.)

Counseling methods appear equally effective in the placebo group.

Page 37: Pitfalls of Hypothesis Testing + Sample Size Calculations

(Table from Page 35 repeated.)

Clearly, bupropion improves quitting rates in the weight-focused counseling group.

Page 38: Pitfalls of Hypothesis Testing + Sample Size Calculations

(Table from Page 35 repeated.)

What conclusion should we draw about the effect of bupropion in the standard counseling group?

Page 39: Pitfalls of Hypothesis Testing + Sample Size Calculations

Authors’ conclusions/Media coverage…

“Among the women who received standard counseling, bupropion did not appear to improve quit rates or time to relapse.”

“For the women who received standard counseling, taking bupropion didn't seem to make a difference.”

Page 40: Pitfalls of Hypothesis Testing + Sample Size Calculations

Correct take-home message…

Bupropion improves quitting rates over counseling alone.
- Main effect for drug is significant.
- Main effect for counseling type is NOT significant.
- Interaction between drug and counseling type is NOT significant.

Page 41: Pitfalls of Hypothesis Testing + Sample Size Calculations

Pitfall 5: the fallacy of comparing statistical significance

"The effect was significant in the treatment group, but not significant in the control group" does not imply that the groups differ significantly.

Page 42: Pitfalls of Hypothesis Testing + Sample Size Calculations

Example

In a placebo-controlled randomized trial of DHA oil for eczema, researchers found a statistically significant improvement in the DHA group but not the placebo group.

The abstract reports: "DHA, but not the control treatment, resulted in a significant clinical improvement of atopic eczema."

However, the improvement in the treatment group was not significantly better than the improvement in the placebo group, so this is actually a null result.

Page 43: Pitfalls of Hypothesis Testing + Sample Size Calculations

Misleading “significance comparisons”

Koch C, Dölle S, Metzger M, et al. Docosahexaenoic acid (DHA) supplementation in atopic eczema: a randomized, double-blind, controlled trial. Br J Dermatol 2008;158:786-792.

The improvement in the DHA group (18%) is not significantly greater than the improvement in the control group (11%).

Page 44: Pitfalls of Hypothesis Testing + Sample Size Calculations

Within-group vs. between-group tests

Examples of statistical tests used to evaluate within-group effects versus between-group effects:

Statistical tests for within-group effects | Statistical tests for between-group effects
Paired t-test                              | Two-sample t-test
Wilcoxon signed-rank test                  | Wilcoxon rank-sum test (equivalently, Mann-Whitney U test)
Repeated-measures ANOVA, time effect       | ANOVA; repeated-measures ANOVA, group×time effect
McNemar's test                             | Difference in proportions, chi-square test, or relative risk

Page 45: Pitfalls of Hypothesis Testing + Sample Size Calculations

Also applies to interactions…

Similarly, "we found a significant effect in subgroup 1 but not subgroup 2" does not constitute proof of an interaction. For example, if the effect of a drug is significant in men, but not in women, this is not proof of a drug-gender interaction.

Page 46: Pitfalls of Hypothesis Testing + Sample Size Calculations

Within-subgroup significance vs. interaction

Rates of biochemically verified prolonged abstinence at 3, 6, and 12 months from a four-arm randomized trial of smoking cessation*

Months after quit target date | Weight-focused: Bupropion abstinence (n=106) | Weight-focused: Placebo abstinence (n=87) | P-value | Standard: Bupropion abstinence (n=89) | Standard: Placebo abstinence (n=67) | P-value | P-value for interaction between bupropion and counseling type**
3                             | 41%                                          | 18%                                       | .001    | 33%                                   | 19%                                 | .07     | .42
6                             | 34%                                          | 11%                                       | .001    | 21%                                   | 10%                                 | .08     | .39
12                            | 24%                                          | 8%                                        | .006    | 19%                                   | 7%                                  | .05     | .79

*From Tables 2 and 3: Levine MD, Perkins KS, Kalarchian MA, et al. Bupropion and Cognitive Behavioral Therapy for Weight-Concerned Women Smokers. Arch Intern Med 2010;170:543-550. **Interaction p-values were newly calculated from logistic regression based on the abstinence rates and sample sizes shown in this table.
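An interaction p-value of this kind can be approximated from the table alone with a Wald test on the difference of log odds ratios. This sketch back-calculates counts from the reported rates and group sizes, so it approximates rather than reproduces the published analysis:

```python
from math import log, sqrt, erf

def log_or_and_var(p_treat, n_treat, p_ctrl, n_ctrl):
    """Log odds ratio and its variance, reconstructed from rates and n's."""
    a = round(p_treat * n_treat); b = n_treat - a   # treated: events / non-events
    c = round(p_ctrl * n_ctrl);   d = n_ctrl - c    # control: events / non-events
    return log((a * d) / (b * c)), 1/a + 1/b + 1/c + 1/d

# 3-month abstinence rates from the table
lor1, v1 = log_or_and_var(0.41, 106, 0.18, 87)  # weight-focused counseling arm
lor2, v2 = log_or_and_var(0.33, 89, 0.19, 67)   # standard counseling arm

# Wald test: does the bupropion effect differ between counseling types?
z = (lor1 - lor2) / sqrt(v1 + v2)
p_interaction = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(round(p_interaction, 2))
```

This lands close to the .42 shown in the 3-month row: the drug effect looks bigger in the weight-focused arm, but the difference between arms is nowhere near significant.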

Page 47: Pitfalls of Hypothesis Testing + Sample Size Calculations

Statistical Power

Statistical power is the probability of finding an effect if it’s real.

Page 48: Pitfalls of Hypothesis Testing + Sample Size Calculations

Can we quantify how much power we have for given sample sizes?

Page 49: Pitfalls of Hypothesis Testing + Sample Size Calculations

Study 1: 263 cases, 1241 controls

Null distribution: difference = 0. Clinically relevant alternative: difference = 10%.

For a 5% significance level, one-tail area = 2.5% (Zα/2 = 1.96).

Rejection region: any value ≥ 6.5 (0 + 3.3×1.96).

Power = the chance of being in the rejection region if the alternative is true = the area to the right of this line (in yellow).

Page 50: Pitfalls of Hypothesis Testing + Sample Size Calculations

Study 1: 263 cases, 1241 controls

Rejection region: any value ≥ 6.5 (0 + 3.3×1.96).

Power = the chance of being in the rejection region if the alternative is true = the area to the right of this line (in yellow). Power here is >80%.

Page 51: Pitfalls of Hypothesis Testing + Sample Size Calculations

Study 1: 50 cases, 50 controls

Critical value = 0 + 10×1.96 = 20 (one-tail area = 2.5%, Zα/2 = 1.96).

Power is closer to 20% now.

Page 52: Pitfalls of Hypothesis Testing + Sample Size Calculations

Study 2: 18 treated, 72 controls, SD = 2

Clinically relevant alternative: difference = 4 points.

Critical value = 0 + 0.52×1.96 ≈ 1.

Power is nearly 100%!

Page 53: Pitfalls of Hypothesis Testing + Sample Size Calculations

Study 2: 18 treated, 72 controls, SD = 10

Critical value = 0 + 2.59×1.96 ≈ 5.

Power is about 40%.

Page 54: Pitfalls of Hypothesis Testing + Sample Size Calculations

Study 2: 18 treated, 72 controls, effect size = 1.0

Clinically relevant alternative: difference = 1 point.

Critical value = 0 + 0.52×1.96 ≈ 1.

Power is about 50%.
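The power numbers on the last few slides all follow one pattern: power = P(Z > Zα/2 − effect/SE), the area of the alternative distribution beyond the critical value. A sketch using the slides' numbers:

```python
from math import sqrt, erf

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(effect, se, z_alpha=1.96):
    # Chance of landing beyond the critical value (z_alpha * se)
    # when the true mean difference is `effect`
    return 1 - phi(z_alpha - effect / se)

print(round(power(effect=4, se=0.52), 3))  # SD=2 case: nearly 100%
print(round(power(effect=4, se=2.59), 3))  # SD=10 case: roughly 40%
print(round(power(effect=1, se=0.52), 3))  # 1-point difference: about 50%
```

Note how the same sample sizes give wildly different power depending on the standard deviation and on the effect you consider clinically relevant.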

Page 55: Pitfalls of Hypothesis Testing + Sample Size Calculations

Factors Affecting Power

1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired

Page 56: Pitfalls of Hypothesis Testing + Sample Size Calculations

1. Bigger difference from the null mean

[Figure: null vs. clinically relevant alternative distributions of average weight from samples of 100]

Page 57: Pitfalls of Hypothesis Testing + Sample Size Calculations

2. Bigger standard deviation

[Figure: distributions of average weight from samples of 100]

Page 58: Pitfalls of Hypothesis Testing + Sample Size Calculations

3. Bigger sample size

[Figure: distributions of average weight from samples of 100]

Page 59: Pitfalls of Hypothesis Testing + Sample Size Calculations

4. Higher significance level

[Figure: distributions of average weight from samples of 100, with rejection region marked]

Page 60: Pitfalls of Hypothesis Testing + Sample Size Calculations

Sample size calculations

Based on these elements, you can write a formal mathematical equation that relates power, sample size, effect size, standard deviation, and significance level…

Page 61: Pitfalls of Hypothesis Testing + Sample Size Calculations

Simple formula for difference in proportions:

n = 2 p̄(1 − p̄)(Zβ + Zα/2)² / (p₁ − p₂)²

where:
- n = sample size in each group (assumes equal-sized groups)
- Zβ represents the desired power (typically .84 for 80% power)
- Zα/2 represents the desired level of statistical significance (typically 1.96)
- p̄(1 − p̄) is a measure of variability (similar to standard deviation), with p̄ the average of the two proportions
- p₁ − p₂ is the effect size (the difference in proportions)
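As a quick check of the formula (the 10% vs. 20% proportions are hypothetical):

```python
from math import ceil

def n_per_group_props(p1, p2, z_beta=0.84, z_alpha=1.96):
    """Sample size per group for a difference in proportions (equal groups)."""
    p_bar = (p1 + p2) / 2
    return ceil(2 * p_bar * (1 - p_bar) * (z_beta + z_alpha) ** 2
                / (p1 - p2) ** 2)

# To detect 10% vs. 20% with 80% power at alpha = .05:
print(n_per_group_props(0.10, 0.20))  # 200 per group
```

Halving the detectable difference roughly quadruples the required n, since the effect size enters squared in the denominator.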

Page 62: Pitfalls of Hypothesis Testing + Sample Size Calculations

Simple formula for difference in means:

n = 2σ²(Zβ + Zα/2)² / (difference)²

where:
- n = sample size in each group (assumes equal-sized groups)
- Zβ represents the desired power (typically .84 for 80% power)
- Zα/2 represents the desired level of statistical significance (typically 1.96)
- σ = standard deviation of the outcome variable
- difference = effect size (the difference in means)
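The means formula translates the same way (the SD = 10, 5-point difference scenario is hypothetical):

```python
from math import ceil

def n_per_group_means(sd, difference, z_beta=0.84, z_alpha=1.96):
    """Sample size per group for a difference in means (equal groups)."""
    return ceil(2 * sd**2 * (z_beta + z_alpha) ** 2 / difference ** 2)

# To detect a 5-point difference when SD = 10, with 80% power at alpha = .05:
print(n_per_group_means(sd=10, difference=5))  # 63 per group
```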

Page 63: Pitfalls of Hypothesis Testing + Sample Size Calculations

Sample size calculators on the web…

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
http://calculators.stat.ucla.edu
http://hedwig.mgh.harvard.edu/sample_size/size.html

Page 64: Pitfalls of Hypothesis Testing + Sample Size Calculations

These sample size calculations are idealized

•They do not account for losses to follow-up (prospective studies)

•They do not account for non-compliance (intervention trials/RCTs)

•They assume that individuals are independent observations (not true in clustered designs)

•Consult a statistician!
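One common rule-of-thumb adjustment (standard practice, not from the slides) is to inflate the calculated n so that the required number remains after expected attrition:

```python
from math import ceil

def inflate_for_dropout(n_required, dropout_rate):
    """Enroll enough people that n_required remain after attrition."""
    return ceil(n_required / (1 - dropout_rate))

# If the formula says 200 per group and you expect 25% loss to follow-up:
print(inflate_for_dropout(200, 0.25))  # 267 per group
```

This only patches the arithmetic; it does nothing for the bias that differential dropout or non-compliance can introduce, which is part of why the slide says to consult a statistician.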

Page 65: Pitfalls of Hypothesis Testing + Sample Size Calculations

Review Question 1

Which of the following elements does not increase statistical power?

a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size

Page 66: Pitfalls of Hypothesis Testing + Sample Size Calculations

Review Question 2

Most sample size calculators ask you to input a value for σ. What are they asking for?

a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of variation
e. The variance

Page 67: Pitfalls of Hypothesis Testing + Sample Size Calculations

Review Question 3

For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?

a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level

Page 68: Pitfalls of Hypothesis Testing + Sample Size Calculations

Homework

Problem Set 5

Reading: The Problem of Multiple Testing; Misleading Comparisons: The Fallacy of Comparing Statistical Significance (on Coursework)

Reading: Chapters 22-29, Vickers

Journal article/article review sheet