biostat lecture series 2019 sample size and power

BIOSTAT LECTURE SERIES 2019 SAMPLE SIZE AND POWER Wei Hou, PhD Email: [email protected] Apr 17 th , 2019 1


BIOSTAT LECTURE SERIES 2019

SAMPLE SIZE AND POWER

Wei Hou, PhD

Email: [email protected]

Apr 17th, 2019


Outline

Post-hoc power

Intuition behind sample size and power calculation

Common sample size formula for different tests

What to bring when meeting with a statistician


Question

Have you ever been asked by a reviewer or editor to calculate post-hoc power (observed power) when publishing non-significant results?

Bababekov et al. (2018):

“we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study.”Bababekov, Y. J., Stapleton, S. M., Mueller, J. L., Fong, Z. V., and Chang, D. C. (2018). A proposal to mitigate the consequences of type 2 error in surgical science. Annals of Surgery 267, 621-622


Hypothesis

Research question: will taking vitamin D during antibiotic treatment help patients recover faster?

Primary outcome: percentage of patients recovered after the 1st round of treatment

Hypothesis: H0: P1 ≤ P0 vs H1: P1 > P0

What test to use?


Type I and II errors

| Truth | Result of statistical test: fail to reject null hypothesis (test shows that Vit D is NOT superior) | Result of statistical test: reject null hypothesis (test shows that Vit D is superior) |
| --- | --- | --- |
| Null hypothesis is TRUE (Vit D is NOT superior) | Correct decision | Type I error (false positive), α |
| Null hypothesis is FALSE (Vit D is superior) | Type II error (false negative), β | Correct decision |


Quick review

Type I error: false positive

Type II error: false negative

α: P(Type I error) = P(reject H0 | H0 is true)

β: P(Type II error) = P(fail to reject H0 | H1 is true)

Power = 1 − β = P(reject H0 | H1 is true)

P-value = P(observing a difference as large as or larger than the observed difference | H0 is true)

Reject H0 when P-value < α

Null hypothesis H0 is assumed to be true until proven otherwise.
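
The definitions above can be made concrete with a small sketch. It applies a one-sided two-proportion Z-test to the vitamin D question. The function name and the counts (24 of 60 versus 12 of 60) are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_test(x1, n1, x0, n0, alpha=0.05):
    """Z-test of H0: P1 <= P0 against H1: P1 > P0 using the pooled proportion."""
    p1, p0 = x1 / n1, x0 / n0
    pooled = (x1 + x0) / (n1 + n0)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se
    # P-value: probability of a difference at least this large if H0 is true
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha


# Hypothetical data: 24/60 recover with vitamin D, 12/60 recover with placebo
z, p_value, reject = one_sided_two_proportion_test(24, 60, 12, 60)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Here H0 is rejected only because the p-value falls below α; the p-value itself says nothing about β or power.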


Why does an editor request Post-hoc power?

When you have a non-significant (negative) result, the editor wants to know whether it is a true negative or a false negative (β, concluding there is no effect when there actually is an effect).

Unfortunately, reporting observed power does not answer the question. The reported observed power does not provide any information about whether the result is a true negative or not.


Observed Power is not meaningful

Hoenig and Heisey (2001): “For any test the observed power is a 1:1 function of the p value. When a test is marginally significant (P = .05), the estimated power is 50%.”

Reporting observed power is just another way of reporting the p value.

Figure source: Hoenig and Heisey (2001)
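
The one-to-one relationship can be checked numerically. The sketch below assumes a two-sided Z-test and defines observed power as the power computed when the true standardized effect is set equal to the observed one. The function name is hypothetical, and this is an illustration rather than the calculation used in the cited paper.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Observed (post-hoc) power of a two-sided Z-test, given only its p value."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # |z| implied by the two-sided p value
    z_crit = nd.inv_cdf(1 - alpha / 2)    # critical value of the test
    # Power if the true standardized effect equals the observed one
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

Because the function takes nothing but the p value, it cannot carry information beyond the p value, and a non-significant result always maps to an observed power of about 50% or less.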


More about observed Power

Yuan and Maxwell (2005): Observed power “is almost always a biased estimator of the true power”

Hoenig and Heisey (2001): “higher observed power does not imply stronger evidence for a null hypothesis that is not rejected”.

Say to the editor/reviewer.


When do we need power and sample size calculation?

Fundamentals of Clinical Trials (4th edition, 2010): Clinical trials should have sufficient statistical power to detect differences between groups considered to be of clinical importance. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of planning.

• Statistical analysis follows study design

• Power and sample size calculation is based on the primary analysis

• A real screenshot from a recent grant review:


Hypothesis Testing, Significance Level & Power

Suppose the primary outcome is binary (e.g. recovery rate), and we want to test whether a true difference exists in the recovery rates of two groups.

H0: P1 = P0

Assume the true difference is δ = P1 − P0; then β, or power (1 − β), depends on δ, N, and α.


Power curve: different N

Power curve: different α
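
The plots for these two slides did not survive in the transcript, but their shape can be tabulated. The sketch below uses a normal approximation for a two-sided two-sample Z-test of proportions, with the pooled standard error under both hypotheses and the negligible opposite tail ignored. The function name and the grid of N and α values are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p0, p1, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample Z-test for proportions."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p0 + p1) / 2
    se = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)  # pooled standard error
    return nd.cdf(abs(p1 - p0) / se - z_crit)


# Power for a 20% vs 40% comparison over a grid of N (per group) and alpha
for alpha in (0.01, 0.05, 0.10):
    for n in (25, 50, 100, 200):
        power = power_two_proportions(0.20, 0.40, n, alpha)
        print(f"alpha = {alpha:.2f}, N per group = {n:3d}, power = {power:.3f}")
```

For a fixed difference, power rises with N and with α, which is what the two families of curves show.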


Power Formula

Depends on study design

Not hard, but can be VERY algebra intensive

Consult with a statistician

Use software, e.g. G*Power (free), PASS, R (free), SAS, etc.


Analysis Follows Study Design

Randomized controlled trial (RCT)

Stratified randomized trial

Non-inferiority trial

Cross-over study

Non-randomized intervention study

Observational study

Prevalence study

Measuring sensitivity and specificity


Types of analysis

In a parallel study, we are comparing the HbA1c levels of two randomized groups: two-sample independent t-test

In a cross-over study of COPD, we are comparing the exercise duration times of the same person (treated and on placebo): paired t-test


Types of analysis

In a cancer study, we want to compare the response rate between a new drug and a placebo: chi-square test or Z test for proportions

A study examined changes in smoking status after an intervention. The same participants were asked before and again after the intervention: McNemar’s test for paired data

A randomized clinical trial to compare the 10-year overall survival between a new drug and a placebo in women with invasive breast cancer: Kaplan-Meier curve and log-rank test


Sample Size Formula Based on Analysis

Variables of interest

Type of data, e.g. continuous, categorical

Desired power

Desired significance level

Effect/difference of clinical importance

Standard deviations of continuous outcome variables

One or two-sided tests


Phase I: Dose Escalation

Dose limiting toxicity (DLT) must be defined

Decide a few dose levels (e.g. 4)

At least three patients will be treated at each dose level (cohort)

Not a power or sample size calculation issue

Patients do not enter a new dose level until all patients at the previous level have passed the time frame in which DLT is assessed


Phase II Example: Two-Stage Optimal Design

Single arm, two stage, using an optimal design and a predefined response

Rule out a response probability of 20% (H0: p ≤ 0.20)

Level that demonstrates useful activity is 40% (H1: p ≥ 0.40)

Let α = 0.1 (10% probability of accepting a poor agent)

Let β = 0.1 (10% probability of rejecting a good agent)

Charts in the Simon (1989) paper cover different values of p1 − p0 and varying α and β values

Blow up: Simon (1989) Table

Phase II Example

Initially enroll 17 patients.

If 0-3 of the 17 have a clinical response, then stop accrual and assume the agent is not active.

If ≥ 4/17 respond, then accrual will continue to 37 patients.


Phase II Example

If 4-10 of the 37 respond, this is insufficient activity to continue.

If ≥ 11/37 respond, then the agent will be considered active.

Under this design, if the null hypothesis were true (20% response probability), there is a 55% probability of early termination.
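
The early-termination probability quoted above follows directly from the binomial distribution. The sketch below evaluates the design described on these slides (stop if at most 3 of 17 respond; declare the agent active if at least 11 of 37 respond). The function names are hypothetical, and the code only evaluates a given design; it does not search for the optimal one as Simon (1989) does.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k); returns 0 when k is negative."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def simon_two_stage(p, r1=3, n1=17, r=10, n=37):
    """Operating characteristics at true response probability p.

    Stop after stage 1 if responses <= r1; declare the agent active
    if total responses after stage 2 exceed r.
    """
    pet = binom_cdf(r1, n1, p)                    # probability of early termination
    expected_n = n1 + (1 - pet) * (n - n1)        # expected sample size
    n2 = n - n1
    prob_active = sum(
        binom_pmf(x1, n1, p) * (1 - binom_cdf(r - x1, n2, p))
        for x1 in range(r1 + 1, n1 + 1)
    )
    return pet, expected_n, prob_active


pet0, en0, alpha = simon_two_stage(0.20)   # under H0
_, _, power = simon_two_stage(0.40)        # under H1
print(f"PET(p0) = {pet0:.3f}, E[N | p0] = {en0:.1f}")
print(f"type I error = {alpha:.3f}, power = {power:.3f}")
```

The expected sample size under H0 is the first-stage size plus the second-stage size weighted by the chance of continuing.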


Sample Size Differences

If the null hypothesis (H0) is true

Using two-stage optimal design

On average 26 subjects enrolled

Using a 1-sample test of proportions

36 patients based on one-sided binomial test

Using a 2-sample randomized test of proportions

77 patients per group based on one-sided Fisher’s exact test


Phase III RCT: Continuous Outcomes

Suppose we want to compare a continuous outcome (e.g. HbA1c) between intervention and control groups. Hypotheses:

$H_0: \mu_1 - \mu_0 = 0$ vs $H_1: \mu_1 - \mu_0 \neq 0$

Assuming the variance $\sigma^2$ is known, the total sample size needed (N per group) is

$$N_{\text{Total}} = 2N = \frac{4 (Z_{\alpha/2} + Z_\beta)^2 \sigma^2}{\delta^2}$$

δ denotes the true difference between $\mu_1$ and $\mu_0$

$P(x > Z_{\alpha/2}) = \alpha/2$, where x is from the standard normal distribution

If $\sigma^2$ is unknown, the effect size could be expressed as the standardized difference $\delta/\sigma$

If a one-sided test is used, substitute $Z_{\alpha/2}$ with $Z_\alpha$


Phase III RCT: Continuous Outcomes

The Effect of Non-surgical Periodontal Therapy on Hemoglobin A1c Levels in Persons with Type 2 Diabetes and Chronic Periodontitis: A Randomized Clinical Trial, Engebretson et al. 2013, JAMA

The treatment group received scaling and root planing plus chlorhexidine oral rinse at baseline, and supportive periodontal therapy at three and six months. The control group received no treatment for six months.

We assume a clinically meaningful difference of 0.6% in HbA1c between the two arms with a standard deviation of 2%.


Phase III RCT: Continuous Outcomes

Two independent samples, two-sided test

Set α = .05, β = .10 (90% power)

Then $Z_{0.025} = 1.96$, $Z_{0.1} = 1.282$

Set $\delta = 0.6\%$, $\sigma = 2\%$

Then $2N = 4 (1.96 + 1.282)^2 (2)^2 / 0.6^2 \approx 468$
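
The arithmetic on this slide can be reproduced in a few lines. The sketch below takes the Z values as inputs so that the slide's rounded values (1.96 and 1.282) can be compared with unrounded normal quantiles. The function name is hypothetical, and rounding up to the next whole participant is an assumption.

```python
from math import ceil
from statistics import NormalDist


def total_n_continuous(delta, sigma, z_alpha, z_beta):
    """Total sample size 2N = 4 (Z_alpha + Z_beta)^2 sigma^2 / delta^2."""
    return 4 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2


# Rounded Z values as used on the slide
n_slide = total_n_continuous(delta=0.6, sigma=2.0, z_alpha=1.96, z_beta=1.282)

# Unrounded quantiles for two-sided alpha = 0.05 and 90% power
nd = NormalDist()
n_exact = total_n_continuous(0.6, 2.0, nd.inv_cdf(0.975), nd.inv_cdf(0.90))

print(f"rounded Z values:   2N = {n_slide:.1f} -> {ceil(n_slide)}")
print(f"unrounded Z values: 2N = {n_exact:.1f} -> {ceil(n_exact)}")
```

The two versions differ by about one participant, which is well inside the uncertainty of the assumed δ and σ.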


Phase III RCT: Continuous Outcomes G*Power


Phase III RCT: Binary Outcomes

Suppose we want to compare the response rates between placebo and vitamin D treatment (20% vs 40%).

$H_0: p_1 - p_0 = 0$ vs $H_1: p_1 - p_0 \neq 0$

Based on a Z-test, the total sample size needed (N per group) is

$$2N = \frac{4 (Z_{\alpha/2} + Z_\beta)^2 \, \bar{p}(1 - \bar{p})}{(p_1 - p_0)^2}$$

• $\bar{p} = (p_1 + p_0)/2$ is the pooled proportion

• If a one-sided test is used, substitute $Z_{\alpha/2}$ with $Z_\alpha$


Phase III RCT: Binary Outcomes

Two independent samples, two-sided Z test

Set α = .05, β = .10 (90% power)

Then $Z_{0.025} = 1.960$, $Z_{0.1} = 1.282$

Set $p_1 = .40$, $p_0 = .20$

Then $\bar{p} = (.4 + .2)/2 = .30$

And $2N = 4 (1.960 + 1.282)^2 (.3)(.7) / (.4 - .2)^2 \approx 220$
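
The same calculation for the binary outcome, as a sketch with the slide's inputs. The function name is hypothetical. The unrounded result is about 220.7, so a calculation that rounds up would report 221 rather than the 220 shown on the slide.

```python
from math import ceil


def total_n_binary(p1, p0, z_alpha, z_beta):
    """Total sample size 2N = 4 (Z_alpha + Z_beta)^2 p_bar (1 - p_bar) / (p1 - p0)^2."""
    p_bar = (p1 + p0) / 2  # pooled proportion
    return 4 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p0) ** 2


# Inputs from the slide: 40% vs 20%, two-sided alpha = 0.05, 90% power
n_total = total_n_binary(p1=0.40, p0=0.20, z_alpha=1.960, z_beta=1.282)
print(f"2N = {n_total:.1f}; rounded up: {ceil(n_total)} in total")
```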


Phase III RCT: Binary Outcomes G*Power


Sample Size for Testing Non-inferiority

Suppose we want to test whether a new treatment is equivalent to an established treatment in response rate.

Can we propose the hypotheses as follows:

H0: P1 − P0 ≠ 0 vs H1: P1 − P0 = 0 ???

Use the formula

$$2N = \frac{4 (Z_{\alpha/2} + Z_\beta)^2 \, \bar{p}(1 - \bar{p})}{(p_1 - p_0)^2}$$

With $p_1 - p_0 = 0$ the formula gives N = ∞, so we will never reject H0


Sample Size for Testing Non-inferiority

How about using the original hypotheses:

H0: P1 = P0 vs H1: P1 ≠ P0 ???

Calculate the sample size based on “Fail to reject H0”?

However, failure to reject H0 is not sufficient to claim that the two groups are equal; it only shows that the evidence is inadequate to say they are different.


Sample Size for Testing Non-inferiority

No statistical method can demonstrate complete equivalence

We can pre-define a margin of difference, δ

H0: the two groups differ by at least δ

H1: the two groups differ by less than δ

Then we can use the previous formula.

Dichotomous response ($p_1 = p_0 = p$):

$$2N = \frac{4 p (1 - p) (Z_\alpha + Z_\beta)^2}{\delta^2}$$

Continuous response:

$$2N = \frac{4 (Z_\alpha + Z_\beta)^2}{(\delta/\sigma)^2}$$
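
A short sketch of the two margin-based formulas. All inputs (response rate 0.60, margin 0.10, standardized margin 0.3, one-sided α = 0.05, 90% power) and the function names are hypothetical and chosen only to show how strongly the margin δ drives the sample size.

```python
from math import ceil
from statistics import NormalDist


def total_n_margin_binary(p, delta, alpha=0.05, power=0.90):
    """2N = 4 p (1 - p) (Z_alpha + Z_beta)^2 / delta^2 for a common response rate p."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    return 4 * p * (1 - p) * z_sum**2 / delta**2


def total_n_margin_continuous(delta_over_sigma, alpha=0.05, power=0.90):
    """2N = 4 (Z_alpha + Z_beta)^2 / (delta / sigma)^2."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    return 4 * z_sum**2 / delta_over_sigma**2


# Hypothetical inputs: p = 0.60 with margin 0.10; standardized margin 0.3
n_bin = total_n_margin_binary(p=0.60, delta=0.10)
n_cont = total_n_margin_continuous(delta_over_sigma=0.3)
print(f"binary: 2N = {ceil(n_bin)}, continuous: 2N = {ceil(n_cont)}")

# Halving the margin quadruples the required sample size
print(f"margin 0.05: 2N = {ceil(total_n_margin_binary(0.60, 0.05))}")
```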


Multiple response variables

More than one question is equally important

More than one primary variable is used to assess a single primary question

Multiple response variables are correlated

Multiple testing issues: when multiple comparisons are made, the chance of finding a significant difference in one of the comparisons (when, in fact, no real differences exist between groups) is greater than the stated significance level.

α needs to be adjusted to control the familywise error rate.
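
The inflation is easy to quantify when the comparisons are independent, which is an assumption made here for simplicity (correlated response variables inflate the error less). The sketch also shows the Bonferroni adjustment as one common choice; the lecture does not name a specific method.

```python
def familywise_error(alpha, k):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test significance level under the Bonferroni adjustment."""
    return alpha / k


for k in (1, 2, 5, 10):
    unadjusted = familywise_error(0.05, k)
    adjusted = familywise_error(bonferroni_alpha(0.05, k), k)
    print(f"k = {k:2d}: unadjusted = {unadjusted:.3f}, Bonferroni = {adjusted:.3f}")
```

A smaller per-test α means a larger Z value in the sample size formula, so the adjustment has to be decided before the sample size is calculated.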


Interim Analysis

Analysis of the data before the study has ended, with the intention of possibly terminating the study early

If traditional tests are used at both the middle and the end of the study, the Type I error is inflated

To maintain the overall Type I error, α needs to be adjusted at each interim analysis

| Number of interim analyses | 0 | 1 | 4 | 9 |
| --- | --- | --- | --- | --- |
| Type I error | 0.05 | 0.08 | 0.14 | 0.20 |
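
The inflation in the table can be approximated by simulation. The sketch below assumes equally spaced looks, a normally distributed test statistic, and an unadjusted two-sided test at α = 0.05 at every look; the number of looks is the number of interim analyses plus the final analysis. The function name, seed, and simulation size are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def overall_type1_error(n_looks, alpha=0.05, n_sim=20000, seed=2019):
    """Monte Carlo estimate of P(reject H0 at any look) when H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one equal-sized block of data under H0
            if abs(running_sum / sqrt(look)) > z_crit:  # cumulative Z statistic
                rejections += 1
                break
    return rejections / n_sim


for interim in (0, 1, 4, 9):
    estimate = overall_type1_error(interim + 1)
    print(f"{interim} interim analyses: type I error about {estimate:.3f}")
```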


Non-adherence adjustment

Drop-out and drop-in can both happen in an RCT

According to ITT, these participants remain in the analysis

They tend to dilute any difference between the two groups which might be produced by the intervention

A simple formula for nonadherence adjustment (Lachin, 1981):

$$N^* = \frac{N}{(1 - R_0 - R_1)^2}$$

where $R_0$ is the drop-out rate and $R_1$ is the drop-in rate.
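
A minimal sketch of the adjustment. The drop-out and drop-in rates (10% and 5%) are hypothetical, the starting N of 468 is taken from the continuous-outcome example earlier in the lecture, and the function name is an assumption.

```python
from math import ceil


def adjust_for_nonadherence(n, dropout_rate, dropin_rate):
    """Inflate a sample size by 1 / (1 - R0 - R1)^2 to offset nonadherence."""
    retained = 1 - dropout_rate - dropin_rate
    if retained <= 0:
        raise ValueError("dropout_rate + dropin_rate must be below 1")
    return n / retained**2


# Hypothetical rates: 10% drop-out and 5% drop-in
n_adjusted = adjust_for_nonadherence(468, dropout_rate=0.10, dropin_rate=0.05)
print(f"adjusted sample size: {ceil(n_adjusted)}")
```

Because the inflation factor is squared, a combined nonadherence of 15% raises the sample size by nearly 40%.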


Sample size based on confidence interval

The desired confidence interval width is used for the sample size calculation

For testing the null hypothesis of no treatment effect, hypothesis testing and confidence intervals give the same conclusion

The CI method might yield a power of only 50% to detect a difference equal to the half-width of the confidence interval
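
The 50% figure can be verified for a two-group comparison of means with known σ. The sketch below sizes the study so that the 95% confidence interval for the difference has a chosen half-width, then computes the power to detect a true difference of exactly that size. The function names and the inputs (σ = 2, half-width 0.6) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def n_per_group_for_half_width(half_width, sigma, conf=0.95):
    """Per-group n so the CI for a difference in means has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return 2 * (z * sigma / half_width) ** 2


def power_two_means(delta, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample Z-test for a true difference delta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n_per_group)
    return (1 - nd.cdf(z_crit - delta / se)) + nd.cdf(-z_crit - delta / se)


n = n_per_group_for_half_width(half_width=0.6, sigma=2.0)
power = power_two_means(delta=0.6, sigma=2.0, n_per_group=n)
print(f"n per group = {n:.1f}, power to detect the half-width = {power:.3f}")
```

When the true difference equals the half-width, the estimate lands above the critical value about half the time, so sizing by CI width alone leaves the study underpowered for that difference.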


Estimating Sample Size Parameters

Obtaining reliable estimates (e.g. effect size or standard deviation) can be challenging

Use pilot studies to refine estimates

Use an adaptive design, which modifies the sample size based on updated estimates


Effect size

Cohen’s d:

Measures the magnitude of a treatment effect

Unlike significance tests, it is independent of sample size

Widely used in meta-analyses

Cohen (1988) hesitantly defined effect sizes as "small", "medium" and "large", stating that "there is a certain risk inherent in offering conventional operational definitions for those terms for use in power analysis in as diverse a field of inquiry as behavioral science" (p. 25).

Effect size d

Small 0.20

Medium 0.50

Large 0.80
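
For reference, a sketch of how Cohen's d is computed from two samples, using the pooled standard deviation (the usual definition, assumed here since the slide does not give the formula). The function name and the HbA1c values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group0):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n0 = len(group1), len(group0)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n0 - 1) * variance(group0)
    ) / (n1 + n0 - 2)
    return (mean(group1) - mean(group0)) / sqrt(pooled_var)


# Hypothetical HbA1c values (%) in two small groups
control = [7.6, 7.2, 7.9, 7.4, 7.8, 7.5]
treated = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0]
print(f"Cohen's d (treated vs control) = {cohens_d(treated, control):.2f}")
```

The sign of d depends on the order of the groups; power calculations use its absolute value.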


Approximate Nature

Parameters used in the calculation are estimates

Estimate of the relative effectiveness may be based on a population different from that intended to be studied

The effectiveness is often overestimated

Revisions of inclusion and exclusion criteria may influence the type of participants entering the trial

Mathematical models used may only approximate the true, but unknown, distribution of the response variables

So the PI should be as conservative as can be justified, while still being realistic, in estimating the parameters used in the calculation!


More Notes

The study’s primary outcome is the variable you do the sample size calculation for

If secondary outcome variables are considered important, make sure the sample size is sufficient for them as well

Increase the ‘real’ sample size to reflect loss to follow-up, lack of compliance, etc.


What does a statistician need?

Primary hypothesis

Including null and alternative hypotheses

Study design

RCT? Cross-over?

Data types of primary endpoints

Continuous or dichotomous

Significance level

Usually 0.05. Needs to be adjusted for interim tests or multiple endpoints

Value of other parameters

Standard deviation – from pilot study or published data

Smallest effect size that is clinically meaningful

e.g. 0.5D in myopia studies

Intended power

Usually 80%. Sometimes 90% for large studies.

But

https://www.youtube.com/watch?v=PbODigCZqL8


Research Flow Chart

Questions → Hypotheses → Experimental Design → Samples → Data → Analyses → Conclusions

Take all of your study information to a statistician early and often


Quiz time!

True or False?

Sample size: N ↑ → power ↑

Significance level: α ↑ → power ↓

Effect size: δ ↑ → power ↓

Variation (continuous outcome): σ² ↑ → power ↑

One-tailed test power < two-tailed test power


Thank you!

Q. How many statisticians does it take to change a light bulb?

A. That depends. It is really a matter of power.

From: Stuart Howell

https://jcdverha.home.xs4all.nl/scijokes/1_2.html