Sample size determination
TRANSCRIPT
Want a copy of these slides?
Email me: [email protected]
SlideShare*: http://www.slideshare.net/GraemeHickey
*Will put up all sessions very shortly
The questions…
How many subjects do I need to include in my
study?
I can only afford to test 50 rats; is that enough?
I want to estimate the proportion of IGH
researchers who will vote for The Green Party;
how many should I ask?
Why bother?
Scientific: might miss out on an important discovery
(testing too few), or find a clinically irrelevant effect size
(testing too many)
Ethical: might sacrifice subjects unnecessarily (testing too many) or
expose subjects when the chance of study success is low (testing too few)
Economical: might waste money (testing too many) or
have to repeat the experiment (testing too few)
Also, generally required for study grant proposals
What not to do
Use same sample size as another (possibly similar) study
Might have just gotten lucky
Base sample size on what is available
Extend study period, seek more money, pool study
Use a nice whole number and hope no one notices
Unless you want your paper rejected
Avoid calculating a sample size because you couldn’t estimate the parameters needed
Do a pilot study or use approximate formulae, e.g. SD ≈ (max – min) / 4
Avoid calculating a sample size because you couldn’t work one out
Speak to a statistician
Approaches
Subjects I need: how many subjects do I need to
detect the effect I consider clinically significant?
versus
Subjects I can get: does my study have adequate
power to detect the effect I consider clinically
significant?
Frameworks
Precision: I want to determine a sample size to
bound the maximum margin of error
versus
Power: I want to determine a sample size that
confers a desired level of power to detect a
clinically relevant effect size
Initial considerations
Sample size calculations are just one aspect of study design (sampling/randomization methods, problem formulation, outcome variables, covariates, …)
For example: a two-stage prevalence survey (e.g. cows) within higher clusters (e.g. farms)
1. How many farms should we sample?
2. How many cows on each farm should we sample?
Basic principles (that we discuss today) remain the same, but the math and rationale differ
Null hypothesis
A null hypothesis (H0) is a statement that one seeks to
nullify with evidence to the contrary
E.g. patients who have blocked coronary arteries
randomized to receive either bypass graft surgery or
percutaneous coronary implant
Calculate the 30-day all-cause mortality proportions: p1
(CABG) and p2 (PCI)
H0: no difference in mortality at 30 days, i.e. p1 = p2
Alternative hypothesis
• We can also specify an alternative hypothesis (H1), which
is normally the complement (i.e. ‘opposite’) of the null
• H1: there is a difference in mortality at 30 days; i.e. p1 ≠ p2
Simple vs. composite hypotheses
Simple: any hypothesis which specifies the population
distribution completely, e.g. H0: p = 0.5
Composite: any hypothesis which does not specify the
population distribution completely, e.g. H0: p > 0.5
Standard setups
Test for          Null hypothesis        Alternative hypothesis
Equality          H0: μT – μP = 0        H1: μT – μP ≠ 0
Non-inferiority   H0: μT – μP ≥ δ        H1: μT – μP < δ
Superiority       H0: μT – μP ≤ δ        H1: μT – μP > δ
Equivalence       H0: |μT – μP| ≥ δ      H1: |μT – μP| < δ
δ = effect size we consider clinically significant
μT = average in treatment group
μP = average in placebo group
P-values
Probability of observing an effect at least as extreme as the
one found, if the null hypothesis is true
If the P-value is less than some predefined significance
level, we conclude that the evidence (data) are inconsistent
with the null hypothesis, and we reject the null [in favor of
the alternative hypothesis]
Worth remembering: sometimes there is no effect
Example
Toss a coin 5 times and count the number of heads
If the coin is fair, Pr[Head] = 0.5 (null hypothesis)
It is possible I am using a ‘trick coin’, so that Pr[Head] ≠ 0.5
(alternative hypothesis)
Observe H, H, H, H, H – is this enough evidence to reject
the null?
Example
Set my significance level to 0.05
If the null hypothesis is true, the probability of observing 5 heads is (0.5)^5 ≈ 0.03
If the null hypothesis is true, the probability of observing 5 tails is (0.5)^5 ≈ 0.03
Probability of observing a result at least as extreme as the one found ≈ 0.03 + 0.03 = 0.06 (two-tailed test)
0.06 > 0.05 – therefore I do not reject the null hypothesis that my coin is fair
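The calculation above can be checked with a short Python snippet (a sketch using only the standard library; the slides round 0.03125 to 0.03):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Under H0 the coin is fair: p = 0.5, n = 5 tosses
p_five_heads = binom_pmf(5, 5, 0.5)   # (0.5)^5 = 0.03125
p_five_tails = binom_pmf(0, 5, 0.5)   # (0.5)^5 = 0.03125

# Two-tailed p-value: results at least as extreme in either direction
p_value = p_five_heads + p_five_tails
print(p_value)  # 0.0625 > 0.05, so we do not reject H0
```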
Errors
We are all (hopefully) familiar with diagnostic test errors
Type I error
False positive
Type II error
False negative
[Slide image: two pregnancy tests, one declaring “Congrats! You are pregnant!” and the other “Nope… not pregnant!”]
Courtroom outcomes
Failure to reject the null hypothesis of innocence
does not mean the defendant is innocent
We can only say that the standard of evidence was
insufficient
We cannot (in practice) prove innocence, only
present evidence to reject the hypothesis:
Pr[Innocence | Evidence ] ≠ Pr[Evidence | Innocence]
Errors will occur
Which is worse:
1. Sending an innocent person to prison? (Type I error)
2. Setting a guilty person free? (Type II error)
If you were a judge overseeing lots and lots of
cases each year, what proportion of the time would
you be willing to accept a Type I error and a Type II
error?
Significance
Probability that you reject the null hypothesis (in favor of
the alternative hypothesis) when the null hypothesis is true:
α = Pr[reject H0 | H0 is true] = Pr[accept H1 | H0 is true]
What does this mean? If we set α = 0.05, then we are willing to
wrongly reject a true null hypothesis 5% of the time (1 in every 20 tests)
Power
Probability that you reject the null hypothesis (in favor of
the alternative hypothesis) when the null hypothesis is
false:
1–β = Pr[reject H0 | H1 is true] = Pr[accept H1 | H1 is true]
Recipe for most common formulation
1. Specify hypothesis
2. Specify the significance level (α)
3. Specify the effect size that is clinically relevant
4. Specify the power (1-β)
5. Use appropriate software / formulae to determine the minimum sample size
A more preferable approach:
4. Specify sample sizes you can reasonably test (resources, ethics, etc.)
5. Use appropriate software / formulae to determine the power
Selecting the minimal clinically relevant
effect size
Preferably this is based on science, e.g. a mean reduction in weight of 3kg using a new weight loss drug compared to dieting alone would be expected to lead to a 50% reduction in stroke rates
Could base it on previous data:
• Published data
• Pilot study
• Expert scientific opinion
Don’t ask your statistician to choose this—you wouldn’t let them do your experiment / treat your patients!
Example 1: biased coin
I can afford to toss a coin n = 10 times, and want
to know if it is biased in favor of heads
p = Pr[Head on any single toss]
H0: p = 0.5
H1: p > 0.5 (one-sided)
Example 1: biased coin
I will use a significance level of α = 0.05, which means that
I should reject the null hypothesis if I observed more than 7
heads
Why?
X = a random variable denoting the number of heads
observed out of 10 tosses
X ~ Binomial(10, p)
Pr(X > 7 | H0) = 0.0547 (which is near enough)
(remember: p = 0.5 under H0)
Example 1: biased coin
So what is the power of the test?
If I wanted to be able to detect a coin that is biased
by 10%
Pr(X > 7 | p = 0.6) = 0.17
If I wanted to be able to detect a coin that is biased
by 20%
Pr(X > 7 | p = 0.7) = 0.38
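The significance level and power figures for this example can be reproduced by summing binomial tail probabilities; a sketch in Python (standard library only):

```python
from math import comb

def binom_tail(k_min, n, p):
    """Pr[X >= k_min] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

n = 10
# Reject H0 (p = 0.5) if we observe more than 7 heads, i.e. X >= 8
alpha = binom_tail(8, n, 0.5)        # 0.0547
power_10pct = binom_tail(8, n, 0.6)  # ~0.17 against p = 0.6
power_20pct = binom_tail(8, n, 0.7)  # ~0.38 against p = 0.7
```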
Software
PS: Power and Sample Size Calculation
Vanderbilt University (Windows only)
http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize
G*Power
Universität Düsseldorf (Windows + Mac OSX)
http://www.gpower.hhu.de/en.html
Statistical software (e.g. R [base + packages], Stata, SAS, …)
Many online web-calculators: google ‘sample size calculator’
Commercial software: statisticians in academia don’t use it as it costs $$$
Demonstration of G*Power
Let’s confirm our power calculations with:
n = 10
α = 0.0547
What happens if I put α = 0.05?
Reporting sample size calculations
Journals now ask for the ‘sample size rationale’
Aim to report enough detail so that an independent person could reproduce the calculation:
• Hypothesis
• Clinically relevant effect size
• One or two-tailed test
• Power
• Significance
• Test used (e.g. an exact or approximation)
• Software used
Any values used to determine the clinically relevant effect size should be given, including references to articles if available
Example 2: comparing two means
A randomized controlled trial has been planned to evaluate
a brief psychological intervention in comparison to usual
treatment in the reduction of suicidal ideation amongst
patients presenting at hospital with deliberate self-
poisoning. Suicidal ideation will be measured on the Beck
scale; the standard deviation of this scale in a previous
study was 7.7, and a difference of 5 points is considered to
be of clinical importance. It is anticipated that around one
third of patients may drop out of treatment.
Example 2: comparing two means
Required information
• Primary outcome variable = The Beck scale for suicidal
ideation. A continuous variable summarized by means.
• Standard deviation = 7.7 points
• Size of difference of clinical importance = 5 points
• Significance level = 5%
• Power = 80%
• Type of test = two-sided t-test
Example 2: comparing two means
Let’s use G*Power to calculate the sample size
Should get 39 in each arm (total n = 78)
What if I knew dropout rate was ~1/3 in RCTs?
n_required = n_calculated / (1 − 1/3) = 39 / (2/3) = 58.5 in each
arm, which we round up to 59
Therefore we want a sample size of 118
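The drop-out adjustment is simple arithmetic; a quick sketch (39 per arm is the G*Power result quoted above):

```python
import math

n_calculated = 39        # per arm, from the power calculation
dropout = 1 / 3          # anticipated drop-out from treatment
n_required = math.ceil(n_calculated / (1 - dropout))  # 58.5 -> 59
n_total = 2 * n_required
print(n_required, n_total)  # 59 118
```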
Example 2: comparing two means
Writing it up for a manuscript or grant application:
"Using a two-tailed t-test, a sample size of 39 in each group
will be sufficient to detect a difference of 5 points on the
Beck scale of suicidal ideation, assuming a standard
deviation of 7.7 points, a power of 80%, and a significance
level of 5%. This number has been increased to 59 per
group (total of 118), to allow for a predicted drop-out from
treatment of approximately one-third. Sample size
calculations were performed using G*Power Version 3.1.2
(Faul et al. 2007)."
What method is right for me?
Same principles as choosing the right test to analyze your
data
For example, if you are measuring weight before and after
an intervention then use a test appropriate for paired data
rather than treating them as individual groups
If you expect your data to be grossly skewed, consider
transformations a priori or use of a non-parametric test
Example 2: comparing two means
Let’s use PS to do the same calculation
Go to demo:
• Example
• Power curves
Example 2: comparing two means
A mathematical treatment of the classical ‘comparing two
means’ problem exposes the standard sample size
determination formula
Let:
• μ1, μ2 = the two group means
• σ = the common standard deviation
• d = μ2 – μ1, the clinically relevant difference
• n = the sample size in each group
Example 2: comparing two means
We’ll assume that the data in each group are sampled from
a normal distribution with common standard deviation σ
and group mean (μ1 or μ2)
With a bit of math we can show that the difference in sample
means, ȳ2 – ȳ1, is distributed as N(μ2 – μ1, 2σ²/n)
Example 2: comparing two means
The hypotheses are
H0: μ1 = μ2
H1: μ1 ≠ μ2
We will reject H0 at α = 0.05 if either:
1. The lower 95% CI limit for μ2 – μ1 is >0
2. The upper 95% CI limit for μ2 – μ1 is <0
Example 2: comparing two means
Look at the first case:
The power is determined by calculating the probability that
the lower 95% CI limit, (ȳ2 – ȳ1) – 1.96√(2σ²/n), exceeds 0
when the true difference is d
With some manipulation, and setting the power equal to 0.8,
we get an approximate sample size formula:
n > 2 × (2.8σ / d)²
Example 2: comparing two means
Try it—should get n > 37.2, meaning we would need 38
patients in each arm (or 76 patients in total)
General formula
For many problems a normal approximation can be used,
so the following formula (for general α and β) is common:
n > 2 × (Zα/2 + Zβ)² × σ² / d²
where σ is a measure of the variation and d is the effect
size the study is designed to detect
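A minimal sketch of this general formula in Python, using the standard library’s `NormalDist` for the normal quantiles (checked against Example 2):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, d, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for comparing two means."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = z(power)           # ~0.84 for power = 0.8
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / d ** 2

# Example 2: sigma = 7.7 points, clinically relevant difference d = 5 points
n = n_per_group(sigma=7.7, d=5)
print(ceil(n))  # n ~ 37.2, so 38 per arm
```

Note this is the normal approximation; exact noncentral-t software such as G*Power gives a slightly larger answer (39 per arm).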
Example 3: Precision
Election polls are interested in estimating the share
of voters who will vote for each party
Example 3: Precision
There are only 2 political parties: the reds and the blues. BBC
News has commissioned an election poll to estimate the
proportion of voters who will vote red in the General Election.
The news editor says he wants to estimate the proportion with a
margin of error <5% (or in the USA, “within 5 points”).
Let:
p = the true (unknown) proportion of voters in the UK who will
vote red
n = the number of voters the BBC will ask in the opinion poll
X = the number of voters (out of n) that will say they will vote red
(note: we haven’t done the poll yet, so this is unknown and
therefore random for the moment)
Example 3: Precision
We have that X ~ Bin(n, p)
If we let Y = X / n, so that Y is a proportion, then we have
• the expectation of Y is p
• the variance of Y is p(1 – p)/n
Subject to certain assumptions, we can use a normal
(Gaussian) approximation to say
Y ~ N(p, p(1 – p)/n)
Example 3: Precision
The margin of error is defined to be approximately 2 standard deviations of this estimator either side of the estimate: 2√(p(1 – p)/n)
We don’t know p, however; if we did, we wouldn’t be doing this
It turns out that the ‘worst case’ scenario (i.e. the largest margin of error) occurs when p = 0.5, which if we substitute in gives: 1 / √n
Therefore we want 1 / √n < 0.05, which means n > 20² = 400
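A sketch of the precision calculation in Python (standard library only):

```python
from math import sqrt

def margin_of_error(p, n):
    """Approximate 95% margin of error for an estimated proportion:
    two standard deviations of Y = X/n either side of the estimate."""
    return 2 * sqrt(p * (1 - p) / n)

# Worst case p = 0.5 gives margin 1/sqrt(n); a 5% margin needs n = 20^2
n_needed = round((2 * 0.5 / 0.05) ** 2)   # 400
check = margin_of_error(0.5, n_needed)    # 0.05
```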
Sample size
You’ve calculated a sample size for your study and it is too
large, or you’ve calculated the power from your study at the
maximum allowable sample size and it’s too low
Don’t give up!
Factors that affect power
Sample size: power increases as sample size increases
Variation: power decreases as variation increases
Effect size: power increases as the clinically relevant effect size one wishes to detect increases
Test: one- or two-tailed?
Software: don’t be alarmed if different software gives marginally different sample sizes (slightly different approximations)
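The first three factors can be illustrated numerically with the usual normal approximation to the power of a two-sided, two-sample comparison of means (a sketch that ignores the negligible far tail; numbers borrowed from Example 2):

```python
from statistics import NormalDist

def approx_power(n, sigma, d, alpha=0.05):
    """Approximate power with n subjects per group, common SD sigma,
    and true difference d (normal approximation, two-sided test)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    standardized = d / (sigma * (2 / n) ** 0.5)
    return nd.cdf(standardized - z_alpha)

base = approx_power(n=38, sigma=7.7, d=5)        # ~0.80
smaller_n = approx_power(n=20, sigma=7.7, d=5)   # power falls as n falls
bigger_sd = approx_power(n=38, sigma=10.0, d=5)  # power falls as variation rises
bigger_d = approx_power(n=38, sigma=7.7, d=6)    # power rises with effect size
```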
Strategies for maximizing power
• Use frequent outcomes or consider using a composite outcome as power is driven mainly by the number of events rather than the total sample size
• Use paired design (such as crossover trial)
• Use continuous variables, but remember to differentiate between statistical significance and clinical relevance
• Design study protocol to record follow-up data at a point when differences are likely to be large
• Pool resources, e.g. a multi-center trial
Have you considered everything?
• Justified? E.g. budgetary constraints / ethical
• Missing any important covariates?
• Expect any drop outs?
• Unequal groups?
• Multiple comparisons / tests?
• Sensitivity analysis to assumptions made?
• Many post hoc adjustments and formulae adaptations available
Mid-trial sample size recalculation
For numerous reasons one might want to reconsider the
number of subjects after a trial has commenced
For example, the sample size was powered to detect an effect
size d, but it is clear mid-way that the treatment effect is <d, yet
might still be clinically beneficial
Authorities (e.g. FDA) used to say “no” – issues of credibility
Abandoning the study and redesigning can lead to an inflated
Type I error rate
There are new statistical methods that can handle this though
Retrospective sample size calculations
Unnecessary and uninformative!
Hoenig JM, Heisey DM. The Abuse of Power: The
Pervasive Fallacy of Power Calculations for Data Analysis,
The American Statistician, 2001; 55: 19-24