Sample size determination
TRANSCRIPT
Want a copy of these slides?
Email me: [email protected]
SlideShare*: http://www.slideshare.net/GraemeHickey
*Will put up all sessions very shortly
The questions…
How many subjects do I need to include in my
study?
I can only afford to test 50 rats; is that enough?
I want to estimate the proportion of IGH
researchers who will vote for The Green Party;
how many should I ask?
Why bother?
Scientific: might miss out on an important discovery
(testing too few), or find a clinically irrelevant effect size
(testing too many)
Ethical: might sacrifice subjects unnecessarily (testing too many) or
expose subjects when the chance of study success is low (testing too few)
Economical: might waste money (testing too many) or
have to repeat the experiment (testing too few)
Also, generally required for study grant proposals
What not to do
Use same sample size as another (possibly similar) study
Might have just gotten lucky
Base sample size on what is available
Extend study period, seek more money, pool study
Use a nice whole number and hope no one notices
Unless you want your paper rejected
Avoid calculating a sample size because you couldn’t estimate the parameters needed
Do a pilot study or use approximate formulae, e.g. SD ≈ (max – min) / 4
Avoid calculating a sample size because you couldn’t work one out
Speak to a statistician
Approaches
Subjects I need: how many subjects do I need to
detect the effect I consider clinically significant?
versus
Subjects I can get: does my study have adequate
power to detect the effect I consider clinically
significant?
Frameworks
Precision: I want to determine a sample size to
bound the maximum margin of error
versus
Power: I want to determine a sample size that
confers a desired level of power to detect a
clinically relevant effect size
Initial considerations
Sample size calculations are just one aspect of study design (sampling/randomization methods, problem formulation, outcome variables, covariates, …)
For example: a two-stage prevalence survey (e.g. cows) within higher clusters (e.g. farms)
1. How many farms should we sample?
2. How many cows on each farm should we sample?
Basic principles (that we discuss today) remain the same, but the math and rationale differ
Null hypothesis
A null hypothesis (H0) is a statement that one seeks to
nullify with evidence to the contrary
E.g. patients who have blocked coronary arteries
randomized to receive either bypass graft surgery or
percutaneous coronary implant
Calculate the 30-day all-cause mortality proportions: p1
(CABG) and p2 (PCI)
H0: no difference in mortality at 30 days, i.e. p1 = p2
Alternative hypothesis
• We can also specify an alternative hypothesis (H1), which
is normally the complement (i.e. ‘opposite’) of the null
• H1: there is a difference in mortality at 30 days; i.e. p1 ≠ p2
Simple vs. composite hypotheses
Simple: any hypothesis which specifies the population
distribution completely, e.g. H0: p = 0.5
Composite: any hypothesis which does not specify the
population distribution completely, e.g. H0: p > 0.5
Standard setups
Test for          Null hypothesis        Alternative hypothesis
Equality          H0: μT – μP = 0        H1: μT – μP ≠ 0
Non-inferiority   H0: μT – μP ≥ δ        H1: μT – μP < δ
Superiority       H0: μT – μP ≤ δ        H1: μT – μP > δ
Equivalence       H0: |μT – μP| ≥ δ      H1: |μT – μP| < δ
δ = effect size we consider clinically significant
μT = average in treatment group
μP = average in placebo group
P-values
Probability of observing an effect at least as extreme as the
one found, if the null hypothesis is true
If the P-value is less than some predefined significance
level, we conclude that the evidence (data) are inconsistent
with the null hypothesis, and we reject the null [in favor of
the alternative hypothesis]
Worth remembering: sometimes there is no effect
Example
Toss a coin 5 times and count the number of heads
If the coin is fair, Pr[Head] = 0.5 (null hypothesis)
It is possible I am using a ‘trick coin’, so that Pr[Head] ≠ 0.5
(alternative hypothesis)
Observe H, H, H, H, H – is this enough evidence to reject
the null?
Example
Set my significance level to 0.05
If the null hypothesis is true, the probability of observing 5 heads is (0.5)^5 ≈ 0.03
If the null hypothesis is true, the probability of observing 5 tails is (0.5)^5 ≈ 0.03
Probability of observing a result at least as extreme as the one found ≈ 0.03 + 0.03 = 0.06 (two-tailed test)
0.06 > 0.05 – therefore I do not reject the null hypothesis that my coin is fair
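The calculation above can be checked with a short Python snippet (a sketch using only the standard library; the slides round 0.03125 to 0.03):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Under H0 the coin is fair: p = 0.5, n = 5 tosses
p_five_heads = binom_pmf(5, 5, 0.5)   # (0.5)^5 = 0.03125
p_five_tails = binom_pmf(0, 5, 0.5)   # (0.5)^5 = 0.03125

# Two-tailed p-value: results at least as extreme in either direction
p_value = p_five_heads + p_five_tails
print(p_value)  # 0.0625 > 0.05, so we do not reject H0
```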
Errors
We are all (hopefully) familiar with diagnostic test errors
Type I error
False positive
Type II error
False negative
[Slide image: two pregnancy tests, one declaring “Congrats! You are pregnant!” and the other “Nope… not pregnant!”]
Courtroom outcomes
Failure to reject the null hypothesis of innocence
does not mean the defendant is innocent
We can only say that the standard of evidence was
insufficient
We cannot (in practice) prove innocence, only
present evidence to reject the hypothesis:
Pr[Innocence | Evidence ] ≠ Pr[Evidence | Innocence]
Errors will occur
Which is worse:
1. Sending an innocent person to prison? (Type I error)
2. Setting a guilty person free? (Type II error)
If you were a judge overseeing lots and lots of
cases each year, what proportion of the time would
you be willing to accept a Type I error and a Type II
error?
Significance
Probability that you reject the null hypothesis (in favor of
the alternative hypothesis) when the null hypothesis is true:
α = Pr[reject H0 | H0 is true] = Pr[accept H1 | H0 is true]
What does this mean? If we set α = 0.05, then we are willing to
wrongly reject a true null hypothesis 5% of the time (1 in every 20 tests)
Power
Probability that you reject the null hypothesis (in favor of
the alternative hypothesis) when the null hypothesis is
false:
1–β = Pr[reject H0 | H1 is true] = Pr[accept H1 | H1 is true]
Recipe for most common formulation
1. Specify hypothesis
2. Specify the significance level (α)
3. Specify the effect size that is clinically relevant
4. Specify the power (1-β)
5. Use appropriate software / formulae to determine the minimum sample size
A more preferable approach:
4. Specify sample sizes you can reasonably test (resources, ethics, etc.)
5. Use appropriate software / formulae to determine the power
Selecting the minimal clinically relevant
effect size
Preferably this is based on science, e.g. a mean reduction in weight of 3kg using a new weight loss drug compared to dieting alone would be expected to lead to a 50% reduction in stroke rates
Could base it on previous data:
• Published data
• Pilot study
• Expert scientific opinion
Don’t ask your statistician to choose this—you wouldn’t let them do your experiment / treat your patients!
Example 1: biased coin
I can afford to toss a coin n = 10 times, and want
to know if it is biased in favor of heads
p = Pr[Head on any single toss]
H0: p = 0.5
H1: p > 0.5 (one-sided)
Example 1: biased coin
I will use a significance level of α = 0.05, which means that
I should reject the null hypothesis if I observed more than 7
heads
Why?
X = a random variable denoting the number of heads
observed out of 10 tosses
X ~ Binomial(10, p)
Pr(X > 7 | H0) = 0.0547 (which is near enough)
(remember: p = 0.5 under H0)
Example 1: biased coin
So what is the power of the test?
If I wanted to be able to detect a coin that is biased
by 10%
Pr(X > 7 | p = 0.6) = 0.17
If I wanted to be able to detect a coin that is biased
by 20%
Pr(X > 7 | p = 0.7) = 0.38
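The significance level and power figures for this example can be reproduced by summing binomial tail probabilities; a sketch in Python (standard library only):

```python
from math import comb

def binom_tail(k_min, n, p):
    """Pr[X >= k_min] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

n = 10
# Reject H0 (p = 0.5) if we observe more than 7 heads, i.e. X >= 8
alpha = binom_tail(8, n, 0.5)        # 0.0547
power_10pct = binom_tail(8, n, 0.6)  # ~0.17 against p = 0.6
power_20pct = binom_tail(8, n, 0.7)  # ~0.38 against p = 0.7
```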
Software
PS: Power and Sample Size Calculation
Vanderbilt University (Windows only)
http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize
G*Power
Universität Düsseldorf (Windows + Mac OSX)
http://www.gpower.hhu.de/en.html
Statistical software (e.g. R [base + packages], Stata, SAS, …)
Many online web-calculators: google ‘sample size calculator’
Commercial software: statisticians in academia don’t use it as it costs $$$
Demonstration of G*Power
Let’s confirm our power calculations with:
n = 10
α = 0.0547
What happens if I put α = 0.05?
Reporting sample size calculations
Journals now ask for the ‘sample size rationale’
Aim to report enough detail so that an independent person could reproduce the calculation:
• Hypothesis
• Clinically relevant effect size
• One or two-tailed test
• Power
• Significance
• Test used (e.g. an exact or approximation)
• Software used
Any values used to determine the clinically relevant effect size should be given, including references to articles if available
Example 2: comparing two means
A randomized controlled trial has been planned to evaluate
a brief psychological intervention in comparison to usual
treatment in the reduction of suicidal ideation amongst
patients presenting at hospital with deliberate self-
poisoning. Suicidal ideation will be measured on the Beck
scale; the standard deviation of this scale in a previous
study was 7.7, and a difference of 5 points is considered to
be of clinical importance. It is anticipated that around one
third of patients may drop out of treatment.
Example 2: comparing two means
Required information
• Primary outcome variable = The Beck scale for suicidal
ideation. A continuous variable summarized by means.
• Standard deviation = 7.7 points
• Size of difference of clinical importance = 5 points
• Significance level = 5%
• Power = 80%
• Type of test = two-sided t-test
Example 2: comparing two means
Let’s use G*Power to calculate the sample size
Should get 39 in each arm (total n = 78)
What if I knew dropout rate was ~1/3 in RCTs?
n_required = n_calculated / (1 − 1/3) = 39 / (2/3) = 58.5 in each
arm, which we round up to 59
Therefore we want a sample size of 118
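The drop-out adjustment is simple arithmetic; a quick sketch (39 per arm is the G*Power result quoted above):

```python
import math

n_calculated = 39        # per arm, from the power calculation
dropout = 1 / 3          # anticipated drop-out from treatment
n_required = math.ceil(n_calculated / (1 - dropout))  # 58.5 -> 59
n_total = 2 * n_required
print(n_required, n_total)  # 59 118
```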
Example 2: comparing two means
Writing it up for a manuscript or grant application:
"Using a two-tailed t-test, a sample size of 39 in each group
will be sufficient to detect a difference of 5 points on the
Beck scale of suicidal ideation, assuming a standard
deviation of 7.7 points, a power of 80%, and a significance
level of 5%. This number has been increased to 59 per
group (total of 118), to allow for a predicted drop-out from
treatment of approximately one-third. Sample size
calculations were performed using G*Power Version 3.1.2
(Faul et al. 2007)."
What method is right for me?
Same principles as choosing the right test to analyze your
data
For example, if you are measuring weight before and after
an intervention then use a test appropriate for paired data
rather than treating them as individual groups
If you expect your data to be grossly skewed, consider
transformations a priori or use of a non-parametric test
Example 2: comparing two means
Let’s use PS to do the same calculation
Go to demo:
• Example
• Power curves
Example 2: comparing two means
A mathematical treatment of the classical ‘comparing two
means’ problem exposes the standard sample size
determination formula
Let:
• μ1, μ2 = the two group means
• σ = the common standard deviation
• d = μ2 – μ1, the clinically relevant difference
• n = the sample size in each group
Example 2: comparing two means
We’ll assume that the data in each group are sampled from
a normal distribution with common standard deviation σ
and group mean (μ1 or μ2)
With a bit of math we can show that the difference in sample
means, ȳ2 – ȳ1, is distributed as N(μ2 – μ1, 2σ²/n)
Example 2: comparing two means
The hypotheses are
H0: μ1 = μ2
H1: μ1 ≠ μ2
We will reject H0 at α = 0.05 if either:
1. The lower 95% CI limit for μ2 – μ1 is >0
2. The upper 95% CI limit for μ2 – μ1 is <0
Example 2: comparing two means
Look at the first case:
The power is determined by calculating the probability that
the lower 95% CI limit, (ȳ2 – ȳ1) – 1.96√(2σ²/n), exceeds 0
when the true difference is d
With some manipulation, and setting the power equal to 0.8,
we get an approximate sample size formula:
n > 2 × (2.8σ / d)²
Example 2: comparing two means
Try it—should get n > 37.2, meaning we would need 38
patients in each arm (or 76 patients in total)
General formula
For many problems a normal approximation can be used,
so the following formula (for general α and β) is common:
n > 2 × (Zα/2 + Zβ)² × σ² / d²
where σ is a measure of the variation and d is the effect
size the study is designed to detect
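A minimal sketch of this general formula in Python, using the standard library’s `NormalDist` for the normal quantiles (checked against Example 2):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, d, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for comparing two means."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = z(power)           # ~0.84 for power = 0.8
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / d ** 2

# Example 2: sigma = 7.7 points, clinically relevant difference d = 5 points
n = n_per_group(sigma=7.7, d=5)
print(ceil(n))  # n ~ 37.2, so 38 per arm
```

Note this is the normal approximation; exact noncentral-t software such as G*Power gives a slightly larger answer (39 per arm).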
Example 3: Precision
Election polls are interested in estimating the share
of voters who will vote for each party
Example 3: Precision
There are only 2 political parties: the reds and the blues. BBC
News has commissioned an election poll to estimate the
proportion of voters who will vote red in the General Election.
The news editor says he wants to estimate the proportion with a
margin of error <5% (or in the USA, “within 5 points”).
Let:
p = the true (unknown) proportion of voters in the UK who will
vote red
n = the number of voters the BBC will ask in the opinion poll
X = the number of voters (out of n) that will say they will vote red
(note: we haven’t done the poll yet, so this is unknown and
therefore random for the moment)
Example 3: Precision
We have that X ~ Bin(n, p)
If we let Y = X / n, so that Y is a proportion, then we have
• the expectation of Y is p
• the variance of Y is p(1 – p)/n
Subject to certain assumptions, we can use a normal
(Gaussian) approximation to say
Y ~ N(p, p(1 – p)/n)
Example 3: Precision
The margin of error is defined to be approximately 2 standard deviations of this estimator either side of the estimate: 2√(p(1 – p)/n)
We don’t know p, however; if we did, we wouldn’t be doing this
It turns out that the ‘worst case’ scenario (i.e. the largest margin of error) occurs when p = 0.5, which if we substitute in gives: 1 / √n
Therefore we want 1 / √n < 0.05, which means n > 20² = 400
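A sketch of the precision calculation in Python (standard library only):

```python
from math import sqrt

def margin_of_error(p, n):
    """Approximate 95% margin of error for an estimated proportion:
    two standard deviations of Y = X/n either side of the estimate."""
    return 2 * sqrt(p * (1 - p) / n)

# Worst case p = 0.5 gives margin 1/sqrt(n); a 5% margin needs n = 20^2
n_needed = round((2 * 0.5 / 0.05) ** 2)   # 400
check = margin_of_error(0.5, n_needed)    # 0.05
```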
Sample size
You’ve calculated a sample size for your study and it is too
large, or you’ve calculated the power from your study at the
maximum allowable sample size and it’s too low
Don’t give up!
Factors that affect power
Sample size: power increases as sample size increases
Variation: power decreases as variation increases
Effect size: power increases as the clinically relevant effect size one wishes to detect increases
Test: one- or two-tailed?
Software: don’t be alarmed if different software gives marginally different sample sizes (slightly different approximations)
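The first three factors can be illustrated numerically with the usual normal approximation to the power of a two-sided, two-sample comparison of means (a sketch that ignores the negligible far tail; numbers borrowed from Example 2):

```python
from statistics import NormalDist

def approx_power(n, sigma, d, alpha=0.05):
    """Approximate power with n subjects per group, common SD sigma,
    and true difference d (normal approximation, two-sided test)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    standardized = d / (sigma * (2 / n) ** 0.5)
    return nd.cdf(standardized - z_alpha)

base = approx_power(n=38, sigma=7.7, d=5)        # ~0.80
smaller_n = approx_power(n=20, sigma=7.7, d=5)   # power falls as n falls
bigger_sd = approx_power(n=38, sigma=10.0, d=5)  # power falls as variation rises
bigger_d = approx_power(n=38, sigma=7.7, d=6)    # power rises with effect size
```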
Strategies for maximizing power
• Use frequent outcomes or consider using a composite outcome as power is driven mainly by the number of events rather than the total sample size
• Use paired design (such as crossover trial)
• Use continuous variables, but remember to differentiate between statistical significance and clinical relevance
• Design study protocol to record follow-up data at a point when differences are likely to be large
• Pool resources, e.g. a multi-center trial
Have you considered everything?
• Justified? E.g. budgetary constraints / ethical
• Missing any important covariates?
• Expect any drop outs?
• Unequal groups?
• Multiple comparisons / tests?
• Sensitivity analysis to assumptions made?
• Many post hoc adjustments and formulae adaptations available
Mid-trial sample size recalculation
For numerous reasons one might want to reconsider the
number of subjects after a trial has commenced
For example, the sample size was powered to detect an effect
size d, but it is clear mid-way that the treatment effect is <d, yet
might still be clinically beneficial
Authorities (e.g. FDA) used to say “no” – issues of credibility
Abandoning the study and redesigning can lead to an inflated
Type I error rate
There are new statistical methods that can handle this though
Retrospective sample size calculations
Unnecessary and uninformative!
Hoenig JM, Heisey DM. The Abuse of Power: The
Pervasive Fallacy of Power Calculations for Data Analysis,
The American Statistician, 2001; 55: 19-24