July 2006 SAMSI
Some Thoughts on Replicability in Science
Yoav Benjamini
Tel Aviv University
www.math.tau.ac.il/~ybenja
Based on joint work with
Ilan Golani, Department of Zoology, Tel Aviv University
Greg Elmer, Neri Kafkafi, Behavioral Neuroscience Branch, National Institute on Drug Abuse/IRP, Baltimore, Maryland
Dani Yekutieli, Anat Sakov, Ruth Heller, Rami Cohen, Department of Statistics, Tel Aviv University
Dani Yekutieli, Yosi Hochberg, Department of Statistics, Tel Aviv University
Outline of Lecture
1. Prolog
2. The replicability problems in behavior genetics
3. Addressing strain*lab interaction
4. Addressing multiple endpoints
5. The replicability problems in medical statistics
6. The replicability problems in functional Magnetic Resonance Imaging (fMRI)
7. Epilog
1. Prolog
J. W. Tukey's last paper (with Jones and Lewis) was an entry on Multiple Comparisons for the International Encyclopedia of Statistics.
It started with a general discussion: multiple comparisons addresses "a diversity of issues ... that tend to be important, difficult, and often unresolved."
• Multiple comparisons; multiple determinations
• Selection of one or more candidates
• Selection of variables; selecting their transformations; etc.
(... his usual advice: there need not be a single best ...)
The Mixed Puzzle
Then, the Encyclopedia entry included two issues in detail:
• The False Discovery Rate (FDR) approach in pairwise comparisons
• The random-effects vs fixed-effects ANOVA
"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by
appropriate error term = f-error term + c [r-error term - f-error term]
... It pays then to learn as much as possible about values of c in the real world."
But what's that to do with Multiple Comparisons?
2. Behavior genetics
• Study the genetics of behavioral traits: hearing, sight, smell, alcoholism, locomotion, fear, exploratory behavior
• Compare behavior between inbred strains, crosses, knockouts...
• Number of behavioral endpoints: ~200 and growing
The entry Tukey wrote was about Replicability
The search for replicable scientific methods
• Fisher's The Design of Experiments (1935):
"In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results." (p. 14)
i.e. the significance level is interpreted directly in replications of the experiment.
The discussion motivates the inclusion of more extreme results in the rejection region.
Replicability
"Behavior Genetics in transition" (Mann, Science '94):
"...jumping too soon to discoveries..." (and press discoveries) raises the issue of replicability.
Mann identifies statistics as a major source of troubles, yet did not mention the two main themes we'll address.
The common cry: lack of standardization (e.g. Köln 2000)
Does it work?
• Crabbe et al (Science 1999): the same experiment at 3 labs
• In spite of strict standardization, they found: a Strain effect, a Lab effect, and a Lab*Strain interaction
From their conclusions:
"Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory."
"...differences between labs... can contribute to failures to replicate results of Genetic Experiments" Wahlsten (2001)
A concrete example: exploratory behavior
NIH: Phenotyping Mouse Behavior - high-throughput screening of mutant mice. Comparing between 8 inbred strains of mice. Dr. Ilan Golani (TAU), Dr. Elmer (MPRC), Dr. Kafkafi (NIDA)
Behavior tracking
Using sophisticated data-analytic tools we get, for "segment acceleration" (log-transformed):
The display supporting this claim, for Distance Traveled (m).

The statistical analysis supporting this claim, for proportion of time in center (logit):

Source      df   MSE     F      p-value
Strain       7   102.5   44.8   0.00001
Lab          2   6.35    2.77   0.065
Lab*Strain  14   6.87    3.00   0.00028
Residuals  264   2.29

and it is a common problem:
Fig. 2. The proportion of variance contributed by each factor, computed as its sum of squares divided by the total sum of squares, for all endpoints. Endpoints are sorted by their genotypic variance. Asterisks mark interaction terms that were found significant by the FM at a level of 0.05.
[Stacked-bar chart, 0%-100% of total variance, over 17 endpoints (Stops per Excursion, Latency to Half Max Speed, Lingering Spatial Spread, Relative Activity Decrease, # Excursions, Homebase Relative Occupancy, Radius of Turn, Diversity, Lingering Mean Speed, Segment Acceleration, # Progression Segments, Center Time, Segment Length, Segment Max Speed, Rate of Turn, Lingering Time, Distance Traveled), with components Genotype, Individual, Laboratory, and Interaction.]
Kafkafi & YB et al, PNAS '05
Our statistical diagnosis of the replicability problem
Part I. Using the wrong "yardstick for variability": fixed-model analysis, treating labs' effects as fixed.
Part II. Multiplicity problems: many endpoints; repeated testing (screening).
(Kafkafi & YB et al, PNAS '05)
3. Part I: The mixed model
• The existence of Lab*Strain interaction does not diminish the credibility of a behavioral endpoint - in this sense it is not a problem
• This interaction should be recognized as "a fact of life"
• The interaction's size is the right "yardstick" against which genetic differences should be compared
Statistically speaking: Lab is a random factor, as is its interaction with Strain. A mixed model should be used (rather than a fixed one).
The formal mixed model
Y_LSI is the value of an endpoint for laboratory L and strain S; the index I represents the repetition within each group:

Y_LSI = s_S + a_L + b_L*S + ε_LSI

s_S is the strain effect, which is considered fixed;
a_L ~ N(0, σ²_LAB) is the laboratory random effect;
b_L*S ~ N(0, σ²_LAB*STRAIN) is the interaction random effect;
ε_LSI ~ N(0, σ²) is the individual variability.
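The mixed model can be sketched in a short simulation. The group sizes match the study design (3 labs, 8 strains, ~12 mice per cell), but the variance components and strain effects below are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lab, n_strain, n_rep = 3, 8, 12            # 3 labs, 8 strains, 12 mice per cell
sd_lab, sd_int, sd_ind = 0.7, 0.6, 1.5       # assumed sigma_LAB, sigma_LAB*STRAIN, sigma

s = np.linspace(-2.0, 2.0, n_strain)         # fixed strain effects s_S (illustrative)
a = rng.normal(0, sd_lab, n_lab)             # random lab effects a_L
b = rng.normal(0, sd_int, (n_lab, n_strain)) # random interaction effects b_{L*S}
eps = rng.normal(0, sd_ind, (n_lab, n_strain, n_rep))  # individual variability

# Y_LSI = s_S + a_L + b_{L*S} + eps_LSI
y = s[None, :, None] + a[:, None, None] + b[:, :, None] + eps
```

A new replication of the experiment in a fresh lab redraws a_L and b_{L*S}, which is exactly why the interaction, and not only the residual, is the relevant yardstick for strain differences.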
Implications of the mixed model

Source      df   MSE     F      p-value
Strain       7   102.5   44.8   0.00001
Lab          2   6.35    2.77   0.09
Lab*Strain  14   6.87    3.00   0.00028
Residuals  264   2.29

Under the mixed model, Strain is tested against the Lab*Strain mean square rather than the residual: F = 102.5/6.87 ≈ 14.9 on (7, 14) df instead of 44.8 on (7, 264) df, and the Lab test likewise loses denominator df (p = 0.09 rather than 0.065). The slide also overlays estimates of σ²_LAB and σ²_LAB*STRAIN.
• Technically: the threshold for significant strain differences can be much higher.
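The change of yardstick can be computed directly from the ANOVA table (MSE 102.5 for Strain, 6.87 for Lab*Strain, 2.29 for residuals); scipy is used here only to convert the F ratios to p-values.

```python
from scipy.stats import f

ms = {"Strain": 102.5, "Lab": 6.35, "Lab*Strain": 6.87, "Residuals": 2.29}
df = {"Strain": 7, "Lab": 2, "Lab*Strain": 14, "Residuals": 264}

# Fixed-effects test: Strain MS against the residual MS
F_fixed = ms["Strain"] / ms["Residuals"]          # about 44.8
p_fixed = f.sf(F_fixed, df["Strain"], df["Residuals"])

# Mixed-model test: Strain MS against the Lab*Strain interaction MS
F_mixed = ms["Strain"] / ms["Lab*Strain"]         # about 14.9, on (7, 14) df
p_mixed = f.sf(F_mixed, df["Strain"], df["Lab*Strain"])
```

The mixed yardstick is strictly harder to beat: the F statistic shrinks and the denominator degrees of freedom drop from 264 to 14.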
Implications of the mixed model
• Practically:
1. For screening new mutants, significance is assessed against the yardstick σ²_LAB + σ²_LAB*STRAIN + σ²/n.
2. For screening new mutants vs a locally measured background, significance is assessed against 2σ²_LAB*STRAIN + σ²(1/n + 1/m).
Unfortunately, even as sample sizes increase, the interaction term does not disappear.
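A quick arithmetic sketch of these yardsticks. The variance-component values are made-up placeholders, and the exact formulas are a plausible reading of the slide (lab effect present when comparing across labs, cancelling for a within-lab comparison), not a quote.

```python
# Assumed (made-up) variance components and group sizes
s2_lab, s2_int, s2_ind = 0.4, 0.4, 2.3   # sigma^2_LAB, sigma^2_LAB*STRAIN, sigma^2
n, m = 12, 12

# 1. New mutant screened against strain values measured in other labs
var_new_mutant = s2_lab + s2_int + s2_ind / n

# 2. New mutant vs a locally measured background strain (lab effect cancels,
#    but each strain keeps its own interaction with the lab)
var_vs_local = 2 * s2_int + s2_ind * (1 / n + 1 / m)

# Fixed-model yardstick: only individual variability
var_fixed = s2_ind * (1 / n + 1 / m)

# Larger n and m shrink only the sigma^2 terms; the interaction floor remains
floor = 2 * s2_int
```

Even with n = m → ∞, var_vs_local only approaches 2σ²_LAB*STRAIN, which is the sense in which the interaction term "does not disappear".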
Implications of the mixed model
3. For developing new endpoints: a single lab cannot suffice for the development of a new endpoint - no yardstick is available.
Thus it is the developer's responsibility to offer estimates of the interaction variability and to put them in a public dataset (such as the Jackson Laboratory's).
Behavioral Endpoint              Fixed     Mixed
Prop. Lingering Time             0.00001   0.0029
# Progression segments           0.00001   0.0068
Median Turn Radius (scaled)      0.00001   0.0092
Time away from wall              0.00001   0.0108
Distance traveled                0.00001   0.0144
Acceleration                     0.00001   0.0146
# Excursions                     0.00001   0.0178
Time to half max speed           0.00001   0.0204
Max speed wall segments          0.00001   0.0257
Median Turn rate                 0.00001   0.0320
Spatial spread                   0.00001   0.0388
Lingering mean speed             0.00001   0.0588
Homebase occupancy               0.001     0.0712
# stops per excursion            0.0028    0.1202
Stop diversity                   0.027     0.1489
Length of progression segments   0.44      0.5150
Activity decrease                0.67      0.8875
Significance of 8 strain differences (p-values under the fixed and mixed analyses)
In summary of Part I
Practically, the threshold for making discoveries, in all aspects, is set at a higher level.
Is this a drawback? It is a way to weed out non-replicable differences.
What about the warning: "Never use a random factor unless your levels are a true random sample"?
• We do not agree: replicability in a new lab is at least partially captured by a random-effects model.
• Revisit Jones, Lewis and Tukey (2002)
"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by
appropriate error term = f-error term + c [r-error term - f-error term]
so that 'everything fixed' corresponds to c = 0 and 'random' is a particular case of c = 1. It pays then to learn as much as possible about values of c in the real world."
[Slide examples: c = 0.5, fixed columns and illustrative weights; c = 1.6, illustrative columns.]
A challenge: can we estimate c?
(The slide repeats the mixed-model p-values for the 17 endpoints, from 0.0029 for Prop. Lingering Time to 0.8875 for Activity decrease.)
Significance of 8 strain differences:
Should we believe all p-values ≤ 0.05? Not necessarily - beware of multiplicity!
4. Part II: The multiplicity problem
• The more statistical tests in a study, the larger the probability of making a type I error
• Stricter control means less power to discover a real effect
Traditional approaches:
• "Don't worry, be happy": conduct each test at the usual .05 level (e.g. Ioannidis' PLoS paper)
• Panic! Control the probability of making even a single type I error in the entire study at the usual level (e.g. Bonferroni)
Panic causes severe loss of power to discover in large problems.
(Again the mixed-model p-values for the 17 endpoints.)
Significance of 8 strain differences:
Should we believe all p-values ≤ 0.05? Not necessarily - beware of multiplicity!
Should we use Bonferroni? The Bonferroni threshold is 0.05·1/17 = .0029, so only the smallest p-value (Prop. Lingering Time, 0.0029) survives.
Panic causes severe loss of power to discover in large problems.
• "Genetic dissection of complex traits: guidelines for interpreting...", Lander and Kruglyak:
• "Adopting too lax a standard guarantees a burgeoning literature of false positive linkage claims, each with its own symbol... Scientific disciplines erode their credibility when a substantial proportion of claims cannot be replicated..."
• "On the other hand, adopting too high a hurdle for reporting results runs the risk that a nascent field will be stillborn."
Is there an in-between approach?
The False Discovery Rate (FDR) criterion
The FDR approach takes seriously the concern of Lander & Kruglyak.
The error in the entire study is measured by
Q = the proportion of false discoveries among the discoveries (= 0 if none found), and
FDR = E(Q)
• If nothing is "real", controlling the FDR at level q guarantees that the probability of making even one false discovery is less than q. This is why we choose usual levels of q, say 0.05.
• But otherwise there is room for improving detection power.
• This error rate is scalable, adaptive, and economically interpretable.
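The definition FDR = E(Q) can be illustrated by a toy simulation: the mixture of true and false nulls below is an arbitrary assumption, and the Benjamini-Hochberg step-up procedure is used to control FDR at q = 0.05.

```python
import numpy as np
from scipy.stats import norm

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest i with p_(i) <= q*i/m."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = sorted_p <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(42)
m, m0, q, reps = 100, 80, 0.05, 2000           # 80 true nulls, 20 real effects
fdp = np.empty(reps)
for r in range(reps):
    z = np.concatenate([rng.normal(0, 1, m0),      # true nulls
                        rng.normal(3, 1, m - m0)]) # real effects
    p = norm.sf(z)                                 # one-sided p-values
    rej = bh_reject(p, q)
    R, V = rej.sum(), rej[:m0].sum()               # discoveries, false discoveries
    fdp[r] = V / R if R > 0 else 0.0               # Q for this simulated study

# fdp.mean() estimates FDR = E(Q); under independence BH yields q*m0/m = 0.04
```

Note that Q itself varies from study to study; only its expectation is controlled.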
Our motivating work was Sorić (JASA 1989): if we use size-0.05 tests to decide upon statistical discoveries, then
"there is danger that a large part of science is not true".
We define Q = V/R if R > 0, and Q = 0 if R = 0; Sorić used E(V)/R for his demonstrations.
More recently, Ioannidis (PLoS Medicine '05) just repeated the argument using the Positive Predictive Value (PPV),
PPV = 1 - Q,
stating "most published research findings are false". For demonstration in his model he used
PPV = 1 - E(V)/E(R) = 1 - FDR.
Control of the FDR assures a large PPV under most of Ioannidis' scenarios (except biases such as omission, publication, and interest).
Behavioral Endpoint              Mixed     BH threshold (0.05·k/17)
Prop. Lingering Time             0.0029    0.05·1/17 = 0.0029
# Progression segments           0.0068
Median Turn Radius (scaled)      0.0092
Time away from wall              0.0108
Distance traveled                0.0144
Acceleration                     0.0146
# Excursions                     0.0178
Time to half max speed           0.0204
Max speed wall segments          0.0257    0.05·9/17 = 0.0265
Median Turn rate                 0.0320    0.05·10/17 = 0.0294
Spatial spread                   0.0388    0.05·11/17 = 0.0324
Lingering mean speed             0.0588
Homebase occupancy               0.0712
# stops per excursion            0.1202
Stop diversity                   0.1489    0.05·15/17
Length of progression segments   0.5150    0.05·16/17
Activity decrease                0.8875    0.05·17/17
Significance of 8 strain differences. The largest k with p(k) ≤ 0.05·k/17 is k = 9, so the nine smallest p-values are declared FDR discoveries (vs one under Bonferroni).
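Applied to the 17 mixed-model p-values, unadjusted 0.05 testing, Bonferroni, and the BH step-up procedure give three different discovery counts; a sketch of the comparison made on the slide:

```python
# The 17 mixed-model p-values from the endpoint table
pvals = [0.0029, 0.0068, 0.0092, 0.0108, 0.0144, 0.0146, 0.0178, 0.0204,
         0.0257, 0.0320, 0.0388, 0.0588, 0.0712, 0.1202, 0.1489, 0.5150, 0.8875]
m, q = len(pvals), 0.05

unadjusted = sum(p <= q for p in pvals)        # "don't worry": each test at 0.05
bonferroni = sum(p <= q / m for p in pvals)    # "panic": threshold 0.05/17

# BH step-up: find the largest k with p_(k) <= q*k/m; the k smallest are discoveries
sorted_p = sorted(pvals)
bh = max((i for i in range(1, m + 1) if sorted_p[i - 1] <= q * i / m), default=0)
```

BH declares nine discoveries here, versus eleven with no correction and only one under Bonferroni.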
Addressing multiplicity by controlling the FDR
FDR control is a very active area of current research, mainly because of its scalability (even into millions...):
• types of dependency
• resampling procedures
• FDR-adjusted p-values
• adaptive procedures
• Bayesian interpretations
• related error rates
• model selection
Is there a mixed-multiplicity connection?
Recall that in the puzzling Encyclopedia entry only two issues were addressed in detail:
• The FDR approach in pairwise comparisons
• The random-effects vs fixed-effects problem
The mixed-multiplicity connection
• In the fixed framework we selected three labs and made our analysis as if this were our entire world of reference.
• When this is not the case - as when the experiment is repeated in a different lab - the fixed point of view is overly optimistic.
• This is also the essence of the multiplicity problem - say, selecting the maximal difference (with the smallest p-value) and treating it as if it were our only comparison.
In both cases, conclusions from a naïve point of view have too great a chance to be non-replicable.
5. Replicability in medical research: hormone therapy in postmenopausal women
• A very large and long randomized controlled study (Women's Health Initiative; Rossouw, Anderson, Prentice, LaCroix; JAMA, 2002). The study was not performed for drug approval.
• It was stopped before completion because the expected effects were reversed.
• Bonferroni-adjusted and marginal (nominal) CIs were reported.
• The conclusions were contradictory: the decision to stop the trial was based on the marginal CIs.
The editorial:
"The authors present both nominal and rarely used adjusted CIs to take into account multiple testing, thus widening the CIs. Whether such adjustment should be used has been questioned, ... ". (Fletcher and Colditz, 2002)
Our puzzle: US and European regulatory bodies require adjusting clinical-trial results for multiplicity. So, is the statement true?
A small meta-analysis of methods (with Rami Cohen): check with the flagship of medical research:
Sampling the New England Journal of Medicine
• Period: 3 half-years - 2000, 2002, 2004
• All articles of length > 6 pages containing "p=" at least once
• A sample of 20 from each period: 60 articles
No differences between periods - results are reported pooled over periods.
• 44/60 reported clinical trials' results
How was multiplicity addressed?

Type of Correction                       # of Articles
Bonferroni                               6 (2)
O'Brien-Fleming                          1
Hochberg                                 1
Holm                                     1
Lan-DeMets                               1
More than 3 SD                           1
Primary at .04; two secondary at .0175   1
None                                     47
(out of 60 articles)
Success: All studies define primary endpoints
[Bar chart: frequencies of the number of primary and secondary endpoints per paper.]
Multiple endpoints
• No article had a single endpoint
• Only 2 articles corrected for multiple endpoints
• 80% define a single primary endpoint
• In many cases there is no clear distinction between primary and secondary endpoints
• Even when a correction was made, it was adjusted for a partial list.
(Note: Rami vs Yoav)
Multiple confidence intervals: two different concerns
• The effect of simultaneity:
Pr(all intervals cover their parameters) < 0.95
The goal of simultaneous CIs, such as Bonferroni-adjusted CIs, is to assure that Pr(all cover) ≥ 0.95.
• The effect of selection:
When only a subset of the parameters is selected for highlighting, for example the significant findings, even the average coverage is < 0.95.
Implications of selection on average coverage
2/11 do not cover with no selection; 2/3 do not cover when selecting significant coefficients.
(BY & Yekutieli '05: FDR ideas for confidence intervals)
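The effect of selection on coverage is easy to reproduce in a toy simulation (a sketch, not the BY-Yekutieli construction; the normal prior on the parameters is an assumption): each parameter gets a marginal 95% CI, overall coverage is fine, but among the parameters highlighted as "significant" the coverage drops well below 95%.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
theta = rng.normal(0, 1, N)          # true parameters (assumed prior)
x = theta + rng.normal(0, 1, N)      # unbiased estimates with known unit variance

lo, hi = x - 1.96, x + 1.96          # marginal 95% CIs
covered = (lo <= theta) & (theta <= hi)

selected = np.abs(x) > 1.96          # highlight only the "significant" estimates
coverage_all = covered.mean()                 # close to 0.95
coverage_selected = covered[selected].mean()  # well below 0.95
```

Selection biases the highlighted estimates away from zero, so their intervals, centered on inflated estimates, miss the true parameters far more often than the nominal 5%.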
So what?
At the MCP 2005 conference in Shanghai, the head of a statistical unit at the American FDA brought amazing numbers:
more than half of the Phase III studies fail to show the effect they were designed to show.
Is it at least partly because clinical trials are analyzed "loosely" in terms of multiplicity before standing up to the regulatory agencies, and thus their results are not replicable?
More comments in view of Ioannidis' paper at a later time.
6. Functional Magnetic Resonance Imaging (fMRI)
• Study of the functioning brain: where is the brain active when we perform a mental task?
• Example
The unit of data is the volume pixel - the voxel: 64 x 64 per slice x 16 slices.
A comparison of the experimental factor per voxel: inference on ~64K voxels.
Assuring replicability in fMRI analysis: Part II
Multiplicity was addressed early on:
1. Controlling the probability of making a false discovery even in a single voxel:
   a. FWE control using random field theory of level sets (Worsley & Friston '95, Adler's theory)
   b. FWE control with resampling (Nichols & Holmes '96)
   c. Extra power by limiting the number of voxels tested, using Regions Of Interest (ROI) from an independent study
2. Genovese, Lazar & Nichols ('02) introduced FDR into fMRI analysis. FDR voxel-based analysis is available in most software packages, e.g. Brain Voyager, SPM, fMRIstat.
fMRI: More to do on the multiplicity front
• Working with regions rather than voxels:
  - utilizing the fact that activity is in regions
  - defining an appropriate FDR on regions
  - trimming non-active voxels from active regions
• Using adaptive FDR procedures
etc.
Pacifico et al ('04); Heller et al ('05); Heller & YB ('06)
"The Good, the Bad and the Ugly"
Assuring replicability in fMRI analysis: Part I
Initially, results were reported for each subject separately. Then fixed-model ANOVA was used to analyze the multiple subjects - a within-subject "yardstick for variability" only.
Concern about between-subject variability was raised more recently. Mixed-model analysis, with random effects for subjects, is now available in the main software tools; this is called multi-subject analysis.
Obviously, the number of degrees of freedom is much smaller and the variability is larger.
[Brain activation maps: multi-subject analysis using random effects vs a single-subject analysis.]
Tricks of the trade: using the correlation at the first session as a pilot, q1 = 2/3; testing at the second session at q2 = .075
Why is random-effects (mixed-model) analysis insensitive?
1. Variability between subjects about the location of activity
2. Task-specific variability of location between subjects
3. Variability about the size of the activity
4. Problems in mapping different subjects to a single map of the brain
5. Use of uniform smoothing across the brain to solve problem (3) reduces the signal per voxel
6. The pattern of change in the signal along time differs between individuals
Epilog
• The debate Fixed-vs-Random is fierce in the community of neuroimagers. Acceptance at the best journals seems to depend on chance: will the article meet a "Random Effects Referee"?
• One can read that "the multi-subject analysis is less sensitive, so results were not corrected for multiplicity".
• Researchers sometimes resort to more questionable ways to control for multiplicity and across-subject variability, e.g. conjunction analysis.
Epilog
Using the statistic T_vi to test, for each subject i,
H_0vi: there is no effect at voxel v for subject i.
Conjunction analysis: intersect the individual subjects' maps.
Friston, Worsley and others compare T_v = min_{1 ≤ i ≤ n} T_vi to a
(lower) random-field-theory based threshold.
Nichols et al. (2005): the "complete null" is tested at each voxel, so a rejection merely indicates that for at least one subject there is an effect at the voxel.
(Is that enough to assure replicability?)
Instead, T_v should be compared to the regular threshold, to test the
hypothesis that all subjects have an effect, and then multiplicity strictly controlled.
Epilog
• Friston et al. (2005) object to this proposal because of the loss of power; they suggest testing "there is an effect in at least u out of n subjects" as the alternative, and then strictly controlling multiplicity*
• It is clear that a compromise is needed: one that addresses replicability - both multiplicity and between-subject variability - as well as sensitivity.
• Is this the case where Tukey's ideas about 0 < c < 1 may become essential?
It may very well be!
In summary
Assuring the replicability of the results of an experiment is at the heart of the scientific dogma.
Watch out for two statistical dangers to replicability:
– Ignoring the variability in those selected to be studied - thus using the wrong "yardstick for variability"
– Selecting to emphasize your best results
The second problem emerges naturally when multiple inferences are made and multiplicity is ignored.
The FDR website: www.math.tau.ac.il/~ybenja
Further details about the failures to adjust, out of 60 articles:

MCP type              Clinical trial   Number of families   Number corrected   Percent corrected
Multiple subsets      Yes              27                   3                  11%
Multiple subsets      No               10                   0                  0%
Multiple peeks        Yes              19                   4                  21%
Multiple peeks        No               6                    1                  16%
Many-to-one           Yes              7                    2                  28%
Many-to-one           No               4                    1                  25%
Pairwise comparison   Yes              5                    0                  0%
Pairwise comparison   No               0                    0                  0%
All                   Yes              58                   9                  16%
All                   No               20                   2                  10%
All articles          All              78                   11                 14%
The False Discovery Rate (FDR) criterion
Benjamini and Hochberg (1995)
R = # rejected hypotheses = # discoveries
V of these may be in error = # false discoveries
The error (type I) in the entire study is measured by

    Q = V / R   if R > 0
    Q = 0       if R = 0

i.e. Q is the proportion of false discoveries among the discoveries (0 if none are found).
FDR = E(Q)
Does it make sense?
Does it make sense?
• Inspecting 20 features:
  1 false among 20 discovered - bearable
  1 false among 2 discovered - unbearable
  This error rate is adaptive and also has an economic interpretation.
• Inspecting 100 features, the above remains the same.
  So this error rate is also scalable.
• If nothing is "real", controlling the FDR at level q guarantees that the probability of making even one false discovery is at most q: when all nulls are true, V = R, so Q = 1 exactly when V ≥ 1, and FDR = E(Q) = P(V ≥ 1), the familywise error rate.
• This is why we choose usual levels of q, say 0.05.
• But otherwise there is room for improving detection power:
FDR controlling procedures
Linear step-up procedure (BH, FDR)
Let P_i be the observed p-value of the test of H_i, i = 1, 2, …, m.
• Order the p-values: P_(1) ≤ P_(2) ≤ … ≤ P_(m)
• Let k = max{ i : P_(i) ≤ (i/m) q }
• Reject H_(1), H_(2), …, H_(k)
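The step-up rule above is short enough to implement directly. A minimal sketch (the function name and the illustrative p-values are my own choices):

```python
def bh_stepup(pvals, q=0.05):
    """Benjamini-Hochberg linear step-up procedure.
    Returns the indices of the rejected hypotheses."""
    m = len(pvals)
    # Sort indices by p-value: P_(1) <= P_(2) <= ... <= P_(m)
    order = sorted(range(m), key=lambda i: pvals[i])
    # k = max{ i : P_(i) <= (i/m) q }, with 1-based ranks
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    # Reject H_(1), ..., H_(k)
    return sorted(order[:k])

pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
         0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
print(bh_stepup(pvals, q=0.05))   # [0, 1, 2, 3]: the 4 smallest are rejected
```

Note the step-up character: k is the largest rank whose p-value clears its own cutoff (i/m)q, so a hypothesis with rank below k is rejected even if its own p-value exceeds (i/m)q, as long as some larger p-value clears its threshold.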
FDR control of the linear step-up procedure (BH)
Suppose m_0 ≤ m of the hypotheses are true.
If the test statistics are independent, or positive dependent (e.g. positively correlated, normally distributed):

    FDR ≤ (m_0 / m) q ≤ q

In general:

    FDR ≤ (m_0 / m) q (1 + 1/2 + 1/3 + … + 1/m) ≈ (m_0 / m) q log(m)
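The first bound can be checked by Monte Carlo. A self-contained sketch (all parameters - m = 200 one-sided z-tests with m_0 = 150 true nulls, alternatives at mean 3, q = 0.05, 2000 replications - are arbitrary choices of mine, and the step-up rule is re-implemented inline so the snippet stands alone); the average of Q = V / max(R, 1) should come out near (m_0/m) q = 0.0375, safely below q:

```python
import math
import random

def bh_count(pvals, n_true_null, q=0.05):
    """Run the BH step-up rule; hypotheses 0..n_true_null-1 are true nulls.
    Returns (V, R): false discoveries and total discoveries."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    rejected = order[:k]
    v = sum(1 for i in rejected if i < n_true_null)
    return v, len(rejected)

def one_sided_p(z):
    # P(Z >= z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))

random.seed(1)
m, m0, q, reps = 200, 150, 0.05, 2000
q_sum = 0.0
for _ in range(reps):
    # True nulls: z ~ N(0,1); false nulls: z ~ N(3,1), giving small p-values
    z = [random.gauss(0, 1) for _ in range(m0)] + \
        [random.gauss(3, 1) for _ in range(m - m0)]
    v, r = bh_count([one_sided_p(zi) for zi in z], m0, q)
    q_sum += v / max(r, 1)
fdr_hat = q_sum / reps
print(round(fdr_hat, 4))   # should sit near (m0/m) * q = 0.0375, below q
```

Under independence with continuous test statistics the BH rule attains (m_0/m) q exactly, which is why the simulated average lands near 0.0375 rather than at the nominal 0.05.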