July 2006 SAMSI
Some Thoughts on Replicability in Science
Yoav Benjamini
Tel Aviv University
www.math.tau.ac.il/~ybenja
Based on joint work with
Ilan Golani, Department of Zoology, Tel Aviv University
Greg Elmer, Neri Kafkafi, Behavioral Neuroscience Branch, National Institute on Drug Abuse/IRP, Baltimore, Maryland
Dani Yekutieli, Anat Sakov, Ruth Heller, Rami Cohen, Department of Statistics, Tel Aviv University
Dani Yekutieli, Yosi Hochberg, Department of Statistics, Tel Aviv University
Outline of Lecture
1. Prolog
2. The replicability problems in behavior genetics
3. Addressing strain*lab interaction
4. Addressing multiple endpoints
5. The replicability problems in medical statistics
6. The replicability problems in functional Magnetic Resonance Imaging (fMRI)
7. Epilog
1. Prolog
J. W. Tukey's last paper (with Jones and Lewis) was an entry on Multiple Comparisons for the International Encyclopedia of Statistics.
It started with a general discussion: multiple comparisons addresses "a diversity of issues ... that tend to be important, difficult, and often unresolved."
• Multiple comparisons; multiple determinations
• Selection of one or more candidates
• Selection of variables; selecting their transformations; etc.
(... his usual advice: there need not be a single best ...)
The Mixed Puzzle
Then, the Encyclopedia entry included two issues in detail:
• The False Discovery Rate (FDR) approach in pairwise comparisons
• The random-effects vs fixed-effects ANOVA
"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by
appropriate error term = f-error term + c [r-error term - f-error term]
... It pays then to learn as much as possible about values of c in the real world."
But what's that to do with Multiple Comparisons?
2. Behavior genetics
• Study the genetics of behavioral traits: hearing, sight, smell, alcoholism, locomotion, fear, exploratory behavior
• Compare behavior between inbred strains, crosses, knockouts...
• Number of behavioral endpoints: ~200 and growing
The entry Tukey wrote was about Replicability
The search for replicable scientific methods
• Fisher's The Design of Experiments (1935):
"In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results." (p. 14)
i.e. the significance level is interpreted directly in replications of the experiment.
The discussion motivates the inclusion of more extreme results in the rejection region.
Replicability
"Behavior Genetics in transition" (Mann, Science '94):
"...jumping too soon to discoveries..." (and press discoveries) raises the issue of replicability.
Mann identifies statistics as a major source of troubles, yet did not mention the two main themes we'll address.
The common cry: lack of standardization (e.g. Köln 2000)
Does it work?
• Crabbe et al (Science 1999): the same experiment at 3 labs
• In spite of strict standardization, they found: a Strain effect, a Lab effect, and a Lab*Strain interaction
From their conclusions:
"Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory."
"...differences between labs... can contribute to failures to replicate results of Genetic Experiments" Wahlsten (2001)
A concrete example: exploratory behavior
NIH: Phenotyping Mouse Behavior - high-throughput screening of mutant mice. Comparing between 8 inbred strains of mice. Dr. Ilan Golani (TAU), Dr. Elmer (MPRC), Dr. Kafkafi (NIDA)
Behavior tracking
Using sophisticated data-analytic tools we get, for "segment acceleration" (log-transformed):
The display supporting this claim, for Distance Traveled (m).

The statistical analysis supporting this claim, for proportion of time in center (logit):

Source      df   MSE     F      p-value
Strain       7   102.5   44.8   0.00001
Lab          2   6.35    2.77   0.065
Lab*Strain  14   6.87    3.00   0.00028
Residuals  264   2.29

and it is a common problem:
Fig. 2. The proportion of variance contributed by each factor, computed as its sum of squares divided by the total sum of squares, for all endpoints. Endpoints are sorted by their genotypic variance. Asterisks mark interaction terms that were found significant by the FM at a level of 0.05.
[Stacked-bar chart, 0%-100% of total variance, over 17 endpoints (Stops per Excursion, Latency to Half Max Speed, Lingering Spatial Spread, Relative Activity Decrease, # Excursions, Homebase Relative Occupancy, Radius of Turn, Diversity, Lingering Mean Speed, Segment Acceleration, # Progression Segments, Center Time, Segment Length, Segment Max Speed, Rate of Turn, Lingering Time, Distance Traveled), with components Genotype, Individual, Laboratory, and Interaction.]
Kafkafi & YB et al, PNAS '05
Our statistical diagnosis of the replicability problem
Part I. Using the wrong "yardstick for variability": fixed-model analysis, treating labs' effects as fixed.
Part II. Multiplicity problems: many endpoints; repeated testing (screening).
(Kafkafi & YB et al, PNAS '05)
3. Part I: The mixed model
• The existence of Lab*Strain interaction does not diminish the credibility of a behavioral endpoint - in this sense it is not a problem
• This interaction should be recognized as "a fact of life"
• The interaction's size is the right "yardstick" against which genetic differences should be compared
Statistically speaking: Lab is a random factor, as is its interaction with Strain. A mixed model should be used (rather than a fixed one).
The formal mixed model
Y_LSI is the value of an endpoint for laboratory L and strain S; the index I represents the repetition within each group:

Y_LSI = s_S + a_L + b_L*S + ε_LSI

s_S is the strain effect, which is considered fixed;
a_L ~ N(0, σ²_LAB) is the laboratory random effect;
b_L*S ~ N(0, σ²_LAB*STRAIN) is the interaction random effect;
ε_LSI ~ N(0, σ²) is the individual variability.
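The mixed model can be sketched in a short simulation. The group sizes match the study design (3 labs, 8 strains, ~12 mice per cell), but the variance components and strain effects below are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lab, n_strain, n_rep = 3, 8, 12            # 3 labs, 8 strains, 12 mice per cell
sd_lab, sd_int, sd_ind = 0.7, 0.6, 1.5       # assumed sigma_LAB, sigma_LAB*STRAIN, sigma

s = np.linspace(-2.0, 2.0, n_strain)         # fixed strain effects s_S (illustrative)
a = rng.normal(0, sd_lab, n_lab)             # random lab effects a_L
b = rng.normal(0, sd_int, (n_lab, n_strain)) # random interaction effects b_{L*S}
eps = rng.normal(0, sd_ind, (n_lab, n_strain, n_rep))  # individual variability

# Y_LSI = s_S + a_L + b_{L*S} + eps_LSI
y = s[None, :, None] + a[:, None, None] + b[:, :, None] + eps
```

A new replication of the experiment in a fresh lab redraws a_L and b_{L*S}, which is exactly why the interaction, and not only the residual, is the relevant yardstick for strain differences.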
Implications of the mixed model

Source      df   MSE     F      p-value
Strain       7   102.5   44.8   0.00001
Lab          2   6.35    2.77   0.09
Lab*Strain  14   6.87    3.00   0.00028
Residuals  264   2.29

Under the mixed model, Strain is tested against the Lab*Strain mean square rather than the residual: F = 102.5/6.87 ≈ 14.9 on (7, 14) df instead of 44.8 on (7, 264) df, and the Lab test likewise loses denominator df (p = 0.09 rather than 0.065). The slide also overlays estimates of σ²_LAB and σ²_LAB*STRAIN.
• Technically: the threshold for significant strain differences can be much higher.
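The change of yardstick can be computed directly from the ANOVA table (MSE 102.5 for Strain, 6.87 for Lab*Strain, 2.29 for residuals); scipy is used here only to convert the F ratios to p-values.

```python
from scipy.stats import f

ms = {"Strain": 102.5, "Lab": 6.35, "Lab*Strain": 6.87, "Residuals": 2.29}
df = {"Strain": 7, "Lab": 2, "Lab*Strain": 14, "Residuals": 264}

# Fixed-effects test: Strain MS against the residual MS
F_fixed = ms["Strain"] / ms["Residuals"]          # about 44.8
p_fixed = f.sf(F_fixed, df["Strain"], df["Residuals"])

# Mixed-model test: Strain MS against the Lab*Strain interaction MS
F_mixed = ms["Strain"] / ms["Lab*Strain"]         # about 14.9, on (7, 14) df
p_mixed = f.sf(F_mixed, df["Strain"], df["Lab*Strain"])
```

The mixed yardstick is strictly harder to beat: the F statistic shrinks and the denominator degrees of freedom drop from 264 to 14.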
Implications of the mixed model
• Practically:
1. For screening new mutants, significance is assessed against the yardstick σ²_LAB + σ²_LAB*STRAIN + σ²/n.
2. For screening new mutants vs a locally measured background, significance is assessed against 2σ²_LAB*STRAIN + σ²(1/n + 1/m).
Unfortunately, even as sample sizes increase, the interaction term does not disappear.
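A quick arithmetic sketch of these yardsticks. The variance-component values are made-up placeholders, and the exact formulas are a plausible reading of the slide (lab effect present when comparing across labs, cancelling for a within-lab comparison), not a quote.

```python
# Assumed (made-up) variance components and group sizes
s2_lab, s2_int, s2_ind = 0.4, 0.4, 2.3   # sigma^2_LAB, sigma^2_LAB*STRAIN, sigma^2
n, m = 12, 12

# 1. New mutant screened against strain values measured in other labs
var_new_mutant = s2_lab + s2_int + s2_ind / n

# 2. New mutant vs a locally measured background strain (lab effect cancels,
#    but each strain keeps its own interaction with the lab)
var_vs_local = 2 * s2_int + s2_ind * (1 / n + 1 / m)

# Fixed-model yardstick: only individual variability
var_fixed = s2_ind * (1 / n + 1 / m)

# Larger n and m shrink only the sigma^2 terms; the interaction floor remains
floor = 2 * s2_int
```

Even with n = m → ∞, var_vs_local only approaches 2σ²_LAB*STRAIN, which is the sense in which the interaction term "does not disappear".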
Implications of the mixed model
3. For developing new endpoints: a single lab cannot suffice for the development of a new endpoint - no yardstick is available.
Thus it is the developer's responsibility to offer estimates of the interaction variability and to put them in a public dataset (such as the Jackson Laboratory's).
Behavioral Endpoint              Fixed     Mixed
Prop. Lingering Time             0.00001   0.0029
# Progression segments           0.00001   0.0068
Median Turn Radius (scaled)      0.00001   0.0092
Time away from wall              0.00001   0.0108
Distance traveled                0.00001   0.0144
Acceleration                     0.00001   0.0146
# Excursions                     0.00001   0.0178
Time to half max speed           0.00001   0.0204
Max speed wall segments          0.00001   0.0257
Median Turn rate                 0.00001   0.0320
Spatial spread                   0.00001   0.0388
Lingering mean speed             0.00001   0.0588
Homebase occupancy               0.001     0.0712
# stops per excursion            0.0028    0.1202
Stop diversity                   0.027     0.1489
Length of progression segments   0.44      0.5150
Activity decrease                0.67      0.8875
Significance of 8 strain differences (p-values under the fixed and mixed analyses)
In summary of Part I
Practically, the threshold for making discoveries, in all aspects, is set at a higher level.
Is this a drawback? It is a way to weed out non-replicable differences.
What about the warning: "Never use a random factor unless your levels are a true random sample"?
• We do not agree: replicability in a new lab is at least partially captured by a random-effects model.
• Revisit Jones, Lewis and Tukey (2002)
"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by
appropriate error term = f-error term + c [r-error term - f-error term]
so that 'everything fixed' corresponds to c = 0 and 'random' is a particular case of c = 1. It pays then to learn as much as possible about values of c in the real world."
[Slide examples: c = 0.5, fixed columns and illustrative weights; c = 1.6, illustrative columns.]
A challenge: can we estimate c?
(The slide repeats the mixed-model p-values for the 17 endpoints, from 0.0029 for Prop. Lingering Time to 0.8875 for Activity decrease.)
Significance of 8 strain differences:
Should we believe all p-values ≤ 0.05? Not necessarily - beware of multiplicity!
4. Part II: The multiplicity problem
• The more statistical tests in a study, the larger the probability of making a type I error
• Stricter control means less power to discover a real effect
Traditional approaches:
• "Don't worry, be happy": conduct each test at the usual .05 level (e.g. Ioannidis' PLoS paper)
• Panic! Control the probability of making even a single type I error in the entire study at the usual level (e.g. Bonferroni)
Panic causes severe loss of power to discover in large problems.
(Again the mixed-model p-values for the 17 endpoints.)
Significance of 8 strain differences:
Should we believe all p-values ≤ 0.05? Not necessarily - beware of multiplicity!
Should we use Bonferroni? The Bonferroni threshold is 0.05·1/17 = .0029, so only the smallest p-value (Prop. Lingering Time, 0.0029) survives.
Panic causes severe loss of power to discover in large problems.
• "Genetic dissection of complex traits: guidelines for interpreting...", Lander and Kruglyak:
• "Adopting too lax a standard guarantees a burgeoning literature of false positive linkage claims, each with its own symbol... Scientific disciplines erode their credibility when a substantial proportion of claims cannot be replicated..."
• "On the other hand, adopting too high a hurdle for reporting results runs the risk that a nascent field will be stillborn."
Is there an in-between approach?
The False Discovery Rate (FDR) criterion
The FDR approach takes seriously the concern of Lander & Kruglyak.
The error in the entire study is measured by
Q = the proportion of false discoveries among the discoveries (= 0 if none found), and
FDR = E(Q)
• If nothing is "real", controlling the FDR at level q guarantees that the probability of making even one false discovery is less than q. This is why we choose usual levels of q, say 0.05.
• But otherwise there is room for improving detection power.
• This error rate is scalable, adaptive, and economically interpretable.
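The definition FDR = E(Q) can be illustrated by a toy simulation: the mixture of true and false nulls below is an arbitrary assumption, and the Benjamini-Hochberg step-up procedure is used to control FDR at q = 0.05.

```python
import numpy as np
from scipy.stats import norm

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest i with p_(i) <= q*i/m."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = sorted_p <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(42)
m, m0, q, reps = 100, 80, 0.05, 2000           # 80 true nulls, 20 real effects
fdp = np.empty(reps)
for r in range(reps):
    z = np.concatenate([rng.normal(0, 1, m0),      # true nulls
                        rng.normal(3, 1, m - m0)]) # real effects
    p = norm.sf(z)                                 # one-sided p-values
    rej = bh_reject(p, q)
    R, V = rej.sum(), rej[:m0].sum()               # discoveries, false discoveries
    fdp[r] = V / R if R > 0 else 0.0               # Q for this simulated study

# fdp.mean() estimates FDR = E(Q); under independence BH yields q*m0/m = 0.04
```

Note that Q itself varies from study to study; only its expectation is controlled.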
Our motivating work was Sorić (JASA 1989): if we use size-0.05 tests to decide upon statistical discoveries, then
"there is danger that a large part of science is not true".
We define Q = V/R if R > 0, and Q = 0 if R = 0; Sorić used E(V)/R for his demonstrations.
More recently, Ioannidis (PLoS Medicine '05) just repeated the argument using the Positive Predictive Value (PPV),
PPV = 1 - Q,
stating "most published research findings are false". For demonstration in his model he used
PPV = 1 - E(V)/E(R) = 1 - FDR.
Control of the FDR assures a large PPV under most of Ioannidis' scenarios (except biases such as omission, publication, and interest).
Behavioral Endpoint              Mixed     BH threshold (0.05·k/17)
Prop. Lingering Time             0.0029    0.05·1/17 = 0.0029
# Progression segments           0.0068
Median Turn Radius (scaled)      0.0092
Time away from wall              0.0108
Distance traveled                0.0144
Acceleration                     0.0146
# Excursions                     0.0178
Time to half max speed           0.0204
Max speed wall segments          0.0257    0.05·9/17 = 0.0265
Median Turn rate                 0.0320    0.05·10/17 = 0.0294
Spatial spread                   0.0388    0.05·11/17 = 0.0324
Lingering mean speed             0.0588
Homebase occupancy               0.0712
# stops per excursion            0.1202
Stop diversity                   0.1489    0.05·15/17
Length of progression segments   0.5150    0.05·16/17
Activity decrease                0.8875    0.05·17/17
Significance of 8 strain differences. The largest k with p(k) ≤ 0.05·k/17 is k = 9, so the nine smallest p-values are declared FDR discoveries (vs one under Bonferroni).
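Applied to the 17 mixed-model p-values, unadjusted 0.05 testing, Bonferroni, and the BH step-up procedure give three different discovery counts; a sketch of the comparison made on the slide:

```python
# The 17 mixed-model p-values from the endpoint table
pvals = [0.0029, 0.0068, 0.0092, 0.0108, 0.0144, 0.0146, 0.0178, 0.0204,
         0.0257, 0.0320, 0.0388, 0.0588, 0.0712, 0.1202, 0.1489, 0.5150, 0.8875]
m, q = len(pvals), 0.05

unadjusted = sum(p <= q for p in pvals)        # "don't worry": each test at 0.05
bonferroni = sum(p <= q / m for p in pvals)    # "panic": threshold 0.05/17

# BH step-up: find the largest k with p_(k) <= q*k/m; the k smallest are discoveries
sorted_p = sorted(pvals)
bh = max((i for i in range(1, m + 1) if sorted_p[i - 1] <= q * i / m), default=0)
```

BH declares nine discoveries here, versus eleven with no correction and only one under Bonferroni.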
Addressing multiplicity by controlling the FDR
FDR control is a very active area of current research, mainly because of its scalability (even into millions...):
• types of dependency
• resampling procedures
• FDR-adjusted p-values
• adaptive procedures
• Bayesian interpretations
• related error rates
• model selection
Is there a mixed-multiplicity connection?
Recall that in the puzzling Encyclopedia entry only two issues were addressed in detail:
• The FDR approach in pairwise comparisons
• The random-effects vs fixed-effects problem
The mixed-multiplicity connection
• In the fixed framework we selected three labs and made our analysis as if this were our entire world of reference.
• When this is not the case - as when the experiment is repeated in a different lab - the fixed point of view is overly optimistic.
• This is also the essence of the multiplicity problem - say, selecting the maximal difference (with the smallest p-value) and treating it as if it were our only comparison.
In both cases, conclusions from a naïve point of view have too great a chance to be non-replicable.
5. Replicability in medical research: hormone therapy in postmenopausal women
• A very large and long randomized controlled study (Women's Health Initiative; Rossouw, Anderson, Prentice, LaCroix; JAMA, 2002). The study was not performed for drug approval.
• It was stopped before completion because the expected effects were reversed.
• Bonferroni-adjusted and marginal (nominal) CIs were reported.
• The conclusions were contradictory: the decision to stop the trial was based on the marginal CIs.
The editorial:
"The authors present both nominal and rarely used adjusted CIs to take into account multiple testing, thus widening the CIs. Whether such adjustment should be used has been questioned, ... ". (Fletcher and Colditz, 2002)
Our puzzle: US and European regulatory bodies require adjusting clinical-trial results for multiplicity. So, is the statement true?
A small meta-analysis of methods (with Rami Cohen): check with the flagship of medical research:
Sampling the New England Journal of Medicine
• Period: 3 half-years - 2000, 2002, 2004
• All articles of length > 6 pages containing "p=" at least once
• A sample of 20 from each period: 60 articles
No differences between periods - results are reported pooled over periods.
• 44/60 reported clinical trials' results
How was multiplicity addressed?

Type of Correction                       # of Articles
Bonferroni                               6 (2)
O'Brien-Fleming                          1
Hochberg                                 1
Holm                                     1
Lan-DeMets                               1
More than 3 SD                           1
Primary at .04; two secondary at .0175   1
None                                     47
(out of 60 articles)
Success: All studies define primary endpoints
[Bar chart: frequencies of the number of primary and secondary endpoints per paper.]
Multiple endpoints
• No article had a single endpoint
• Only 2 articles corrected for multiple endpoints
• 80% define a single primary endpoint
• In many cases there is no clear distinction between primary and secondary endpoints
• Even when a correction was made, it was adjusted for a partial list.
(Note: Rami vs Yoav)
Multiple confidence intervals: two different concerns
• The effect of simultaneity:
Pr(all intervals cover their parameters) < 0.95
The goal of simultaneous CIs, such as Bonferroni-adjusted CIs, is to assure that Pr(all cover) ≥ 0.95.
• The effect of selection:
When only a subset of the parameters is selected for highlighting, for example the significant findings, even the average coverage is < 0.95.
Implications of selection on average coverage
2/11 do not cover with no selection; 2/3 do not cover when selecting significant coefficients.
(BY & Yekutieli '05: FDR ideas for confidence intervals)
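The effect of selection on coverage is easy to reproduce in a toy simulation (a sketch, not the BY-Yekutieli construction; the normal prior on the parameters is an assumption): each parameter gets a marginal 95% CI, overall coverage is fine, but among the parameters highlighted as "significant" the coverage drops well below 95%.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
theta = rng.normal(0, 1, N)          # true parameters (assumed prior)
x = theta + rng.normal(0, 1, N)      # unbiased estimates with known unit variance

lo, hi = x - 1.96, x + 1.96          # marginal 95% CIs
covered = (lo <= theta) & (theta <= hi)

selected = np.abs(x) > 1.96          # highlight only the "significant" estimates
coverage_all = covered.mean()                 # close to 0.95
coverage_selected = covered[selected].mean()  # well below 0.95
```

Selection biases the highlighted estimates away from zero, so their intervals, centered on inflated estimates, miss the true parameters far more often than the nominal 5%.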
So what?
At the MCP 2005 conference in Shanghai, the head of a statistical unit at the American FDA brought amazing numbers:
more than half of the Phase III studies fail to show the effect they were designed to show.
Is it at least partly because clinical trials are analyzed "loosely" in terms of multiplicity before standing up to the regulatory agencies, and thus their results are not replicable?
More comments in view of Ioannidis' paper at a later time.
6. Functional Magnetic Resonance Imaging (fMRI)
• Study of the functioning brain: where is the brain active when we perform a mental task?
• Example
The unit of data is the volume pixel - the voxel: 64 x 64 per slice x 16 slices.
A comparison of the experimental factor per voxel: inference on ~64K voxels.
Assuring replicability in fMRI analysis: Part II
Multiplicity was addressed early on:
1. Controlling the probability of making a false discovery even in a single voxel:
   a. FWE control using random field theory of level sets (Worsley & Friston '95, Adler's theory)
   b. FWE control with resampling (Nichols & Holmes '96)
   c. Extra power by limiting the number of voxels tested, using Regions Of Interest (ROI) from an independent study
2. Genovese, Lazar & Nichols ('02) introduced FDR into fMRI analysis. FDR voxel-based analysis is available in most software packages, e.g. Brain Voyager, SPM, fMRIstat.
fMRI: More to do on the multiplicity front
• Working with regions rather than voxels:
  - utilizing the fact that activity is in regions
  - defining an appropriate FDR on regions
  - trimming non-active voxels from active regions
• Using adaptive FDR procedures
etc.
Pacifico et al ('04); Heller et al ('05); Heller & YB ('06)
"The Good, the Bad and the Ugly"
Assuring replicability in fMRI analysis: Part I
Initially, results were reported for each subject separately. Then fixed-model ANOVA was used to analyze the multiple subjects - a within-subject "yardstick for variability" only.
Concern about between-subject variability was raised more recently. Mixed-model analysis, with random effects for subjects, is now available in the main software tools; this is called multi-subject analysis.
Obviously, the number of degrees of freedom is much smaller and the variability is larger.
[Brain activation maps: multi-subject analysis using random effects vs a single-subject analysis.]
Tricks of the trade: using the correlation at the first session as a pilot, q1 = 2/3; testing at the second session at q2 = .075
Why is random-effects (mixed-model) analysis insensitive?
1. Variability between subjects about the location of activity
2. Task-specific variability of location between subjects
3. Variability about the size of the activity
4. Problems in mapping different subjects to a single map of the brain
5. Use of uniform smoothing across the brain to solve problem (3) reduces the signal per voxel
6. The pattern of change in the signal along time differs between individuals
Epilog
• The debate Fixed-vs-Random is fierce in the community of neuroimagers. Acceptance at the best journals seems to depend on chance: will the article meet a "Random Effects Referee"?
• One can read that "the multi-subject analysis is less sensitive, so results were not corrected for multiplicity".
• Researchers sometimes resort to more questionable ways to control for multiplicity and across-subject variability, e.g. conjunction analysis.
Epilog
Using the statistic T_vi to test, for each subject i,
H_0vi: there is no effect at voxel v for subject i.
Conjunction analysis: intersect the individual subjects' maps.
Friston, Worsley and others compare T_v = min_{1 ≤ i ≤ n} T_vi to a
(lower) random-field-theory based threshold.
Nichols et al. (2005): the "complete null" is tested at each voxel, so a rejection merely indicates that for at least one subject there is an effect at the voxel.
(Is that enough to assure replicability?)
Instead, T_v should be compared to the regular threshold, to test the
hypothesis that all subjects have an effect, and then multiplicity strictly controlled.
Epilog
• Friston et al. (2005) object to this proposal because of the loss of power; they suggest testing "there is an effect in at least u out of n subjects" as the alternative, and then strictly controlling multiplicity*
• It is clear that a compromise is needed: one that addresses replicability - both multiplicity and between-subject variability - as well as sensitivity.
• Is this the case where Tukey's ideas about 0 < c < 1 may become essential?
It may very well be!
In summary
Assuring the replicability of the results of an experiment is at the heart of the scientific dogma.
Watch out for two statistical dangers to replicability:
– Ignoring the variability in those selected to be studied - thus using the wrong "yardstick for variability"
– Selecting to emphasize your best results
The second problem emerges naturally when multiple inferences are made and multiplicity is ignored.
The FDR website: www.math.tau.ac.il/~ybenja
Further details about the failures to adjust, out of 60 articles:

MCP type              Clinical trial   Number of families   Number corrected   Percent corrected
Multiple subsets      Yes              27                   3                  11%
Multiple subsets      No               10                   0                  0%
Multiple peeks        Yes              19                   4                  21%
Multiple peeks        No               6                    1                  16%
Many-to-one           Yes              7                    2                  28%
Many-to-one           No               4                    1                  25%
Pairwise comparison   Yes              5                    0                  0%
Pairwise comparison   No               0                    0                  0%
All                   Yes              58                   9                  16%
All                   No               20                   2                  10%
All articles          All              78                   11                 14%
The False Discovery Rate (FDR) criterion
Benjamini and Hochberg (1995)
R = # rejected hypotheses = # discoveries
V of these may be in error = # false discoveries
The error (type I) in the entire study is measured by

    Q = V / R   if R > 0
    Q = 0       if R = 0

i.e. Q is the proportion of false discoveries among the discoveries (0 if none are found).
FDR = E(Q)
Does it make sense?
Does it make sense?
• Inspecting 20 features:
  1 false among 20 discovered - bearable
  1 false among 2 discovered - unbearable
  This error rate is adaptive and also has an economic interpretation.
• Inspecting 100 features, the above remains the same.
  So this error rate is also scalable.
• If nothing is "real", controlling the FDR at level q guarantees that the probability of making even one false discovery is at most q: when all nulls are true, V = R, so Q = 1 exactly when V ≥ 1, and FDR = E(Q) = P(V ≥ 1), the familywise error rate.
• This is why we choose usual levels of q, say 0.05.
• But otherwise there is room for improving detection power:
FDR controlling procedures
Linear step-up procedure (BH, FDR)
Let P_i be the observed p-value of the test of H_i, i = 1, 2, …, m.
• Order the p-values: P_(1) ≤ P_(2) ≤ … ≤ P_(m)
• Let k = max{ i : P_(i) ≤ (i/m) q }
• Reject H_(1), H_(2), …, H_(k)
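The step-up rule above is short enough to implement directly. A minimal sketch (the function name and the illustrative p-values are my own choices):

```python
def bh_stepup(pvals, q=0.05):
    """Benjamini-Hochberg linear step-up procedure.
    Returns the indices of the rejected hypotheses."""
    m = len(pvals)
    # Sort indices by p-value: P_(1) <= P_(2) <= ... <= P_(m)
    order = sorted(range(m), key=lambda i: pvals[i])
    # k = max{ i : P_(i) <= (i/m) q }, with 1-based ranks
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    # Reject H_(1), ..., H_(k)
    return sorted(order[:k])

pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
         0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
print(bh_stepup(pvals, q=0.05))   # [0, 1, 2, 3]: the 4 smallest are rejected
```

Note the step-up character: k is the largest rank whose p-value clears its own cutoff (i/m)q, so a hypothesis with rank below k is rejected even if its own p-value exceeds (i/m)q, as long as some larger p-value clears its threshold.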
FDR control of the linear step-up procedure (BH)
Suppose m_0 ≤ m of the hypotheses are true.
If the test statistics are independent, or positive dependent (e.g. positively correlated, normally distributed):

    FDR ≤ (m_0 / m) q ≤ q

In general:

    FDR ≤ (m_0 / m) q (1 + 1/2 + 1/3 + … + 1/m) ≈ (m_0 / m) q log(m)
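The first bound can be checked by Monte Carlo. A self-contained sketch (all parameters - m = 200 one-sided z-tests with m_0 = 150 true nulls, alternatives at mean 3, q = 0.05, 2000 replications - are arbitrary choices of mine, and the step-up rule is re-implemented inline so the snippet stands alone); the average of Q = V / max(R, 1) should come out near (m_0/m) q = 0.0375, safely below q:

```python
import math
import random

def bh_count(pvals, n_true_null, q=0.05):
    """Run the BH step-up rule; hypotheses 0..n_true_null-1 are true nulls.
    Returns (V, R): false discoveries and total discoveries."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    rejected = order[:k]
    v = sum(1 for i in rejected if i < n_true_null)
    return v, len(rejected)

def one_sided_p(z):
    # P(Z >= z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))

random.seed(1)
m, m0, q, reps = 200, 150, 0.05, 2000
q_sum = 0.0
for _ in range(reps):
    # True nulls: z ~ N(0,1); false nulls: z ~ N(3,1), giving small p-values
    z = [random.gauss(0, 1) for _ in range(m0)] + \
        [random.gauss(3, 1) for _ in range(m - m0)]
    v, r = bh_count([one_sided_p(zi) for zi in z], m0, q)
    q_sum += v / max(r, 1)
fdr_hat = q_sum / reps
print(round(fdr_hat, 4))   # should sit near (m0/m) * q = 0.0375, below q
```

Under independence with continuous test statistics the BH rule attains (m_0/m) q exactly, which is why the simulated average lands near 0.0375 rather than at the nominal 0.05.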