Controversy Over the Significance Test Controversy


Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Controversy Over the Significance Test Controversy


Philosophy of Science Association Biennial Meeting, November 4, 2016. Deborah G. Mayo (Virginia Tech)

Science is in Crisis! Once high-profile failures of replication went beyond the social sciences to genomics and bioinformatics, people started to worry about scientific credibility: replication research, methodological activism, fraudbusting, statistical forensics.


Methodological Reforms without philosophy of statistics are blind

Proposed methodological reforms are being adopted: many welcome (preregistration), some quite radical. Without a better understanding of the philosophical, statistical, and historical issues, many are likely to fail.


American Statistical Association (ASA): Statement on P-values

"The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions … much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as to ban P-values" (ASA 2016)


2015: ASA brought members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values.

I was a philosophical observer at the ASA P-value pow wow


Don't throw out the error control baby with the bad statistics bathwater (The American Statistician)


Error Statistics
Statistics: collection, modeling, drawing inferences from data to claims about aspects of processes.
The inference may be in error.
It's qualified by a claim about the method's capabilities to control and alert us to erroneous interpretations (error probabilities).
Significance tests (R.A. Fisher) are a small part of an error statistical methodology.


P-value: to test the conformity of the particular data under analysis with H0 in some respect:
we find a function T = t(y) of the data, to be called the test statistic, such that the larger the value of T the more inconsistent are the data with H0;
the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true;
the p-value corresponding to any t_obs is p = Pr(T ≥ t_obs; H0). (Mayo and Cox 2006, p. 81)
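As a minimal sketch (not from the slides; it assumes a test statistic that is standard Normal under H0, and the function name is illustrative), the Mayo and Cox definition can be computed directly:

```python
from math import erfc, sqrt

def one_sided_p_value(t_obs: float) -> float:
    """p = Pr(T >= t_obs; H0) for a test statistic T ~ N(0, 1) under H0,
    computed from the standard normal survival function via the
    complementary error function."""
    return 0.5 * erfc(t_obs / sqrt(2))

print(round(one_sided_p_value(1.96), 3))  # 0.025: data fairly inconsistent with H0
print(round(one_sided_p_value(0.50), 3))  # 0.309: scarcely any incompatibility
```

Larger observed statistics map to smaller p-values, which is the sense in which the data are "more inconsistent" with H0.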


Testing Reasoning
If even larger differences than t_obs occur fairly frequently under H0 (the P-value is not small), there's scarcely evidence of incompatibility with H0.
A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than t_obs were H0 true.
This indication isn't evidence of a genuine statistical effect H, let alone a scientific conclusion H*.
Stat-Sub fallacy: H => H*


Neyman-Pearson (N-P) tests: a null and alternative hypothesis H0, H1 that are exhaustive, e.g., H0: μ ≤ 12 vs H1: μ > 12

So this fallacy of rejection, H => H*, is impossible: rejecting the null only indicates statistical alternatives (how discrepant from the null).


I'm not keen to defend many uses of significance tests long lampooned.
I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely tested.
The criticisms are often based on misunderstandings; consequently, so are many reforms.


A paradox for significance test critics. Critic: It's much too easy to get small P-values.

You: Why do they find it so difficult to replicate the small P-values in published reports?

Is it easy or is it hard?


Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration on replication in psychology.

OSC, Reproducibility Project: Psychology, 2011-15 (Science 2015): crowd-sourced effort to replicate 100 articles (led by Brian Nosek, University of Virginia)


R.A. Fisher: it's easy to lie with statistics by selective reporting (a "political principle"). Sufficient finagling (cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere) may practically guarantee a preferred claim H gets support, even if it's unwarranted by evidence (verification fallacy). (Biasing selection effects; the need to adjust P-values.) Note: rejecting a null is taken as support for some non-null claim H.

You report: such results would be difficult to achieve under the assumption of H0. When in fact such results are common under the assumption of H0.

The ASA (p. 131) correctly warns that [c]onducting multiple analyses of the data and reporting only those with certain p-values leads to spurious p-values (Principle 4)

You say Pr(P-value ≤ P_obs; H0) = P_obs (small). But in fact Pr(P-value ≤ P_obs; H0) = high.*

*Note P-values measure distance from H0 in reverse
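A hedged simulation of this point (not in the original; the function name and the parameters 20 tests, 30 observations each are invented for illustration): when a "study" hunts across 20 independent tests of true nulls and reports only the smallest P-value, reportable significance is common under H0, at roughly the rate 1 − 0.95^20 ≈ 0.64 rather than the nominal 0.05:

```python
import random
from math import erfc, sqrt

def min_p_value(n_tests: int, n_per_test: int, rng: random.Random) -> float:
    """Smallest one-sided p-value among n_tests independent z-tests,
    with every sample drawn under H0 (true mean 0)."""
    best = 1.0
    for _ in range(n_tests):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_per_test)]
        z = sum(xs) / sqrt(n_per_test)           # z-statistic, N(0, 1) under H0
        best = min(best, 0.5 * erfc(z / sqrt(2)))  # one-sided p-value
    return best

rng = random.Random(0)
trials = 2000
# Fraction of "studies" (each hunting over 20 tests) that can report p <= 0.05
rate = sum(min_p_value(20, 30, rng) <= 0.05 for _ in range(trials)) / trials
print(round(rate, 2))  # near 1 - 0.95**20, i.e. about 0.64
```

This is the sense in which the reported "such results would be difficult to achieve under H0" is false once selection effects are ignored.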


Minimal (Severity) Requirement for evidence

If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H ("too cheap to be worth having", Popper). Such a test fails a minimal requirement for a stringent or severe test. My account: severe testing based on error statistics (requires reinterpreting tests).
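As a sketch of how severity reinterprets a rejection in the one-sided Normal test mentioned earlier (H0: μ ≤ 12 vs H1: μ > 12; the values σ = 2, n = 100, and observed mean 12.4 are illustrative assumptions, not from the slides):

```python
from math import erf, sqrt

def severity_mu_greater(x_bar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity for the claim mu > mu1 after a statistically significant
    sample mean x_bar in a one-sided Normal test: the probability a *less*
    impressive result would have occurred were mu only mu1, i.e.
    Pr(X_bar <= x_bar; mu = mu1)."""
    z = (x_bar - mu1) * sqrt(n) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2)))

# An observed mean of 12.4 (sigma = 2, n = 100) severely passes mu > 12 ...
print(round(severity_mu_greater(12.4, 12.0, 2.0, 100), 3))  # 0.977
# ... but warrants the larger discrepancy mu > 12.4 only to degree 0.5
print(round(severity_mu_greater(12.4, 12.4, 2.0, 100), 3))  # 0.5
```

The same rejection thus licenses some discrepancies from the null and not others, which is the reformulation in terms of severely tested effect sizes.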


Alters role of probability: typically just two.
Probabilism: to assign a degree of probability, confirmation, support, or belief in a hypothesis, given data x0 (e.g., Bayesian, likelihoodist), with regard for inner coherency.
Performance: ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson).


What happened to using probability to assess error probing capacity and severity?

Neither probabilism nor performance directly captures it Good long-run performance is a necessary, not a sufficient, condition for severity


A claim H is not warranted _______

Probabilism: unless H is true or probable (or gets a probability boost, is made comparatively firmer)

Performance: unless it stems from a method with low long-run error

Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about H



Problems with selective reporting, cherry picking, stopping when the data look good, and P-hacking are not problems about long runs. It's that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data.


If you assume probabilism, error probabilities are relevant for inference only by misinterpretation. False! They play a key role in appraising well-testedness. It's crucial to be able to say: H is believable or plausible, but this is a poor test of it. With this in mind, consider a continuation of the paradox of replication.


Critic: It's too easy to satisfy standard significance thresholds.
You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)?
Critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs).
You: So, the replication researchers want methods that pick up on, adjust, and block these biasing selection effects.
Critic: Actually, reforms recommend methods where the need to alter P-values due to data dredging vanishes.

Likelihood Principle (LP)
The vanishing act links to a pivotal disagreement in the philosophy of statistics battles. In probabilisms (Bayes factors, posteriors), the import of the data is via the ratios of likelihoods of hypotheses, P(x0; H1)/P(x0; H0), for x0 fixed. They condition on the actual data; error probabilities take into account other outcomes that could have occurred but did not (the sampling distribution).


All error probabilities violate the LP (even without selection effects):

"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function], something that is irrelevant in Bayesian inference, namely the sample space." (Lindley 1971, p. 436)
"The LP implies … the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects." (Rosenkrantz 1977, p. 122)
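The sample-space dependence can be made concrete with a hedged sketch (the stopping rule and parameters are invented for illustration): optional stopping, "try and try again", leaves every likelihood untouched, yet inflates the actual Type I error rate well past the nominal 0.05, which is exactly the information the LP deems irrelevant:

```python
import random
from math import erfc, sqrt

def rejects_under_optional_stopping(max_n: int, rng: random.Random) -> bool:
    """Draw observations under H0 (true mean 0), computing a one-sided
    z-test p-value after each one, and stop with 'significance' as soon
    as p <= 0.05 (or give up after max_n looks)."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        p = 0.5 * erfc((total / sqrt(n)) / sqrt(2))
        if p <= 0.05:
            return True
    return False

rng = random.Random(0)
trials = 2000
rate = sum(rejects_under_optional_stopping(100, rng) for _ in range(trials)) / trials
print(round(rate, 2))  # well above the nominal 0.05
```

A frequentist must adjust for the peeking; a method obeying the LP sees no difference between this and a fixed-n design.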

Today's meta-research is not free of philosophy of statistics. "Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value. But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of objectivity that is often made for the P-value." (Goodman 1999, p. 1010) (To his credit, he's open about this; he heads the Meta-Research Innovation Center at Stanford.)


Sum-up so far: The main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, P-hacking. These are picked up by a concern for performance or severity (but violated in abuses of tests). Reforms based on probabilisms enable rather than check unreliable results due to biasing selection effects: "Bayes factors can be used in the complete absence of a sampling plan" (Bayarri, Benjamin, Berger, Sellke 2016). Probabilists may find other ways to block bad inferences: background beliefs (for the discussion).


A few remarks on interconnected issues that cry out for philosophical insight.

1. Replication research
Aims to use significance tests correctly: preregistered, avoiding P-hacking, designed to have high power. Free of the perverse incentives of usual research: guaranteed to be published.


Repligate
Replication research has pushback: some call it "methodological terrorism" (enforcing good science or bullying?). I'm (largely) on the pro-replication side, but they need to go further.

