D. Mayo: Replication Research Under an Error Statistical Philosophy


Page 1:

Replication Research Under an Error Statistical Philosophy

Deborah Mayo

Around a year ago on my blog: “There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”

Philosopher’s talk: I see a rich source of problems that cry out for ministrations of philosophers of science and of statistics

Page 2:

Three main philosophical tasks:
#1 Clarify concepts and presuppositions
#2 Reveal inconsistencies, puzzles, tensions (“ironies”)
#3 Solve problems, improve on methodology

• Philosophers usually stop with the first two, but I think going on to solve problems is important. This presentation is ‘programmatic’: what might replication research under an error statistical philosophy be? My interest grew thanks to Caitlin Parker, whose MA thesis was on the topic.

Page 3:

Example of a conceptual clarification (#1)

Editors of the journal Basic and Applied Social Psychology announced they are banning statistical hypothesis testing because it is “invalid”. It’s invalid, they say, because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks 2015).

• Since the methodology of testing explicitly rejects the mode of inference these tests don’t supply, it would be incorrect to claim the methods are invalid on those grounds.

• A simple conceptual job that philosophers are good at.

Page 4:

Example of revealing inconsistencies and tensions (#2)

Critic: It’s too easy to satisfy standard significance thresholds.
You: Why do replicationists find it so hard to achieve significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs.
You: So the replication researchers want methods that pick up on and block these biasing selection effects.
Critic: Actually, the “reforms” recommend methods where selection effects and data dredging make no difference.

Page 5:

Whether this can be resolved or not is separate.
• We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility.
• As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling.

The philosopher is the curmudgeon (takes chutzpah!). I’ll give examples of #1 clarifying terms, #2 inconsistencies, and #3 proposed solutions (though I won’t always number them).

Page 6:

Demarcation: Bad Methodology/Bad Statistics

• A lot of the recent attention grew out of the case of Diederik Stapel, the social psychologist who fabricated his data.
• Kahneman in 2012: “I see a train-wreck looming,” urging a “daisy chain” of replication.
• The Stapel investigators (2012 Tilburg Report, “Flawed Science”) do a good job of characterizing pseudoscience.
• Philosophers tend to have cold feet when it comes to saying anything general about science versus pseudoscience.

Page 7:

Items in their list of “dirty laundry” include:

“An experiment fails to yield the expected statistically significant results. The experimenters try and try again until they find something (multiple testing, multiple modeling, post-data search of endpoints or subgroups), and the only experiment subsequently reported is the one that did yield the expected results.”

“…continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts.” (Report, 48)

They walked into a “culture of verification bias”.

Page 8:

Bad Statistics

Severity Requirement: If data x0 agree with a hypothesis H, but the test procedure had little or no capability, i.e., little or no probability, of finding flaws with H (even if H is incorrect), then x0 provide poor evidence for H.

Such a test we would say fails a minimal requirement for a stringent or severe test.

• This seems utterly uncontroversial.

Page 9:

• Methods that scrutinize a test’s capabilities, according to their severity, I call error statistical.

• Existing error probabilities (confidence levels, significance levels) may, but need not, provide severity assessments.

• A new name is needed: “frequentist,” “sampling theory,” “Fisherian,” “Neyman-Pearsonian” are too associated with hard-line views and personality conflicts (“It’s the methods, stupid”).

(example of new solutions #3)

Page 10:

Are philosophies about science relevant? One of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. (p. 57)  

Page 11:

A critic might protest: “There’s nothing philosophical about my criticism of significance tests: a small p-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis that the observed difference is mere chance.”

Really? P-values are not intended to be used this way; presupposing they should be stems from a conception of the role of probability in statistical inference, and that conception is philosophical. (Of course, criticizing tests merely because they might be misinterpreted is silly.)

Page 12:

Two main views of the role of probability in inference

Probabilism: To provide a post-data assignment of degree of probability, confirmation, support, or belief in a hypothesis, absolute or comparative, given data x0.

Performance: To ensure long-run reliability of methods, coverage probabilities; to control the relative frequency of erroneous inferences in a long-run series of trials.

What happened to the goal of scrutinizing bad science by the severity criterion?

Page 13:

• Neither “probabilism” nor “performance” directly captures it.

• Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.

• The problems with selective reporting, multiple testing, and stopping when the data look good are not problems about long runs.

• It’s that we cannot say, about the case at hand, that it has done a good job of avoiding the sources of misinterpretation.

Page 14:

• Probabilism says H is not justified unless it’s true or probable (made firmer).

• Error statistics (probativism) says H is not justified unless something (a good job) has been done to probe the ways we can be wrong about H.

• If it’s assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!

• Error probabilities have a crucial role in appraising well-testedness (a new philosophy for probability, #3).

• Both H and not-H can be poorly tested, so a severe-testing assessment violates probability.

Page 15:

Understanding the Replication Crisis Requires Understanding How It Intermingles with PhilStat Controversies

• It’s not that I’m keen to defend many common uses of significance tests.

• It’s just that the criticisms (in psychology and elsewhere) are based on serious misunderstandings of the nature and role of these methods; consequently, so are many “reforms”.

• How can you be sure the reforms are better if you might be mistaken about existing methods?

Page 16:

Criticisms concern a kind of Fisherian significance test:

(i) Sample: Let the sample be X = (X1, …, Xn), n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ.

(ii) A null hypothesis H0: µ = 0 (Δ: µT − µC = 0).

(iii) Test statistic: a function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0. The larger d(x0), the further the outcome from what’s expected under H0, with respect to the particular question.

(iv) Sampling distribution of the test statistic d(X).

Page 17:

The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:

p(x0) = Pr(d(X) > d(x0); H0).

If p(x0) is sufficiently small, there’s an indication of discrepancy from the null. (Even Fisher had implicit alternatives, by the way.)
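As a concrete sketch (my illustration, not part of the slides), the p-value for the one-sample setup above, with known σ and H0: µ = 0, using d(X) = √n·x̄/σ, in Python:

import numpy as np
from scipy.stats import norm

def p_value(x, sigma):
    # One-sided p-value Pr(d(X) > d(x0); H0) for H0: mu = 0, known sigma.
    n = len(x)
    d_obs = np.sqrt(n) * np.mean(x) / sigma   # observed test statistic d(x0)
    return norm.sf(d_obs)                     # Pr(Z > d_obs) under H0

rng = np.random.default_rng(1)
x0 = rng.normal(loc=0.2, scale=1.0, size=100)  # simulated data with a real discrepancy
print(p_value(x0, sigma=1.0))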

Page 18:

P-value reasoning: from high capacity to curb enthusiasm

If the hypothesis H0 is correct then, with high probability, 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.

That merely indicates some discrepancy!

Page 19:

A genuine experimental effect is needed

“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1935, 14)

(low P-value ≠> H: statistical effect)

“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter... requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.” (Gigerenzer 1989, 95-6)

(H ≠> H*)

Page 20:

Still, simple Fisherian tests have important uses:

• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding if data are consistent with a model

Gelman and Shalizi (a meeting of minds between a Bayesian and an error statistician): “What we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data, rather than entering into a contest with some alternative model.” (p. 20)

Page 21:

Fallacy of rejection (H –> H*): erroneously taking statistical significance as evidence of the research hypothesis H*.

The fallacy is explicated by severity: flaws in the alternative H* have not been probed by the test; the inference from a statistically significant result to H* fails to pass with severity.

Merely refuting the null hypothesis is too weak to corroborate substantive H*: “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called ‘a highly improbable coincidence.’” (Meehl and Waller 2002, 184)

(Meehl was wrong to blame Fisher)

Page 22:

NHSTs are pseudostatistical: why do psychologists speak of NHSTs, tests that supposedly allow moving from the statistical to the substantive? So defined, they exist only as abuses of tests: they exist as something you’re never supposed to do.

Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher’s tests with explicit alternatives.

Page 23:

Neyman-Pearson (N-P) tests: a null and an alternative hypothesis, H0 and H1, that exhaust the parameter space.

So the fallacy of rejection H –> H* is impossible (rejecting the null only indicates statistical alternatives).

This scotches criticisms that P-values consider the sampling distribution only under the null.

Example, test T+: the sampling distribution of d(X) under the null and alternatives.
H0: µ ≤ µ0 vs. H1: µ > µ0
If d(x0) > cα, “reject” H0;
if d(x0) ≤ cα, “do not reject” or “accept” H0;
e.g., cα = 1.96 for α = .025.
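A sketch of the T+ decision rule in Python (illustrative names; known σ assumed):

import numpy as np
from scipy.stats import norm

def test_T_plus(x, mu0, sigma, alpha=0.025):
    # N-P test T+: H0: mu <= mu0 vs. H1: mu > mu0, known sigma.
    n = len(x)
    d_obs = np.sqrt(n) * (np.mean(x) - mu0) / sigma
    c_alpha = norm.isf(alpha)                 # cutoff: 1.96 for alpha = .025
    return d_obs, ("reject H0" if d_obs > c_alpha else "do not reject H0")

rng = np.random.default_rng(3)
print(test_T_plus(rng.normal(0.3, 1.0, size=100), mu0=0.0, sigma=1.0))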

Page 24:

The sampling distribution yields error probabilities:

Probability of a Type I error = P(d(X) > cα; H0) ≤ α.
Probability of a Type II error = P(d(X) ≤ cα; µ1) = β(µ1), for any µ1 > µ0.

The complement of the Type II error probability is the power against µ1:

POW(µ1) = P(d(X) > cα; µ1)

Even without “best” tests, there are “good” tests.
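Continuing the sketch (same assumed setup), the power function of T+ in Python:

import numpy as np
from scipy.stats import norm

def power(mu1, mu0, sigma, n, alpha=0.025):
    # POW(mu1) = Pr(d(X) > c_alpha; mu1); under mu1, d(X) has mean sqrt(n)(mu1 - mu0)/sigma.
    c_alpha = norm.isf(alpha)
    shift = np.sqrt(n) * (mu1 - mu0) / sigma
    return norm.sf(c_alpha - shift)

print(power(mu1=0.3, mu0=0.0, sigma=1.0, n=100))  # roughly .85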

Page 25:

N-P test in terms of the P-value: reject H0 iff P-value < .025.

• Even N-P report the attained significance level or P-value (Lehmann).

• “Reject/do not reject” are uninterpreted parts of the mathematical apparatus.

• “Reject” could be: “declare statistically significant at the p-level”.

• “The tests… must be used with discretion and understanding” (N-P 1928, p. 58).

(“it’s the methods, stupid”)

Page 26:

Why “inductive behavior”?

N-P justify tests (and confidence intervals) by performance: control of long-run error and coverage probabilities.

They called this inductive behavior. Why?

• They were reaching conclusions beyond the data (inductive).

• If inductive inference is probabilist, then they needed a new term.

In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we’d act rather than what we believe.

(I’m not knocking performance, but error probabilities also serve for particular inferences: evidential.)

Page 27:

N-P tests can still commit a type of fallacy of rejection: inferring a discrepancy beyond what’s warranted, especially with n sufficiently large (the large-n problem).

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2).

What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn’t go off unless the house is fully ablaze? (The larger sample size is like the alarm that goes off with burnt toast.)
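A worked illustration of the point (assumed numbers: σ = 1, µ0 = 0): a just-significant z = 1.96 corresponds to an observed mean x̄ = 1.96σ/√n, i.e., x̄ ≈ 0.196 when n = 100 but only x̄ ≈ 0.0196 when n = 10,000. The same significance level thus “goes off” at a tenfold smaller discrepancy with the larger sample, like the alarm triggered by burnt toast.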

   

Page 28:

Fallacy of non-significant results: insensitive tests

• Negative results may not warrant a 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed.

Similar to Cohen’s power analysis, but sensitive to the outcome (the P-value distribution) (#3)

• I hear some replicationists say negative results are uninformative: not so (#2, ironies).

There’s no point in running replication research if your account views negative results as uninformative.

Page 29:

Error statistics gives an evidential interpretation to tests (#3): use results to infer discrepancies from a null that are well ruled out, and those that are not.

I’d never just report a P-value.

Mayo (1996); Mayo and Cox (2010): Frequentist Principle of Evidence (FEV)

Mayo and Spanos (2006): SEV

Page 30:

One-sided test T+: H0: µ ≤ µ0 vs. H1: µ > µ0

d(x) is statistically significant (set lower bounds):

(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x) is a good indication that µ > µ0 + γ.

(ii) If the test had little (or even moderate) capacity (e.g., < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x) is a poor indication that µ > µ0 + γ.

(If an even more impressive result is probable due to guppies, it’s not a good indication of a great whale.)
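A sketch of this assessment in Python (illustrative numbers: µ0 = 0, σ = 1, n = 100; SEV(µ > µ0 + γ) = Pr(d(X) ≤ d(x0); µ = µ0 + γ)):

import numpy as np
from scipy.stats import norm

def sev_greater(xbar, gamma, mu0=0.0, sigma=1.0, n=100):
    # Probability of a less significant result were mu no bigger than mu0 + gamma.
    return norm.cdf(np.sqrt(n) * (xbar - (mu0 + gamma)) / sigma)

xbar = 0.25                       # observed mean: z = 2.5, statistically significant
for gamma in (0.0, 0.1, 0.2, 0.3):
    print(gamma, round(sev_greater(xbar, gamma), 3))
# gamma = 0.0 -> .994  (mu > 0 is well indicated)
# gamma = 0.2 -> .691
# gamma = 0.3 -> .309  (mu > 0.3 is poorly indicated: guppies, not a whale)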

Page 31:

d(x) is not statistically significant (set upper bounds):

(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a good indication that µ ≤ µ0 + γ.

(ii) If the test had a low probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a poor indication that µ ≤ µ0 + γ (too insensitive to rule out discrepancy γ).

If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can’t detect (e.g., risks of concern).
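The companion sketch for negative results (same assumed setup; SEV(µ ≤ µ0 + γ) = Pr(d(X) > d(x0); µ = µ0 + γ)):

import numpy as np
from scipy.stats import norm

def sev_leq(xbar, gamma, mu0=0.0, sigma=1.0, n=100):
    # Probability of a more significant difference had the discrepancy exceeded gamma.
    return norm.sf(np.sqrt(n) * (xbar - (mu0 + gamma)) / sigma)

xbar = 0.1                        # observed mean: z = 1.0, not significant
for gamma in (0.1, 0.2, 0.3):
    print(gamma, round(sev_leq(xbar, gamma), 3))
# gamma = 0.1 -> .500  (cannot rule out a discrepancy of 0.1)
# gamma = 0.2 -> .841
# gamma = 0.3 -> .977  (discrepancies as large as 0.3 are well ruled out)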

Page 32:

Confidence intervals also require supplementing

Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level.

• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection).
• Justified in terms of long-run coverage (performance).
• All members of the CI treated on a par.
• Fixed confidence level (SEV needs several benchmarks).
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking the assumptions of statistical models.

 

Page 33:

The evidential interpretation is crucial, but error probabilities can be violated by selection effects (and by violated model assumptions).

One function of severity is to identify which selection effects are problematic (not all are) (#3).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.

   

Page 34:

Nominal vs. actual significance levels

“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ … The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, p. 104)

• They were clear on the fallacy: blurring the “computed” or “nominal” significance level and the “actual” level.

• There are many more ways you can be wrong with hunting (a different sample space).
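Where the 64 percent comes from (assuming twenty independent tests, each at the .05 level, with all nulls true): the probability that at least one reaches nominal significance is

1 − (1 − .05)^20 = 1 − (.95)^20 ≈ 1 − .36 = .64.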

     

Page 35:

This is a genuine example of an invalid or unsound method.

You report: such results would be difficult to achieve under the assumption of H0, when in fact such results are common under the assumption of H0.

Formally: you say Pr(P-value < pobs; H0) ~ α (small), but in fact Pr(P-value < pobs; H0) is high, if not guaranteed.

• Nowadays, we’re likely to see the tests blamed for permitting such misuses (instead of the testers).

• Worse are those accounts where the abuse vanishes!

Page 36:

What defies scientific sense?

On some views, biasing selection effects are irrelevant…

Stephen Goodman (epidemiologist): “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value.” (1999, p. 1010)

Page 37:

Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves. In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, Bayes factors (inference is comparative, the data favor this over that; is that even inference?)

Page 38:

All error probabilities violate the LP (even without selection effects):

“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436)

The information is just a matter of our “intentions”:

“The LP implies… the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz 1977, 122)

Page 39:

Many current reforms are probabilist

Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or to just lower the P-value (so that the maximally likely alternative gets a .95 posterior), while ignoring biasing selection effects, will fail. The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.

With one big difference: your direct basis for criticism and possible adjustments has just vanished. (lots of #2 inconsistencies)
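A small simulation of the optional-stopping worry (my sketch, not from the slides): peek after every observation and stop as soon as z > 1.96. Under a true null, the nominal .025-level test rejects far more often than .025:

import numpy as np

rng = np.random.default_rng(7)
rejections, trials = 0, 2000
for _ in range(trials):
    total = 0.0
    for n in range(1, 501):              # up to 500 looks at the data
        total += rng.normal(0.0, 1.0)    # H0 true: mu = 0, sigma = 1
        if total / np.sqrt(n) > 1.96:    # z statistic after n observations
            rejections += 1
            break
print(rejections / trials)               # well above the nominal .025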

Page 40:

How might probabilists block intuitively unwarranted inferences? (Consider first the subjective Bayesian.)

When we hear there’s statistical evidence for some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences), some probabilists claim: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn’t be fooled.

We know these things are unbelievable, a subjective Bayesian might say. That could work in some cases (though it still wouldn’t show what researchers had done wrong): a battle of beliefs.

Page 41:

It wouldn’t help with our most important problem:

• How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, registered results and precautions)?

So now you’ve got two sources of flexibility, priors and biasing selection effects (which can no longer be criticized). Besides, researchers really do believe their hypotheses.

Page 42:

Diederik Stapel says he always read the research literature extensively to generate his hypotheses.

“So that it was believable and could be argued that this was the only logical thing you would find.” (E.g., eating meat causes aggression.) (In “The Mind of a Con Man,” NY Times, April 26, 2013)

Page 43:

Conventional Bayesians

The most popular probabilisms these days are “non-subjective” (reference, default), or conventional, designed to prevent prior beliefs from influencing the posteriors:

“The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities… .” (Cox and Mayo 2010, p. 299)

How might they avoid too-easy rejections of a null?

Page 44:

Cult of the Holy Spike

Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This “spiked concentration of belief in the null” is at odds with the prevailing view that “we know all nulls are false” (#2).

Bottom line: by convenient choices of priors and alternatives, statistically significant differences can become evidence for the null.

The conflict often considers the two-sided test H0: µ = 0 versus H1: µ ≠ 0.

Page 45:

Posterior probabilities of H0, by sample size n:

p       z        n=50    n=100   n=1000
.10     1.645    .65     .72     .89
.05     1.960    .52     .60     .82
.01     2.576    .22     .27     .53
.001    3.291    .034    .045    .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior probability of the null of .82!

From Berger and Sellke (1987), based on a Jeffreys prior.
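A sketch reproducing these numbers (assuming the spike-and-smear prior: P(H0) = .5 on µ = 0, with µ ~ N(0, σ²) under H1):

import numpy as np

def posterior_null(z, n):
    # Bayes factor for H0 vs. H1 under the assumed prior; posterior with prior odds 1.
    b01 = np.sqrt(n + 1) * np.exp(-0.5 * z**2 * n / (n + 1))
    return b01 / (1 + b01)

for n in (50, 100, 1000):
    print(n, round(posterior_null(1.96, n), 2))   # .52, .60, .82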

Page 46:

• With a z = 1.96 difference, the 95% CI (two-sided), or the .975 one-sided CI, excludes the null (0) from the interval.

• Severity reasoning: were H0 true, the probability of getting d(X) < dobs is high (~.975), so SEV(µ > 0) ~ .975.

• But they give P(H0 | z = 1.96) = .82.

• Error statistical critique: there’s a high probability that they give a posterior probability of .82 to H0: µ = 0 erroneously.

• The onus is on probabilists to show that a high posterior for H constitutes having passed a good test.

Page 47:

Informal and Quasi-Formal Severity: H –> H*

• Error statisticians avoid the fallacy of going directly from a statistical to a research hypothesis H*.

• Can we say nothing about this link?

• I think we can and must, and informal severity assessments are relevant (#3).

I will not discuss straw-man studies (“chump effects”).

This is believable: men react more negatively to the success of their partners than to their failures (compared to women).

Studies have shown: H: partner’s success lowers self-esteem in men.

Page 48:

Macho Men

H*: partner’s success lowers self-esteem in men.

I have no doubts that certain types of men feel threatened by the success of their female partners, wives, or girlfriends. I’ve even known a few.

Can this be studied in the lab? Ratliff and Oishi (2013) did:

H*: “men’s implicit self-esteem is lower when a partner succeeds than when a partner fails.”

Not so for women. Their example does a good job, given the standards in place.

Page 49:

Treatments: subjects are randomly assigned to five “treatments”: think and write about a time your partner succeeded; failed; succeeded when you failed (partner beats me); failed when you succeeded (I beat partner); or a typical day (control).

Effects: a measure of “self-esteem”.
Explicit: “How do you feel about yourself?”
Implicit: a test of word associations with “me” versus “other”.

None showed statistical significance in explicit self-esteem, so consider just the implicit measures.

   

Page 50:

Some null hypotheses: the average self-esteem score is no different (these are statistical hypotheses)
a) when partner succeeds (rather than failing)
b) when partner beats (surpasses) me or I beat her
c) control: when she succeeds, fails, or it’s a regular day

There are at least double this number, given self-esteem could be “explicit” or “implicit” (others too, e.g., the area of success).

Only null (a) was rejected statistically!

Should they have taken the research hypothesis as disconfirmed by negative cases? Or as casting doubt on their test?

Page 51:

Or should they just focus on the null hypotheses that were rejected, in particular null (a), for implicit self-esteem?

They opt for the third. It’s not that they should have regarded their research hypothesis H* as disconfirmed, much less falsified. This is precisely the nub of the problem! I’m saying the hypothesis that the study isn’t well run needs to be considered:

• Is the artificial writing assignment sufficiently relevant to the phenomenon of interest? (look at proxy variables)

• Is the measure of implicit self-esteem (word associations) a valid measure of the effect? (measurements of effects)

Page 52:

Take null hypothesis b): the average self-esteem score is no different when partner beats (surpasses) me or I beat her.

Clearly they expected “she beat me in X” to have a greater negative impact on self-esteem than “she succeeded at X”.

Still, they could view it as lending “some support to the idea that men interpret ‘my partner is successful’ as ‘my partner is more successful than me’” (p. 698), …as the authors do.

That is, any success of hers is always construed by Macho Man as: she beat me.

Page 53:

Bending Over Backwards

For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of “self-sealing fallacy”.

I want to be clear that this is not a criticism of them given existing standards:

“I’m talking about a specific, extra type of integrity... bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist.” (R. Feynman 1974)

I’m describing what’s needed to show one is “sincerely trying to find flaws” under the austere account I recommend.

The most interesting information was never reported! Perhaps it was never even looked at: what they wrote about.

Page 54:

Conclusion: Replication Research in Psychology Under an Error Statistical Philosophy

Replication problems can’t be solved without correctly understanding their sources. The biggest sources of problems in replication crises are:
(a) statistical H –> research H*, and
(b) biasing selection effects.

Reasons for (a): a focus on P-values and Fisherian tests, ignoring N-P tests (and the illicit NHST that goes directly H –> H*).

Page 55:

Another reason is a false dilemma: probabilism or long-run performance, plus assuming that N-P can only give the latter. I argue for a third use of probability: rather than report on believability, researchers need to report the properties of the methods they used:

What was their capacity to have identified, avoided, admitted bias?

What’s wanted is not a high posterior probability in H (however construed), but a high probability that the procedure would have unearthed flaws in H (a reinterpretation of N-P methods).

Page 56:

What’s replicable? Discrepancies that are severely warranted.

Reasons for (b) [embracing accounts that formally ignore selection effects]: accepting probabilisms that embrace the likelihood principle (LP).

There’s no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.

Page 57:

Informal assessments of probativeness are needed to scrutinize statistical inferences in relation to research hypotheses (H –> H*). One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.).

The scientific status of an inquiry is questionable if it cannot or will not distinguish the correctness of inferences from problems stemming from a poorly run study.

If ordinary research reports adopted the Feynman “bending over backwards” scrutiny, the interpretation of replication efforts would be more informative (or replications would perhaps not even be needed).

Page 58:

REFERENCES

Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). “Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer.” Journal of Clinical Oncology 26(7): 1186-1187.

Bartlett, T. (2012). “Daniel Kahneman Sees ‘Train-Wreck Looming’ for Social Psychology”. Chronicle of Higher Education blog (Oct. 4, 2012), with links to the email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338.

Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385-402.

Berger, J. O. & Sellke, T. (1987). “Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion).” Journal of the American Statistical Association 82(397): 112-122.

Bhattacharjee, Y. (2013). “The Mind of a Con Man”. The New York Times Magazine (4/28/2013), p. 44.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Page 59:

Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). “Microarrays: retracing steps.” Nature Medicine 13(11): 1276-7.

Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

Cox, D. R. & Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.

Diaconis, P. (1978). “Statistical Problems in ESP Research”. Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)

Dienes, Z. (2011). “Bayesian versus Orthodox Statistics: Which Side Are You On?” Perspectives on Psychological Science 6(3): 274-290.

Feynman, R. (1974). “Cargo Cult Science.” Caltech commencement speech.

Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.

Page 60:

Gelman, A. (2011). “Induction and Deduction in Bayesian Data Analysis.” Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science), edited by Deborah G. Mayo, Aris Spanos, and Kent W. Staley: 67-78.

Gelman, A. & Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics.” British Journal of Mathematical and Statistical Psychology 66(1): 8-38.

Gigerenzer, G. (2000). “The Superego, the Ego, and the Id in Statistical Reasoning.” In Adaptive Thinking: Rationality in the Real World. Oxford: Oxford University Press.

Goodman, S. N. (1999). “Toward evidence-based medical statistics. 2: The Bayes factor.” Annals of Internal Medicine 130: 1005-1013.

Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. 2nd ed. La Salle, IL: Open Court.

Johansson, T. (2010). “Hail the impossible: p-values, evidence, and likelihood.” Scandinavian Journal of Psychology 52: 113-125.

Kruschke, J. K. (2010). “What to believe: Bayesian methods for data analysis”. Trends in Cognitive Sciences 14(7): 297-300.

Lehmann, E. L. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association 88(424): 1242-1249.

Page 61:

Levelt Committee, Noort Committee, Drenth Committee. (2012). “Flawed science: The fraudulent research practices of social psychologist Diederik Stapel”. Stapel Investigation: joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/

Lindley, D. V. (1971). “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.

Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.

Mayo, D. G. & Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. The paper first appeared in The Second Erich L. Lehmann Symposium: Optimality (2006), Lecture Notes-Monograph Series, Vol. 49, Institute of Mathematical Statistics, pp. 247-275.

Page 62:

Mayo, D. G. & Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2): 323-357.

Mayo, D. G. & Spanos, A. (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.

Meehl, P. E. & Waller, N. G. (2002). “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283-300.

Morrison, D. E. & Henkel, R. E. (Eds.). (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

Micheel, C. M., Nass, S. J. & Omenn, G. S. (Eds.), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. National Academies Press.

Neyman, J. (1957). “‘Inductive Behavior’ as a Basic Concept of Philosophy of Science.” Revue de l'Institut International de Statistique / Review of the International Statistical Institute 25(1/3): 7-22.

Page 63:

Neyman, J. & Pearson, E. S. (1928). “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I.” Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, University of California Press, Berkeley, 1967, pp. 1-66.)

Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.

Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). “Genomic signatures to guide the use of chemotherapeutics.” Nature Medicine 12(11): 1294-1300. Epub 2006 Oct. 22.

Potti, A. & Nevins, J. R. (2007). “Reply to Coombes, Wang & Baggerly.” Nature Medicine 13(11): 1277-8.

Ratliff, K. A. & Oishi, S. (2013). “Gender Differences in Implicit Self-Esteem Following a Romantic Partner’s Success or Failure”. Journal of Personality and Social Psychology 105(4): 688-702.

Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

Page 64:

Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.

Savage, L. J. (1964). “The Foundations of Statistics Reconsidered.” In Studies in Subjective Probability, edited by H. Kyburg & H. Smokler, 173-188. New York: John Wiley & Sons.

Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.

Trafimow, D. & Marks, M. (2015). “Editorial”. Basic and Applied Social Psychology 37(1): 1-2.

Wagenmakers, E.-J. (2007). “A Practical Solution to the Pervasive Problems of P Values”. Psychonomic Bulletin & Review 14(5): 779-804.