Introductory Biostatistics Dr Julie Marsh Senior Research Fellow, Telethon Kids Institute 5 June 2020 Research Skills Seminar Series | CAHS Research Education Program Research support, development and governance



Introductory Biostatistics

Dr Julie Marsh Senior Research Fellow, Telethon Kids Institute

5 June 2020

Research Skills Seminar Series | CAHS Research Education Program | Research support, development and governance


health.wa.gov.au/cahs

© CAHS Research Education Program, Department of Child Health Research, Child and Adolescent Health Service, WA 2020

Copyright to this material produced by the CAHS Research Education Program, Department of Child Health Research, Child and Adolescent Health Service, Western Australia, under the provisions of the Copyright Act 1968 (C’wth Australia). Apart from any fair dealing for personal, academic, research or non-commercial use, no part may be reproduced without written permission. The Department of Child Health Research is under no obligation to grant this permission. Please acknowledge the CAHS Research Education Program, Department of Child Health Research, Child and Adolescent Health Service when reproducing or quoting material from this source.


Introductory Biostatistics

Dr Julie Marsh, Telethon Kids Institute

Research Skills Seminar Series | CAHS Research Education Program 

Key learning outcomes

• Define the difference between observational and experimental studies

• Understand the effect of chance, bias and confounding in experimental designs

• Explain the variability in estimating parameters from a sample using a confidence interval

• Interpret results from statistical hypothesis testing using a p‐value

• Understand the importance of objective reporting of experimental studies

Overview

Observational vs Experimental Design
Chance, Bias & Confounding
Sampling Error
Understanding confidence intervals
Interpreting p-values
Guidelines & reporting

Observational vs Experimental Design

Animal and human studies


Statisticians speak in code but they hide it in everyday language.

An observational study draws inferences from a sample to a population, where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

Observational vs Experimental Design

An observational study draws inferences from a sample to a population, where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

Inference is where we infer (guess!) the properties of a population, e.g. we may assume that the heights of the people in this room are normally distributed.

Observational vs Experimental Design

An observational study draws inferences from a sample to a population, where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

A sample is a set of data collected or selected from a population using a defined procedure, e.g. we may choose to measure only the heights of the people in the first 5 rows of the lecture theatre.

Observational vs Experimental Design


An observational study draws inferences from a sample to a population, where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

A statistical population is all members of a defined group, e.g. everyone currently sitting in the lecture theatre.

Observational vs Experimental Design

An observational study draws inferences from a sample to a population, where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

The independent variable is the variable that is changed or controlled in an experimental design, or simply observed in an observational design, e.g. the recording of each individual's sex.

Observational vs Experimental Design

An observational study draws inferences from a sample to a population, where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints.

In observational designs, statistical control often refers to the technique of separating out the effect of one independent variable on the outcome. In experimental designs it usually refers to the use of a comparator group (i.e. placebo or standard care) and assumes all other factors are constant.

Observational vs Experimental Design

Experimental design refers to a plan for assigning units (patients, mice, etc.) to treatments or exposures or interventions. A good design addresses: 

• Association. It ensures that the experimenter can detect an association between the independent variable and the outcome (dependent variable).

• Control. It allows the experimenter to rule out alternative explanations due to the confounding effects of factors other than the independent variable.

• Variability. It minimises variability within the treatment/exposure/intervention groups, which reduces the number of units (eg. patients) we need to study to detect a difference in the outcome between these groups. 

Observational vs Experimental Design


The highest level of evidence in medical research comes from a particular set of experimental designs called randomised controlled trials (RCTs).

In these designs patients are randomly allocated to treatment groups using a computer algorithm.

Observational vs Experimental Design

Randomisation
It is often not possible to control for all factors in an RCT. Randomisation minimises potential bias from factors other than the independent variable and attempts to balance confounding factors by distributing patients equally across the intervention arms.

How can we evaluate if the randomisation has worked?

Observational vs Experimental Design


Generalisability
The results from an RCT have internal validity but can only be generalised to other populations with similar characteristics. These characteristics are determined by the inclusion/exclusion criteria.

External validity is the extent to which the results of a study can be generalized to and across other situations, people, stimuli, and times.

Observational vs Experimental Design

Chance, Bias & Confounding

The critical assessment of research results must rule out or minimize:

Chance     Confounding        Bias

Chance, Bias & Confounding

Chance
Statisticians focus on the likelihood that a result or relationship is caused by something other than random chance.

Bias
Where a systematic error in the design, recruitment, data collection or analysis results in an incorrect estimation of the true effect of the exposure/intervention on the outcome.

Chance, Bias & Confounding

Selection bias: e.g. low mortality rates among farmers (is farming healthy, or are people in poor health less likely to choose farming, or more likely to leave it?)

Recall bias: Individuals diagnosed with disease (cases) more likely to remember exposure than non‐cases 

Efficacy bias: if treatment allocation is known then responses may be distorted by the subject or those responsible for the subject’s management and evaluation. 


Chance, Bias & Confounding

Survival bias: Aggregating data over age categories – an individual can only appear in the highest age category if they survive long enough! (e.g. the Simpson's Paradox example for lung cancer and smoking)

Treatment allocation bias: choice of treatment should not be related to prognostic factors (such as disease severity, age, etc.)

Protocol/Study deviations: e.g. cross-contamination in a school-based intervention programme, or not taking study medication (hence RCTs include both intent-to-treat and per-protocol analyses)

Chance, Bias & Confounding

Confounding Factors

Where the relationship between intervention or exposure and outcome is distorted by another variable.

Chance, Bias & Confounding

More formally: In a study of the effect of variable X on disease Y, a third variable, W, may be a confounder if (a) it is a risk factor for disease Y, and (b) it is associated with (but does not cause) variable X.

Chance, Bias & Confounding

The results from a large systematic review (2013) are presented in the following summary of subgroup analyses (n is the number of eligible studies).

Ding et al. Long-Term Coffee Consumption and Risk of Cardiovascular Disease A Systematic Review and a Dose–Response Meta-Analysis of Prospective Cohort Studies. Circulation (2013).

[Forest plot of subgroup relative risks; axis runs from protective to harmful]


Chance, Bias & Confounding

Language can also be used to distort the evidence from medical research. Objective reporting includes a point estimate and a level of precision, e.g. differences, proportions, odds ratios, relative risks, hazard ratios or counts.

Size of effect is Objective, whereas Strength is Subjective

Chance, Bias & Confounding

Good Clinical Practice (GCP) and Experimental Design guidelines help to minimise bias and confounding.

Hypothesis Testing assesses whether we could have observed the results by chance alone.

Confidence Intervals help us understand the underlying variation in the results.

Reporting Guidelines attempt to minimise subjective reporting.

Sampling Error

“What is the probability that I would have obtained this set of data due to chance alone?”

We cannot process thousands of lab samples or recruit thousands of participants to our study so we need to select a representative sample at random. 


Sampling Error

Hypothesis testing is used to quantify our belief that our current results are due to chance alone. But first we need to model the data using a probability distribution and understand the variability in the results (sampling error).

[Figure: histogram of the sample data alongside candidate probability density functions]

Sampling Error

The Normal (Gaussian) distribution has two parameters: the mean and the standard deviation. As the shape of the curve changes, the area under the curve changes and we obtain different probabilities.


Focusing on the sample mean, its variability is described by the sampling distribution:

x̄ ~ N[μ, σ²/n]

Normal compared to T Distribution

For a One‐Sample T‐Test: df=n‐1 (sample size minus 1)


When sample sizes are small, it is necessary to use Student's T distribution.

The “degrees of freedom” (df) is simply the shape parameter for the T-Distribution

Normal compared to T Distribution
• As sample size increases, df increases, and
• the T-Distribution increasingly approximates a Normal Distribution


Typical mouse study, n = 6
Typical phase 1 study, n = 12
Incorrect use of "CLT", n = 30
If n > 120, use the Normal distribution
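As a concrete check on these rules of thumb, a minimal sketch (Python with scipy; the tooling choice is an assumption, since the handout itself points readers to SAS, SPSS, STATA and R) compares the two-sided 95% critical value of the t distribution with the Normal value of 1.96 for the sample sizes above:

from scipy import stats

# two-sided 95% critical values of Student's t for the sample sizes on the slide
for label, n in [("mouse study", 6), ("phase 1 study", 12),
                 ("'CLT' rule of thumb", 30), ("large sample", 121)]:
    df = n - 1                              # one-sample test: df = n - 1
    t_crit = stats.t.ppf(0.975, df)         # 97.5th percentile of the t distribution
    print(f"{label:<20} n={n:>3}  df={df:>3}  t crit={t_crit:.3f}  (Normal: 1.960)")

As df grows, the critical value shrinks towards 1.960, which is why the Normal distribution is an adequate approximation for large samples.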


Accuracy of the Sample Mean depends on:
• Sample Size (n)
• Sample Standard Deviation (s)

The variability associated with the sample mean is described by its sampling distribution: x̄ ~ N[μ, σ²/n]

We estimate σ (population standard deviation) with s, so the standard error of the sample mean is s/√n

Sampling Error
The size of the sample determines the precision of the conclusions by taking into account:
• Variability of the measurements
• Sensitivity of the statistical method
• Clinically relevant difference (effect size)

Sampling Error

Understanding confidence intervals

We collect a sample of birth weights from the offspring of mothers who exercised in pregnancy. We wish to know if there is any difference compared to the WA population mean birth weight (population mean μ=3.4kg).

Understanding confidence intervals

From the sample of 152 babies:

Sample Size (n) = 152 newborns
Sample Mean (x̄) = 3.320 kg
Sample Standard Deviation (s) = 0.424 kg
Standard Error = 0.424/√152 = 0.0344 kg

Confidence Intervals are an easy-to-interpret way of expressing the precision with which a statistic (e.g. x̄) estimates a parameter (e.g. μ).


Understanding confidence intervals

Remember the Sampling Distribution: x̄ ~ N[μ, σ²/n]

And we can standardise, where Z = (x̄ - μ) / (s/√n)

For a standard normal random variable Z:

P(-1.96 < Z < 1.96) = 0.95

Note: the Normal rather than the t distribution is used, as the sample size is sufficiently large (n > 120).

Understanding confidence intervals
Therefore

P(-1.96 < (x̄ - μ)/(s/√n) < 1.96) = 0.95

Rearranging gives us

P(x̄ - 1.96(s/√n) < μ < x̄ + 1.96(s/√n)) = 0.95

95% confidence interval for the mean birth weight: x̄ ± 1.96(s/√n)

3.320 ± 1.96 × 0.0344, i.e. 95% CI [3.25, 3.39] kg
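A minimal sketch of the same calculation (Python; the inputs are the summary statistics from the slide):

import math

n, xbar, s = 152, 3.320, 0.424          # sample size, mean (kg) and SD (kg) from the slide
se = s / math.sqrt(n)                   # standard error of the mean, about 0.0344 kg
lower, upper = xbar - 1.96 * se, xbar + 1.96 * se
print(f"SE = {se:.4f} kg; 95% CI = [{lower:.2f}, {upper:.2f}] kg")  # [3.25, 3.39] kg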

But how do we interpret this ?

Understanding confidence intervals

Intuitive: a confidence interval provides a range of plausible values for a parameter, in light of the observed data.

Formal: if a very large number of random samples (all of the same size) were collected from a population, and a 95% CI computed for each sample, then the true value of the parameter would lie in about 95% of these intervals.

Understanding confidence intervals

For 20 random samples, based on a population (true) mean of 100, we could expect around one 95% CI to NOT contain the true mean. This is our Type I Error (α): i.e. 1 in 20, or 0.05, or 5%.
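A minimal simulation of this formal interpretation (Python with numpy; the population standard deviation and sample size are illustrative assumptions, only the true mean of 100 comes from the slide):

import numpy as np

rng = np.random.default_rng(1)
true_mu, sigma, n = 100, 15, 50         # true mean 100 as on the slide; sigma and n assumed
misses = 0
for _ in range(20):                     # 20 random samples, a 95% CI from each
    x = rng.normal(true_mu, sigma, n)
    se = x.std(ddof=1) / np.sqrt(n)
    lo, hi = x.mean() - 1.96 * se, x.mean() + 1.96 * se
    misses += not (lo <= true_mu <= hi)
print(f"{misses} of 20 intervals did not contain the true mean")  # about 1 expected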


Understanding confidence intervals

For constant sample size and standard error, the width of the confidence interval is determined by the significance level (α)

• 99% CI is the widest
• 90% CI is the narrowest

Understanding confidence intervals

Standard Error gives us an idea of the level of precision of our estimate.

We can calculate a Confidence Interval based on:
• the Estimate
• its Standard Error, and
• the Known Properties (distribution) of the Estimate

The CI gives a plausible range of values for the true population parameter.

In the birth weight example, based on the 95% CI [3.25, 3.39 kg], is it likely that the true mean is 3.4 kg?

Interpreting p‐values

“What is the probability that I would have obtained this set of data due to chance alone, under the null hypothesis?”

A Hypothesis Test (aka significance test) assesses a hypothesis that is testable on the basis of observing a process that is modelled via a set of random variables.

Interpreting p‐values

Continuing with the birth weight example, we can write the null (H0) and alternative (H1) hypotheses as:

H0: the true mean birth weight is 3.4 kg (μ = 3.4)
H1: the true mean birth weight is not 3.4 kg (μ ≠ 3.4)

This is a 2-Sided Test as it does not specify the direction in which we are looking for μ to depart from the value specified under the null hypothesis.


Interpreting p‐values

You may be familiar with the formula for a one-sample t-test:

z = (x̄ - μ0) / (s/√n)

x̄ = sample mean
μ0 = mean under the null hypothesis (H0)
s = sample standard deviation
n = sample size

Interpreting p‐values

Substituting in the values from the birth weight example:

z = (3.320 - 3.4) / (0.424/√152) = -2.33

Z is likely to take values close to zero if H0 is correct.
Z is likely to take extreme, large positive or negative values if H0 is incorrect.

Is -2.33 sufficiently extreme to cause us to reject H0?

Interpreting p‐values

In hypothesis testing we assume that H0 is true and for the birth weight example we know that the test statistic (Z) has a Standard Normal Distribution, N[0,1].

So how likely are we to see a result at least as extreme as the one observed (i.e. z < -2.33 or z > 2.33)?

A P‐value is defined to be the probability of getting a test statistic at least as extreme as the one observed, under the assumption that H0 is true.

Interpreting p‐values

The p‐value calculation is based on the area under the curve in each tail of the standard normal distribution below:

p-value = P(Z < -2.33) + P(Z > 2.33) = 0.0099 + 0.0099 = 0.0198


The probability in each "tail" can be obtained from statistical tables or statistical packages.
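For example, a minimal sketch of that lookup (Python with scipy is an assumption here; any statistical package returns the same tail areas):

from scipy import stats

z = -2.33
p_lower = stats.norm.cdf(z)             # P(Z < -2.33), about 0.0099
p_upper = stats.norm.sf(abs(z))         # P(Z >  2.33), about 0.0099
print(f"two-sided p-value = {p_lower + p_upper:.4f}")   # about 0.0198

With the raw birth weights available, scipy.stats.ttest_1samp(weights, 3.4) would return the equivalent one-sample test directly.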

[Figure: standard normal density with both tails beyond ±2.33 shaded; P(z < -2.33) = 0.0099 and P(z > 2.33) = 0.0099]


Interpreting p-values
From the hypothesis test, p-value = 0.0198

Since the p-value is less than 0.05, we reject H0 and
• conclude that the mean birth weight in the exercise group is lower than the WA mean birth weight.

Is this a clinically significant result for birth weight in the exercise group?

mean: 3.32 kg; 95% CI (3.25, 3.39 kg); p = 0.02


Interpreting p‐values

Rejection of the null hypothesis (H0) is determined by the significance level (α, usually 5%, i.e. 0.05).

The significance level is the probability of a type I error (falsely rejecting a true null hypothesis).

Fisher (1965) did not intend the significance level to be fixed at 5%, but rather for it to be set according to circumstances.


A p-value can detect or fail to detect an association. A p-value does not measure the size of the association, so there is no basis for declaring whether there is "strong" evidence.

Interpreting p-values

Statistical analyses

This is a great website that can guide you through the analysis stage if you do not have any statistical support: https://stats.idre.ucla.edu/other/mult‐pkg/whatstat/


Statistical analyses

Links take you to example code for each type of analysis depending on your preferred package.

Sample Size

The significance level or Type I Error (the probability of falsely rejecting the null hypothesis) is illustrated below for a 2-tailed hypothesis test.

Difference: 4.0 - 3.4 = 0.6 kg
Sample size = 10 per group; standard deviation = 0.5

Sample Size

But what is the statistical power of this test? Note: the Type II Error is the probability of falsely failing to reject the null hypothesis.

Sample size = 10 per group; standard deviation = 0.5
Power = 1 - Type II Error = 1 - 0.033 = 0.967 (or 96.7%)
Difference: 4.0 - 3.4 = 0.6 kg

Sample Size
Usually the statistical power of a hypothesis test is required to be at least 80%. In this case we could detect a difference of at least 0.443 kg.

Sample size = 10 per group; standard deviation = 0.5
Power = 1 - Type II Error = 1 - 0.2 = 0.8 (or 80%)
Difference: 3.843 - 3.4 = 0.443 kg

[Figure: overlapping null and alternative distributions, with the Type II Error and Type I Error regions shaded]
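The quoted power and minimum detectable difference can be reproduced with a Normal (z) approximation to a one-sample comparison against the reference mean; a minimal sketch (Python with scipy; treating the calculation as one-sample is an assumption made to match the slide's figures):

from math import sqrt
from scipy import stats

n, sd, alpha = 10, 0.5, 0.05
se = sd / sqrt(n)                               # standard error of the mean
z_crit = stats.norm.ppf(1 - alpha / 2)          # 1.96 for a two-sided 5% test

power_06 = stats.norm.cdf(0.6 / se - z_crit)    # power to detect a 0.6 kg difference
mdd_80 = (z_crit + stats.norm.ppf(0.80)) * se   # smallest difference detectable with 80% power
print(f"power for a 0.6 kg difference: {power_06:.3f}")          # about 0.967
print(f"detectable difference at 80% power: {mdd_80:.3f} kg")    # about 0.443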


Standardised reporting


P-values give the likelihood that an observed association could have arisen solely as a result of chance bias in sampling – they depend on sample size and variation.

Estimate of effect size – gives direction and magnitude.

Confidence intervals are an estimate of variation in effect between samples (chance bias only).  

Standardised reporting

How do we infer causality?
• p-value < 0.05 (significance level)?
• 95% confidence interval does not include 0 (for differences) or 1 (for ratios)?

Statistical significance cannot distinguish causal from non-causal associations.

Standardised reporting

Reporting guidelines provide checklists, depending on the type of study design, to ensure that published research includes sufficient details for us to critically assess the quality of the research: CONSORT, TREND, STROBE, REMARK, STREGA, PRISMA

These guidelines should also be consulted:
• before you Design the Study
• before you Plan the Analysis
• before you Write the Paper
• before you Submit the Paper


Key learning outcomes

• Define the difference between observational and experimental studies

• Understand the effect of chance, bias and confounding in experimental designs

• Explain the variability in estimating parameters from a sample using a confidence interval

• Interpret results from statistical hypothesis testing using a p‐value

• Understand the importance of objective reporting of experimental studies

Questions?

Upcoming Research Skills Seminars:
12 Jun: Actions required for research protocols and analysis plans impacted by COVID-19 (Dr Julie Marsh)
26 Jun: Consumer and Community Involvement (Anne McKenzie AM)

Please give us feedback!
Complete an online survey via: https://redcap.telethonkids.org.au/redcap/surveys/?s=WHHPN7CYTL

*Full 2020 seminar schedule in back of handouts

© CAHS Research Education Program,Department of Child Health Research, Child and Adolescent Health Service, WA 2020

Copyright to this material produced by the CAHS Research Education Program, Department of Child Health Research, Child and Adolescent Health Service, Western Australia, under the provisions of the Copyright Act 1968 (C'wth Australia). Apart from any fair dealing for personal, academic, research or non-commercial use, no part may be reproduced without written permission. The Department of Child Health Research is under no obligation to grant this permission. Please acknowledge the CAHS Research Education Program, Department of Child Health Research, Child and Adolescent Health Service when reproducing or quoting material from this source.


2 ADDITIONAL WEBSITES  

CAHS – COVID-19 and research processes
https://cahs.health.wa.gov.au/Research/COVID19-and-research-processes

UCLA – Institute for Digital Research and Education, Choosing the Correct Statistical Test in SAS, STATA, SPSS and R
https://stats.idre.ucla.edu/other/mult-pkg/whatstat/

Johns Hopkins University – Data Science Specialization
https://www.coursera.org/courses?languages=en&query=johns%20hopkins%20data%20science

3 STATISTICAL ANALYSIS
UCLA Institute for Digital Research & Education
https://idre.ucla.edu/

4 BRADFORD‐HILL CRITERIA FOR CAUSALITY 

Temporal relationship
Strength of relationship
Dose-response
Consistency
Plausibility
Consideration of alternative explanations
Experiment
Specificity
Coherence

5 STATISTICAL SUPPORT CONTACTS

5.1 Perth Children’s Hospital 

Department of Research Child and Adolescent Health Service [email protected]  https://cahs.health.wa.gov.au/en/Research/For-researchers/Support-for-researchers

Biostatistics and Data Management Support through Department of Research [email protected] https://cahs.health.wa.gov.au/en/Research/For-researchers/Support-for-researchers

5.2 Telethon Kids Institute Consultancy Service [email protected]  


5.3 University of Western Australia – The Centre for Applied Statistics
*Offers free advice for UWA Postgraduate Research Students
consulting‐[email protected]

5.4 WAHTN Clinical Trial and Data Management Centre The Clinical Trial and Data Management Centre is a WAHTN enabling platform which aims to enhance clinical trials and related data management in Western Australia. 

The platform is a WAHTN‐wide entity sharing expertise in clinical trial study design (including novel designs), clinical trial conduct, data management, data‐linkage, analytical techniques for clinical trial datasets, bio‐repository techniques and clinical registry datasets. It facilitates the pursuit of large‐scale clinical trials and translational healthcare research in WA. 

https://www.wahtn.org/event/clinical-trial-and-data-management-centre-ctdmc-need-help-with-clinical-research-3/

Contact: [email protected]  08 9266 1970 


Reporting Guidelines
Estimating intervention effects: CONSORT (randomised clinical trials); TREND (non-randomised studies)
Assessing causes & prognosis: STROBE (observational studies)
Quantifying accuracy of diagnosis & prognosis tools: STARD (diagnostic studies); REMARK (tumour marker prognostic studies)
Testing genetic association: STREGA (genetic observational studies)
Aggregating evidence (systematic reviews & meta-analyses): PRISMA (previously known as QUOROM) [MOOSE, IOM]

• CONSORT (CONsolidated Standards Of Reporting Trials)
Moher D, Hopewell S, Schulz KF, Montori V, Gotzsche PC, Devereux PJ, et al. CONSORT 2010 Explanation and Elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869.

• TREND (Transparent Reporting of Evaluations with Non-randomised Designs)
Des Jarlais DC, Lyles C, Crepaz N, TREND Group. Improving the reporting quality of nonrandomized evaluations of behavioral and public health interventions: the TREND statement. Am J Public Health 2004;94:361-6.

• STROBE (STrengthening the Reporting of OBservational studies in Epidemiology)
Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, et al; STROBE Initiative. Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration. PLoS Med 2007;4:1628-54.

• STARD (STAndards for the Reporting of Diagnostic accuracy studies)
Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003;326:41-4.

• REMARK (REporting recommendations for tumour MARKer prognostic studies)
McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM. REporting recommendations for tumour MARKer prognostic studies (REMARK). Br J Cancer 2005;93:387-91.

• STREGA (STrengthening the REporting of Genetic Associations)
Little J, Higgins JP, Ioannidis JP, Moher D, Gagnon F, von Elm E, et al. Strengthening the reporting of genetic association studies (STREGA): an extension of the STROBE statement. Eur J Epidemiol 2009;24:37-55.


Understanding & reporting your research results

Additional Resources

Reporting Guidelines (continued)

• PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses)
Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009;6:e1000097. [Previously known as Quality of Reporting of Meta-analyses or QUOROM]

• MOOSE: Meta-analysis Of Observational Studies in Epidemiology
http://www.consort-statement.org/Media/Default/Downloads/Other%20Instruments/MOOSE%20Statement%202000.pdf

• IOM: Institute of Medicine Standards for Systematic Reviews
https://pubmed.ncbi.nlm.nih.gov/24983062/

Recommended basic statistics textbook

• De Veaux, R.D., Velleman, P.F., and Bock, D. (2012). Stats: Data And Models (3rd Edition), Pearson Education, Boston.

Good source for how to perform statistical analyses (SAS, SPSS, STATA & R)

• http://www.ats.ucla.edu/stat/
• If you can't find how to perform the statistical analysis on this site then you definitely need to consult a statistician!

The following pages are from the back pages of “Stats: Data And Models (2nd Edition)” and provide a great overview of common (basic) statistical formulas, methods and associated assumptions.

[Scanned textbook reference pages follow in the original handout: a Quick Guide to Inference, Assumptions for Inference and the Conditions That Support or Override Them, and Selected Formulas; the scan is not legibly reproduced in this transcript.]

STATISTICAL ERRORS
By Regina Nuzzo

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white.

The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The P value, a common index for the strength of evidence, was 0.01 — usually interpreted as ‘very significant’. Publication in a high-impact journal seemed within Motyl’s grasp.

But then reality intervened. Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study. With extra data, the P value came out as 0.59 — not even close to the conventional level of significance, 0.05. The effect had disappeared, and with it, Motyl's dreams of youthful fame1.

It turned out that the problem was not in the data or in Motyl's analyses. It lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume. "P values are not doing their job, because they can't," says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.

For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false2; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

At the same time, statisticians are looking for better ways of thinking about data, to help scientists to avoid missing important information or acting on false alarms. "Change your statistical philosophy and all of a sudden different things become important," says Steven Goodman, a physician and statistician at Stanford. "Then 'laws' handed down from God are no longer handed down from God. They're actually handed down to us by ourselves, through the methodology we adopt."

OUT OF CONTEXT
P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a "sterile intellectual rake" who ravishes science but leaves it with no progeny3. One researcher suggested rechristening the methodology "statistical hypothesis inference testing"3, presumably for the acronym it would yield.

The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.


The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a 'null hypothesis' that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil's advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.

For all the P value's apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher's bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the P value.

But while the rivals feuded — Neyman called some of Fisher's work mathematically "worse than useless"; Fisher called Neyman's approach "childish" and "horrifying [for] intellectual freedom in the west" — other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher's easy-to-calculate P value into Neyman and Pearson's reassuringly rigorous rule-based system. This is when a P value of 0.05 became enshrined as 'statistically significant', for example. "The P value was never meant to be used the way it's used today," says Goodman.

WHAT DOES IT ALL MEAN?
One result is an abundance of confusion about what the P value means4. Consider Motyl's study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see 'Probable cause'). According to one widely used calculation5, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl's finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another 'very significant' result6,7. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed8 that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny P values and ignore the larger question is to fall prey to the "seductive certainty of significance", says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: "We should be asking, 'How much of an effect is there?', not 'Is there an effect?'"

PROBABLE CAUSE
A P value measures whether an observed result can be attributed to chance. But it cannot answer a researcher's real question: what are the odds that a hypothesis is correct? Those odds depend on how strong the result was and, most importantly, on how plausible the hypothesis is in the first place.

Before the experiment: the plausibility of the hypothesis — the odds of it being true — can be estimated from previous experiments, conjectured mechanisms and other expert knowledge. Three examples are shown here.
The measured P value: a value of 0.05 is conventionally deemed 'statistically significant'; a value of 0.01 is considered 'very significant'.
After the experiment: a small P value can make a hypothesis more plausible, but the difference may not be dramatic.

Chance of a real effect, before the experiment and after observing P = 0.05 or P = 0.01:
The long shot (19-to-1 odds against): 5% before; 11% after P = 0.05; 30% after P = 0.01.
The toss-up (1-to-1 odds): 50% before; 71% after P = 0.05; 89% after P = 0.01.
The good bet (9-to-1 odds in favour): 90% before; 96% after P = 0.05; 99% after P = 0.01.

R. Nuzzo; Source: T. Sellke et al. Am. Stat. 55, 62–71 (2001)

For more on statistics, see: go.nature.com/xlj9lr
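The conversions in the 'Probable cause' box can be reproduced from the minimum Bayes factor bound of Sellke et al. (2001), the source credited in the graphic. A minimal sketch (Python; the code itself is an illustration added to this handout, not part of the original article):

from math import e, log

def chance_no_real_effect(p, prior_prob_effect):
    """Lower bound on P(no real effect | data) using the Sellke et al. bound."""
    bf_h0 = -e * p * log(p)                 # minimum Bayes factor for H0, valid for p < 1/e
    prior_odds_h0 = (1 - prior_prob_effect) / prior_prob_effect
    post_odds_h0 = prior_odds_h0 * bf_h0
    return post_odds_h0 / (1 + post_odds_h0)

for name, prior in [("long shot", 0.05), ("toss-up", 0.50), ("good bet", 0.90)]:
    for p in (0.05, 0.01):
        print(f"{name:<9} prior {prior:.0%}, P = {p}: "
              f"chance of no real effect >= {chance_no_real_effect(p, prior):.0%}")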

Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. "P-hacking," says Simonsohn, "is trying multiple things until you get the desired result" — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: "That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05", and "She is a p-hacker, she always monitors data while it is being collected."

Such practices have the effect of turning discoveries from exploratory studies — which should be treated with scepticism — into what look like sound confirmations but vanish on replication. Simonsohn's simulations have shown9 that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%. P-hacking is especially likely, he says, in today's environment of studies that chase small effects hidden in noisy data. It is tough to pin down how widespread the problem is, but Simonsohn has the sense that it is serious. In an analysis10, he found evidence that many published psychology papers report P values that cluster suspiciously around 0.05, just as would be expected if researchers fished for significant P values until they found one.
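A minimal sketch of the underlying idea (Python with numpy and scipy; this illustrates two researcher "degrees of freedom" and is not Simonsohn's actual simulations): even with no true effect, testing two outcomes and re-testing after adding more subjects pushes the chance of at least one p < 0.05 well above the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, hits = 2000, 0
for _ in range(n_sims):
    # two outcomes per group, 30 subjects per group, null hypothesis true throughout
    a, b = rng.normal(size=(2, 30)), rng.normal(size=(2, 30))
    extra_a, extra_b = rng.normal(size=(2, 10)), rng.normal(size=(2, 10))
    pvals = []
    for k in range(2):
        pvals.append(stats.ttest_ind(a[k], b[k]).pvalue)                  # first look
        pvals.append(stats.ttest_ind(np.r_[a[k], extra_a[k]],
                                     np.r_[b[k], extra_b[k]]).pvalue)     # after adding 10 more per group
    hits += min(pvals) < 0.05
print(f"false-positive rate with flexible analysis: {hits / n_sims:.1%}")  # well above 5%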

NUMBERS GAME
Despite the criticisms, reform has been slow. "The basic framework of statistics has been virtually unchanged since Fisher, Neyman and Pearson introduced it," says Goodman. John Campbell, a psychologist now at the University of Minnesota in Minneapolis, bemoaned the issue in 1982, when he was editor of the Journal of Applied Psychology: "It is almost impossible to drag authors away from their p-values, and the more zeroes after the decimal point, the harder people cling to them"11. In 1989, when Kenneth Rothman of Boston University in Massachusetts started the journal Epidemiology, he did his best to discourage P values in its pages. But he left the journal in 2001, and P values have since made a resurgence.

Ioannidis is currently mining the PubMed database for insights into how authors across many fields are using P values and other statistical evidence. "A cursory look at a sample of recently published papers," he says, "is convincing that P values are still very, very popular."

Any reform would need to sweep through an entrenched culture. It would have to change how statistics is taught, how data analysis is done and how results are reported and interpreted. But at least researchers are admitting that they have a problem, says Goodman. "The wake-up call is that so many of our published findings are not true." Work by researchers such as Ioannidis shows the link between theoretical statistical complaints and actual difficulties, says Goodman. "The problems that statisticians have predicted are exactly what we're now seeing. We just don't yet have all the fixes."

Statisticians have pointed to a number of measures that might help. To avoid the trap of thinking about results as significant or not significant, for example, Cumming thinks that researchers should always report effect sizes and confidence intervals. These convey what a P value does not: the magnitude and relative importance of an effect.

Many statisticians also advocate replacing the P value with methods that take advantage of Bayes' rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity — something that the statistical pioneers were trying to avoid. But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.

Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. Stephen Senn, a statistician at the Centre for Public Health Research in Luxembourg City, likens this to using a floor-cleaning robot that cannot find its own way out of a corner: any data-analysis method will eventually hit a wall, and some common sense will be needed to get the process moving again. If the various methods come up with different answers, he says, "that's a suggestion to be more creative and try to find out why", which should lead to a better understanding of the underlying reality.

Simonsohn argues that one of the strongest protections for scientists is to admit everything. He encourages authors to brand their papers 'P-certified, not P-hacked' by including the words: "We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study." This disclosure will, he hopes, discourage P-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.

A related idea that is garnering attention is two-stage analysis, or 'preregistered replication', says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions in a database such as the Open Science Framework (https://osf.io). They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false alarms being published.

More broadly, researchers need to realize the limits of conventional statistics, Goodman says. They should instead bring into their analysis elements of scientific judgement about the plausibility of a hypothesis and study limitations that are normally banished to the discussion section: results of identical or similar experiments, proposed mechanisms, clinical knowledge and so on. Statistician Richard Royall of Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, said that there are three questions a scientist might want to ask after a study: 'What is the evidence?' 'What should I believe?' and 'What should I do?' One method cannot answer all these questions, Goodman says: "The numbers are where the scientific discussion should start, not end." (See Editorial, p. 131.)

Regina Nuzzo is a freelance writer and an associate professor of statistics at Gallaudet University in Washington DC.

1. Nosek, B. A., Spies, J. R. & Motyl, M. Perspect. Psychol. Sci. 7, 615–631 (2012).
2. Ioannidis, J. P. A. PLoS Med. 2, e124 (2005).
3. Lambdin, C. Theory Psychol. 22, 67–90 (2012).
4. Goodman, S. N. Ann. Internal Med. 130, 995–1004 (1999).
5. Goodman, S. N. Epidemiology 12, 295–297 (2001).
6. Goodman, S. N. Stat. Med. 11, 875–879 (1992).
7. Gorroochurn, P., Hodge, S. E., Heiman, G. A., Durner, M. & Greenberg, D. A. Genet. Med. 9, 325–331 (2007).
8. Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. Proc. Natl Acad. Sci. USA 110, 10135–10140 (2013).
9. Simmons, J. P., Nelson, L. D. & Simonsohn, U. Psychol. Sci. 22, 1359–1366 (2011).
10. Simonsohn, U., Nelson, L. D. & Simmons, J. P. J. Exp. Psychol. http://dx.doi.org/10.1037/a0033242 (2013).
11. Campbell, J. P. J. Appl. Psych. 67, 691–700 (1982).




Research Skills Seminar Series 2020 CAHS Research Education Program

Actions required for research protocols and analysis plans impacted by COVID-19

Friday, 12 June 2020 | 12:30 – 1:30PM

The evolving situation under COVID-19 requires pragmatic solutions for research, as noted in newly developed international guidelines addressing the rights, safety and wellbeing of participants. This seminar will cover common problems investigators are facing during the pandemic, suggest how solutions can be documented, and propose statistical solutions to minimise bias. The seminar will focus on changes that may be required for protocols, analysis plans and Data Safety & Monitoring Committees.

*Online access via Scopia available

*Hosted VC Sites Include: CAHS – Community Health (WASON)

Fiona Stanley Hospital Joondalup Health Campus

Lions Eye Institute Royal Perth Hospital

Further information:

E: [email protected]

W: cahs.health.wa.gov.au/Research/

Go to “For Researchers” then “Research Education Program” https://cahs.health.wa.gov.au/en/Research/For-researchers/Research-Education-Program

The Research Skills Seminar Series is part of the Research Education Program, Department of Child Health Research, Child and Adolescent Health Service, WA Department of Health. Seminars are hosted by WA Department of Health.

Perth Children’s Hospital PCH Auditorium

Level 5 (Pink or Yellow lifts)

15 Hospital Ave, Nedlands

Register Online

ResearchEducationProgram.eventbrite.com

Dr Julie Marsh

Julie is an experienced statistical consultant who worked in the pharmaceutical industry for many years before returning to academia and teaching at UWA. She is a Senior Research Fellow at the Telethon Kids Institute where she is a specialist in adaptive trials.


CAHS Research Education Program | Research Skills Seminar Series 2020

All sessions held on Friday 12:30 – 1:30pm, Perth Children’s Hospital Auditorium

Date Topic (abbreviated titles) Presenter

1 Feb 14 Research Fundamentals: Question and Protocol Development A/Prof Sue Skull

2 Feb 21 Research Governance Sunalene Devadason Helen Hughes

3 Mar 6 Introduction to Good Clinical Practice Natalie Barber

4 Mar 13 Scientific Writing A/Prof Sue Skull

5 Mar 27 Building your Personal Brand as a Researcher: How to get started and maintain momentum on social media Dr Kenneth Lee

6 May 1 Survey Design and Techniques – 2019 version made available A/Prof Sue Skull

8 May 15 Introduction to Adaptive Trials Prof Tom Snelling

9 May 18 Coordination of COVID-19 Research in WA Prof Gary Geelhoed

May 22 Knowledge Translation – 2019 version made available Fenella Gill/Tobias Schoep

10 Jun 5 Introductory Biostatistics: Understanding and reporting research results including P-Values and Confidence Intervals Dr Julie Marsh

11 Jun 12 Information for Clinical Triallists during COVID-19 Dr Julie Marsh

12 Jun 26 Consumer and Community Involvement Anne McKenzie AM

13 Jul 31 Data Collection and Management A/Prof Sue Skull

14 Aug 14 Oral Presentation of Research Results A/Prof Sue Skull

15 Aug 28 Media and Communications in Research Elizabeth Chester

16 Sep 11 Conducting Systematic Reviews Prof Sonya Girdler

17 Sep 18 Involving the Aboriginal Community in Research Glenn Pearson A/Prof Sue Skull

18 Oct 16 Rapid Critical Appraisal of Scientific Literature A/Prof Sue Skull

19 Oct 23 Sample Size Calculations Dr Julie Marsh

20 Nov 13 Grant Applications and Finding Funding A/Prof Sue Skull

21 Nov 27 Qualitative Research Methods Dr Shirley McGough

22 Dec 4 Ethics Processes for Clinical Research in WA A/Prof Sue Skull

Topics may be subject to change; email notice will be provided. All seminars and corresponding handouts are regularly revised and updated. Attendance certificates are available upon request.

E: [email protected] W: CAHS Research Education Program (external site)


https://cahs.health.wa.gov.au/ResearchEducationProgram/