harmonizing statistical evidences and predictions
DESCRIPTION
Poor reproducibility of experimental results has become a systemic problem in biomedicine. One of the main reasons is inadequate statistical analysis. Statistical analysis should be comprehensive, harmonizing statistical evidence and predictions as well as frequentist and Bayesian approaches. It is insufficient to carry out null hypothesis significance testing (NHST) and report P-values. Statistical significance does not mean clinical importance. Effect sizes with confidence and prediction intervals should be reported. Experiments and/or observations should be repeated many times and their agreement should be investigated. The best way is to repeat the experiments independently in different laboratories (in different countries).
TRANSCRIPT
International Life Sciences Workshop “Decision-Making in Biomedical Science – Meet Experts”
September 12 – 16 | 2014 Potsdam | Germany
Harmonizing statistical evidences and predictions
Nikita N. Khromov-Borisov
Pavlov First Saint Petersburg State Medical University Saint Petersburg, Russia
[email protected] +7 952-204-89-49; +7 921-449-29-05
http://independent.academia.edu/NikitaKhromovBorisov https://www.researchgate.net/profile/Nikita_Khromov-Borisov?ev=hdr_xprf
1
Slides are freely available to all
Nikita N. Khromov-Borisov Department of Physics, Mathematics and Informatics
Pavlov First Saint Petersburg State Medical University
+7-952-204-89-49; +7-921-449-29-05 http://independent.academia.edu/NikitaKhromovBorisov
2
The best way to discuss scientific issues is to discuss them in a foreign language
Max Ludwig Henning Delbrück, (September 4, 1906 – March 9, 1981)
Piotr Slonimski (November 9, 1922 – April 25, 2009)
3
Second hand teaching
• The History of Science has suffered greatly from the use by teachers of second-hand material, and the consequent obliteration of the circumstances and the intellectual atmosphere in which the great discoveries of the past were made.
• A first-hand study is always instructive, and often . . . full of surprises.
• Ronald A. Fisher, 1955
• Cited by: Ziliak S.T., McCloskey D.N. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, Ann Arbor, 2008, 321 pp.
• http://stephentziliak.com/
4
Crisis of reproducibility of the results in biomedicine
5
The essences of science are replication and reproducibility
• The essence of science is replication: a scientist should always be concerned about what would happen if he or another scientist were to repeat his experiment.
• Guttman L. What is not what in statistics. The Statistician, 1977; 26(2): 81-107.
• Scientists have elaborated methods of determining the validity of their results.
• They learned to ask the question: are they reproducible?
• Scherr G.H. Irreproducible Science: Editor’s Introduction. In: The Best of the Journal of Irreproducible Results, Workman Publishing, New York, 1983.
• Reproducibility is like the ghost that will always come back to haunt you.
• http://datapede.blogspot.ru/2014/03/part-1z-p-value-surviving-mosquito.html
6
Loscalzo J. Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation, 2012; 125: 1211-1214.
• In Science what is relevant is reproducible results.
• If an initial observation is found to be reproducible, then it must be true.
• If an initial observation is found not to be reproducible, then it must be false.
• Many readers of scientific journals—especially of higher-impact journals—assume that if a study is of sufficient quality to pass the scrutiny of rigorous reviewers, it must be true.
• This assumption is based on the inferred equivalence of reproducibility and truth.
7
• Long ago Fisher . . . recognised that . . . solid knowledge came from a demonstrated ability to repeat experiments . . .
• This is unhappy for the investigator who would like to settle things once and for all, but consistent with the best accounts . . . of the scientific method . . .
• Tukey J.W. The philosophy of multiple comparisons. Statistical Science, 1991; 6: 100-116.
8
Tukey J.W. Analyzing data: Sanctification or detective work? American Psychologist, 1969; 24: 83–91.
• Nothing learned is certain.
• We learn by taking chances.
• Every modern learning theorist expects learning to be by trial, with some errors.
• This is as true for science as for the individual.
• Confirmation comes from repetition.
• Repetition is the basis for judging variability and significance and confidence.
• Repetition of results, each significant, is the basis, according to Fisher, of scientific truth.
• Certainty is an illusion.
• As an illusion, certainty can be wasteful, as well as misleading.
• Data analysis needs to be both exploratory and confirmatory.
9
From the history of epidemiological studies: Risk factors for cancer [Jenks S., Volkers N. Razors and Refrigerators and Reindeer — Oh My!
JNCI, 1992; 84(24):1863]
• Using an electric razor: increased risk of developing leukemia.
• Distal forearm fractures in women: reduction in overall cancer incidence, breast cancer incidence, and incidence of tumors.
• Fluorescent lighting: melanoma in males but not in females.
• Allergies and cancer: at first an inverse relationship was reported. Later, several types of cancer were found to be elevated. However, ovarian cancer risk decreased with increasing numbers of allergies.
• Breeding reindeer: in Swedish Lapps, decreased risks for cancers of the colon, female breast, male genital tract, kidneys, respiratory system, and for lymphomas. However, increased risk for stomach cancer.
10
From the history of epidemiological studies: Risk factors for cancer [Jenks S., Volkers N. Razors and Refrigerators and Reindeer — Oh My! JNCI,
1992; 84(24): 1863]
• Waiters in Norway: Decreased risk of stomach cancer but excess risks of cancers of the liver, rectum, upper respiratory and digestive tracts, and lung. Higher mortality rate from lung cancer.
• Owning a pet bird: Fourfold increase in lung cancer risk among pigeon fanciers (more hazardous than living with a smoker). Owners of budgies, canaries, finches, or parrots were OK.
• Height: Lower risks for some cancers in short men, particularly colorectal cancer, and lower risks for this cancer and for breast cancer in short women. But being tall may confer some advantage for certain cancers (esophageal, endometrial and cervical), while tall men have only a slightly elevated risk for prostate, kidney and colon cancers.
• Refrigerators: seem to protect everyone from stomach cancer.
11
• An extensive list of curious and questionable medical observations about various risk factors was given in:
• Buchanan A.V., Weiss K.M., Fullerton S.M.
• Dissecting complex disease: the quest for the Philosopher’s Stone?
• International Journal of Epidemiology 2006. – Vol. 35. – P. 562–571
12
Table of irreproducible results?
• Hormone replacement therapy and heart disease
• Hormone replacement therapy and cancer
• Stress and stomach ulcers
• Annual physical checkups and disease prevention
• Behavioural disorders and their cause
• Diagnostic mammography and cancer prevention
• Breast self-exam and cancer prevention
• Echinacea and colds
• Vitamin C and colds
• Baby aspirin and heart disease prevention
• Dietary salt and hypertension
• Dietary fat and heart disease
• Dietary calcium and bone strength
• Obesity and disease
• Dietary fibre and colon cancer
• The food pyramid and nutrient RDAs
• Cholesterol and heart disease
• Homocysteine and heart disease
• Inflammation and heart disease
• Olive oil and breast cancer
• Fidgeting and obesity
• Sun and cancer
• Mercury and autism
• Obstetric practice and schizophrenia
• Mothering patterns and schizophrenia
• Anything else and schizophrenia
• Red wine (but not white, and not grape juice) and heart disease
• Syphilis and genes
• Mothering patterns and autism
• Breast feeding and asthma
• Bottle feeding and asthma
• Anything and asthma
• Power transformers and leukaemia
• Nuclear power plants and leukaemia
• Cell phones and brain tumours
• Vitamin antioxidants and cancer, aging
• HMOs and reduced health care cost
• HMOs and healthier Americans
• Genes and you name it!
13
‘Blood group mythology’: myths about AB0
• The human AB0 blood group system can serve as a classic example of unconfirmed associations with various conditions.
• Several incredible phenomena were reported:
• Persons with A have more severe hangovers;
• Persons with B defecate the most;
• Persons with 0 have healthier teeth;
• Military personnel with 0 are spineless, and those with B are more impulsive;
• Persons with B are more prone to crime;
• Strong connection between AB0 and nutrition;
• Persons with A2 have the highest IQ;
• A is significantly more common among members of the higher socio-economic groups.
• None of these associations has been reproduced, and all are virtually forgotten.
14
• Large companies in Japan still use blood types when advertising for, or evaluating, job applicants.
• George Garratty
• Association of Blood Groups and Disease: Do Blood Group Antigens and Antibodies Have a Biological Role?
• History and Philosophy of the Life Sciences, 1996; Vol. 18, No. 3, The First Genetic Marker, p. 321-344.
15
• Only the associations between AB0 blood groups and malignant neoplasms, thrombosis, peptic ulcers, bleeding, and bacterial and viral infections are still regarded as statistically “proven”.
• Alas, these associations have no clinical (practical) importance because their odds ratios (OR) are low and do not exceed OR = 1.5.
16
Associations between AB0 blood groups and diseases, which are still considered to be statistically “proven”
Medical condition A > 0 0 > A B/AB > A/0 OR
Malignancy X 1.2 – 1.3
Thrombosis X
Peptic ulcers X 1.2 – 1.4
Bleeding X 1.5
E. coli / Salmonella X
17
Note that here we meet the extremely important issue of the clinical (or any other practical) importance (significance) of the observed associations. Here clinical importance is demonstrated with one of the measures of effect size, the odds ratio (OR).
Begley C.G., Ellis L.M. Raise standards for preclinical cancer research. Nature, 2012; 483: 531-533.
• Recently Glenn Begley, former vice president of the well-known biotech company Amgen, and his colleague Lee Ellis published the results of their efforts to replicate findings from recent publications in the preclinical oncology literature.
• The data were disturbing.
• Of 53 papers, only 6 (11%) were reproducible.
• Begley and Ellis state that poor reproducibility of results is becoming a systemic problem of modern science.
• In the case of one study, cited more than 1900 times within a short period, even the authors themselves were later unable to reproduce their own results.
18
Increasing replication of un-reproducibility in science
• Gautam Naik: Scientists' Elusive Goal: Reproducing Study Results. The Wall Street Journal, December 2, 2011.
• This is one of medicine’s dirty secrets:
• Most results, including those that appear in top-flight peer-reviewed journals, can’t be reproduced.
19
Macleod M.R., Michie S., Roberts I., Dirnagl U., Chalmers I., Ioannidis J.P.A., Al-Shahi Salman R., Chan A.-W., Glasziou P. Biomedical research: increasing
value, reducing waste. The Lancet, 2014, 383(9912): 101-104
• Of 1575 reports about cancer prognostic markers published in 2005, 1509 (96%) detailed at least one significant prognostic variable.
• However, few identified biomarkers have been confirmed by subsequent research and few have entered routine clinical practice.
• This pattern — initially promising findings not leading to improvements in health care — has been recorded across biomedical research.
• So why is research that might transform health care and reduce health problems not being successfully produced?
20
Ioannidis J.P.A.
Why most published research findings are false.
PLoS Med., 2005. – Vol. 2. – No. 8. – Paper: e124.
Cited by 2174
21
Reproducibility Initiative http://validation.scienceexchange.com/#/
22
• PLOS ONE Launches Reproducibility Initiative
• http://validation.scienceexchange.com/#/
• Reproducibility Initiative receives $1.3M grant to validate 50 landmark cancer studies
• Reproducibility Project: Psychology
• https://osf.io/ezcuj/wiki/home/
• Special Section on Replicability in Psychological Science
• Perspectives on Psychological Science, 2012; 7(6): 528 –530
23
• Journal of Negative Results in BioMedicine is an open access, peer-reviewed, online journal that provides a platform for the publication and discussion of unexpected, controversial, provocative and/or negative results in the context of current tenets.
• Editor-in-Chief
• Bjorn R Olsen, Harvard Medical School
24
Challenges in irreproducible research
• No research paper can ever be considered to be the final word, and the replication and corroboration of research results is key to the scientific process.
• In studying complex entities, especially animals and human beings, the complexity of the system and of the techniques can all too easily lead to results that seem robust in the lab, and valid to editors and referees of journals, but which do not stand the test of further studies.
• http://www.nature.com/nature/focus/reproducibility/index.html
25
Statistics
“A subject which most statisticians find difficult but in which nearly all physicians are expert.”
26
• Statistical flaws are a major cause of irreproducible results in all types of biomedical experimentation.
• These include errors in trial design, data analysis, and data interpretation.
• “If experimentation is the Queen of the sciences, surely statistical methods must be regarded as the Guardian of the Royal Virtue.”
• Myron Tribus (Letter to Science)
27
Statistical Babel
• Unfortunately, statisticians speak different languages, and often do not hear and/or do not understand each other.
• Two main approaches to statistical inference have developed:
• Bayesian and
• Frequentist
• Frequentist inference is subdivided into two main branches:
• Fisherian and
• Neyman-Pearsonian
• Users do not always distinguish between them, which leads to serious confusion.
• Two other approaches also exist: Likelihood and Fiducial inference.
• http://en.wikipedia.org/wiki/Frequentist_inference
28
Babel
29
Fundamental statistics principles
• Random sampling is the main principle of statistics.
• Randomness and the Law of Large Numbers ensure the sample representativeness.
• A sample is called representative if it reflects correctly the distribution from which the sample is taken.
• The main objective of statistics is to analyze random samples in order to draw conclusions about the distributions from which they are drawn.
• Note that we do not need the term “population” which can be misleading.
30
Statistics with confidence
• Can we trust Statistics?
• For instance, how can we check whether a die is perfect (fair, ideal, symmetric) or not?
• The answer is provided by the Law of Large Numbers.
31
Simulation of rolling a die: program SUStats http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html
32
A die was rolled 100 times in each of four independent simulations. Please answer three questions:
1. Are the results of the rolling reproducible (i.e. are the histograms similar)? - Yes - No
2. What form (shape) of the histogram and of the underlying distribution do we expect for the results of rolling a fair die? - Unimodal, bell-shaped - Triangular - Uniform (rectangular)
3. Can we state that the die is fair? - Yes - No
33
A die was rolled 1 000 times in each of four independent simulations. Please answer two questions:
1. Are the results of the rolling reproducible (are the histograms similar)? - Yes - No
2. Can we state that the die is certainly fair? - Yes - No
34
A die was rolled 10 000 times in each of four independent simulations. Please answer two questions:
1. Are the results of the rolling reproducible (are the histograms similar)? - Yes - No
2. Can we state that the die is certainly fair (the histograms are certainly rectangular and the entire distribution is uniform)? - Yes - No
35
Please keep in mind the last number, n = 10 000, which gives reliable results. Such a sample size is difficult to achieve in biomedicine, but it is really reliable.
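The die-rolling demonstration can be approximated without the SUStats applet. The following is only a sketch under stated assumptions: it uses Python's pseudo-random generator in place of a physical die, the function name `max_deviation` and the seeds are hypothetical, and the largest gap between an observed face frequency and 1/6 is just one convenient way to summarize how "rectangular" a histogram looks.

```python
import random
from collections import Counter

def max_deviation(n_rolls, seed):
    """Roll a simulated fair die n_rolls times and return the largest
    absolute gap between an observed face frequency and 1/6."""
    rng = random.Random(seed)  # seeded so that each run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return max(abs(counts[face] / n_rolls - 1 / 6) for face in range(1, 7))

for n in (100, 1_000, 10_000):
    # four independent simulations per sample size, as on the slides
    gaps = [max_deviation(n, seed) for seed in range(4)]
    print(n, [round(g, 3) for g in gaps])
```

The gaps typically shrink as n grows, which is what the histograms for 100, 1 000 and 10 000 rolls illustrate.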
Lyrical digression
• On reflection, it is the Pauli exclusion principle that provides the variety of forms of matter at all levels, from atoms to living beings, e.g., genetic and phenotypic (biochemical, physiological, morphological) variation.
36
Sample size
“She thought that a smaller sample size makes for more accurate results”
37
Sample sizes in physics, chemistry, biology and medicine
• Physicists and chemists work with samples of substances that contain about 6∙10²³ particles (atoms or molecules; the Avogadro constant) in 1 mole of the pure substance.
• Even 1 nanomole of a given substance contains about 6∙10¹⁴ such particles.
• These particles may be regarded as practically identical.
• However, we should not forget that even at the atomic level there are several isotopes of a given chemical element.
• And some of them are radioactive.
• In medicine, researchers are limited by the size of the world population, which is less than 10¹⁰ (specifically, about 7.257∙10⁹).
• See real-time: http://www.worldometers.info/world-population/
• And the human population is extremely heterogeneous.
38
Principal contradiction
• All people are dissimilar, even monozygotic (“identical”) twins.
• Such twins show differences in copy number variation (CNV), immunoglobulins and fingerprints.
• Surely this fact is one of the main sources of the low reproducibility and predictive ability of results in biomedicine.
• Thus, the genetic and phenotypic uniqueness of each person contradicts the statistical methodology, which requires analyzing large numbers (thousands or at least hundreds) of identical persons to reach firm conclusions.
39
What is the Law of Large Numbers?
• If the probability P(A) of an event A is constant in all trials, then the larger the number of trials n (experiments, sample size), the closer the observed (empirical, experimental) relative frequency f(A) of the outcome (event) A comes to its expected (theoretical) probability P(A):
$f(A) \xrightarrow[n \to \infty]{P} P(A)$
• This means that the frequencies become more and more stable and their fluctuations become smaller and smaller.
• Corollary:
• Thus, we may not know the probability of an event A, but by repeating the trial as many times as possible, we can accept its observed frequency f(A) as a reliable statistical estimate of the unknown probability P(A)unkn.
• Statistics helps us to know the unknown.
• In Probability Theory probabilities are known; Statistics estimates them.
40
“Reverse side” of the Law of Large Numbers
• Along with the convergence of the frequency of an event A to its probability, the situation in which the frequency of the event coincides exactly with its probability,
$f(A) = P(A)$,
• becomes less probable,
• i.e. the larger the number of trials, the closer the probability of such an exact match comes to zero:
$\Pr[f(A) = P(A)] \xrightarrow[n \to \infty]{} 0$
41
Probability of the exact coincidence of the frequency f(A) with the probability P(A), e.g., fair coin tossing with P(A) = φ = 0.5
f(A)                   P[f(A)]
5/10                   0.25
50/100                 0.080
500/1 000              0.025
5 000/10 000           0.0080
50 000/100 000         0.0025
500 000/1 000 000      0.00080
42
For the sake of clarity, the probability values are rounded to two significant figures.
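The tabulated probabilities can be recomputed from the binomial distribution: for a fair coin, the probability of exactly n/2 heads in n tosses is C(n, n/2) / 2^n. The sketch below is an illustration, not the author's calculation; the function name `prob_exact_half` is hypothetical, and the log-gamma function is used only to avoid huge intermediate integers.

```python
from math import exp, lgamma, log

def prob_exact_half(n):
    """P(exactly n/2 heads in n tosses of a fair coin), for even n."""
    k = n // 2
    # log of C(n, k) / 2**n, evaluated on the log scale
    log_p = lgamma(n + 1) - 2 * lgamma(k + 1) - n * log(2)
    return exp(log_p)

for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9}  {prob_exact_half(n):.2g}")
```

Each tenfold increase in n divides the probability of an exact match by about √10 ≈ 3.2, in line with the approximation √(2/(πn)) for large n.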
Consequences of the Law of Large Numbers (LLN)
• According to the Law of Large Numbers, the larger the Sample Size, the “better” (more accurately, more reliably) the Sample data reflect the distribution of the Random Variable from which the Sample is drawn.
• Consequently, the larger the sample size, the more representative the Sample.
• This is true, however, if and only if (iff) the Sample data are realizations of independent identically distributed (iid) Random Variables.
43
Statistical estimation
44
What are the main objectives of statistics?
• Statistical Estimation (of the parameters)
• Point and interval estimations
• Statistical Inference
– Testing Statistical Hypotheses
– Comparison of Models
• Statistical Associations
• Correlation and Regression
45
What is an Estimator and what is an Estimate?
• An “Estimator“ is a statistic that is used to infer the value of an unknown parameter in a statistical model.
• The parameter being estimated is sometimes called the estimand.
• In other words, an estimator is a rule for calculating an estimate of a given quantity based on observed data:
• thus the rule and its result (the estimate) are distinguished.
46
Two main kinds of Statistical Estimates
• Point Estimate – estimation by a single number.
• Interval Estimate – estimation by an interval which covers the value of the estimated parameter with a given probability, called the confidence level.
47
The main logic of Statistical Estimation: Point Estimates
• Usually the parameter φunkn is unknown.
• The objective is to estimate it on the basis of observed statistical data
• x1, x2, …, xi, …, xn.
• The above values are regarded as realizations of corresponding iid random variables:
• X1, X2, …, Xi, …, Xn.
• An appropriate function of these random variables is chosen as an Estimator for the unknown parameter.
• Any such function is called a “Statistic”, and it is also a random variable.
• Calculated values of a chosen Estimator are called Estimates.
• An Estimate is regarded as a realization of the given Estimator.
48
Compression of statistical information
• One of the most widely used statistics is the sample mean, which serves as the Estimate of the mean value of the underlying distribution.
• It is calculated as:
$M = \frac{1}{n} \sum_{i=1}^{n} x_i$
• And it is generated by the Estimator:
$\tilde{M} = \frac{1}{n} \sum_{i=1}^{n} \tilde{X}_i$
• Here the tilde “~” marks a random variable.
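The distinction between the Estimator (a rule) and the Estimate (a number) can be made concrete in code. This is a minimal sketch under assumptions that are not in the slides: the samples are simulated rolls of a fair die (true mean 3.5), and the seed and sample size are arbitrary.

```python
import random

def sample_mean(xs):
    """The Estimator: a rule that maps any sample to a single number."""
    return sum(xs) / len(xs)

rng = random.Random(2014)  # arbitrary seed, for repeatability only
# Three independent samples of n = 50 simulated rolls of a fair die
samples = [[rng.randint(1, 6) for _ in range(50)] for _ in range(3)]
# Applying the same rule to different samples yields different Estimates
estimates = [sample_mean(s) for s in samples]
print(estimates)
```

The rule stays fixed while its realized values vary from sample to sample, which is why the Estimator is treated as a random variable.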
49
Example 1
Intrauterine growth restriction (IUGR) and interferon IFN-α/β
50
• Let us consider one of the most common problems of statistical analysis: the comparison of two independent samples.
51
IUGR – intrauterine growth restriction (old name “intrauterine growth retardation”)
• Foetuses with a birth weight below the 10th percentile of those born at the same gestational age
• or
• two standard deviations below the population mean are considered growth restricted.
• Note that the definition is based on statistical terms: the 10th percentile and/or standard deviations.
• More strictly, IUGR should refer to foetuses that are small for gestational age and display other signs of chronic hypoxia or failure to thrive.
• Approximately 3-5% of all pregnancies.
• IUGR is also known as SGA (small for gestational age).
52
A comparison between normal and IUGR babies (Dr. M.C. Bansal)
53
IUGR
54
Normal and IUGR placenta (Dr. M.C. Bansal)
55
56
Levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR (intrauterine growth restriction) (Koroleva L.I.). Data are ranked.
            Healthy                                 IUGR
Rank  IFN-α/β,   Rank  IFN-α/β,     Rank  IFN-α/β,   Rank  IFN-α/β,
      IU/ml            IU/ml              IU/ml            IU/ml
1     38         9     92           1     104        11    144
2     42         10    93           2     121        12    146
3     58         11    94           3     123        13    147
4     59         12    101          4     123        14    149
5     70         13    103          5     127        15    151
6     71         14    115          6     130        16    153
7     81         15    159          7     132        17    162
8     86         16    170          8     134        18    168
                                    9     134        19    171
                                    10    140        20    173
Only three values in the healthy group (115, 159 and 170, highlighted on the original slide) overlap with the values in the IUGR group. The level of IFN-α/β in the IUGR group stochastically dominates that in the healthy group.
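The tabulated values can be checked with a few lines of code. The sketch below only re-enters the data from the table; the variable names are hypothetical. It computes the sample means and sample standard deviations (n − 1 in the denominator), which agree with the Normal(89.500, 36.471) and Normal(141.600, 18.323) fits quoted in the histogram legend, and lists the healthy values that reach into the IUGR range.

```python
from statistics import mean, stdev

# IFN-α/β levels (IU/ml), copied from the table
healthy = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]
iugr = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]

# Healthy values that are not below the smallest IUGR value
overlapping = [h for h in healthy if h >= min(iugr)]

print(mean(healthy), round(stdev(healthy), 3))
print(mean(iugr), round(stdev(iugr), 3))
print(overlapping)
```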
Exploratory and Pictorial Statistics. Visualization of the initial data and
their preliminary statistical descriptions:
histograms, box plots, dominance diagrams, etc.
57
58
Comparisons of histograms for the levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR. Free program PAST http://folk.uio.no/ohammer/past
Comparisons of histograms and cumulative sample distributions for the levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR. Program XLSTAT http://www.xlstat.com
[Figure: Cumulative distributions (Healthy / IUGR); x-axis: IFN-α/β, IU/mL (0 to 200); y-axis: cumulative relative frequency (0 to 1)]
[Figure: Histograms (IFN-α/β, IU/mL) with fitted normal densities: Healthy Normal(89.500, 36.471), IUGR Normal(141.600, 18.323); x-axis: IFN-α/β, IU/mL (0 to 200); y-axis: density]
59
CDF – cumulative distribution functions and stochastic dominance
Program XLSTAT http://www.xlstat.com
• The level of induced IFN-α/β in IUGR patients (green line) stochastically dominates that in healthy mothers (blue line):
• X2 > X1
• Stochastic: randomly determined; having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely.
[Figure: Cumulative distributions (IUGR / Healthy); x-axis: IFN-α/β, IU/mL (0 to 200); y-axis: cumulative relative frequency (0 to 1)]
60
Box-and-Whisker plot
61
Q1 – first quartile, Q3 – third quartile, IQR – interquartile range, σ – standard deviation.
Box-and-whisker plot for the levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR. Free program: Instat+ http://www.reading.ac.uk/ssc/n/n_instat.htm
62
Marks for outliers
95% confidence limits for medians
medians
What did the Box Plot say to the outlier? "Don't you dare get close to my whisker!!"
What is an outlier?
• An outlier is an observation that is numerically distant from the rest of the data.
• Outliers are often indicative of measurement (or registration) errors.
• For example, if the value 1100 is registered for arterial blood pressure, this could be a misprint: either a 1 or a 0 is probably redundant.
• Removing outlier(s) is a controversial practice recommended in several textbooks and manuals.
• However, the possibility should be considered that the underlying distribution of the data is not approximately normal, having “fat (heavy) tails” or representing a mixture of two or more different distributions.
• A mixture may comprise two identical distributions that are shifted relative to each other.
• Thus, removal of outlier(s) has to be based on extra-statistical considerations.
• “I'm not an outlier; I just haven't found my distribution yet!”
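One common convention for marking outliers on a box plot is Tukey's rule: flag values lying more than 1.5 × IQR below Q1 or above Q3. The slides do not state which rule or quartile method the plotting program uses, so the sketch below is an assumption on both counts; it applies the default "exclusive" quartile method of Python's `statistics.quantiles` to the healthy-group values from the IFN-α/β table.

```python
from statistics import quantiles

# Healthy-group IFN-α/β levels (IU/ml), copied from the data table
healthy = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]

q1, _, q3 = quantiles(healthy, n=4)  # quartiles, default "exclusive" method
iqr = q3 - q1                        # interquartile range
low_fence = q1 - 1.5 * iqr           # Tukey's lower fence
high_fence = q3 + 1.5 * iqr          # Tukey's upper fence
flagged = [x for x in healthy if x < low_fence or x > high_fence]
print(q1, q3, flagged)
```

A different quartile definition moves the fences and can change which points are flagged, which is one more reason not to delete such values on statistical grounds alone.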
63
Mixture analysis Program PAST
Component proportion
Mean, M Standard
Deviation, SD
0.88 78.8 22.5
0.12 164.5 5.5
64
Data in the healthy group can be regarded as a mixture of two normal distributions. Their proportions are 88% and 12%. The major component has a sample mean of about M = 79 IU/mL and a standard deviation of SD = 23 IU/mL. The minor component has M = 165 IU/mL and SD = 5.5 IU/mL. However, the sample size (n1 = 16) is too small to draw a firm conclusion.
Effect size
65
• Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals. Updated December 2013.
• iii. Statistics
• Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results.
• When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals).
• Avoid relying solely on statistical hypothesis testing, such as P values, which fail to convey important information about effect size and precision of estimates.
• http://www.icmje.org/recommendations/
• Prediction probabilities and prediction intervals should be added.
66
• Over 300 medical and biomedical journals follow the ICMJE recommendations.
67
Effect Size, ES
• The question of the clinical (practical) importance of the observed Effect Size (ES) is key when interpreting the results of biomedical investigations (e.g., clinical trials).
• Effect Size is defined as a quantitative reflection of the magnitude of some phenomenon that is used for the purpose of addressing a question of interest.
• Kelley K., Preacher K.J. On Effect Size. Psychological Methods, 2012; 17(2): 137–152
• ES can be a difference between mean values, various kinds of ratios, a correlation, an association, etc.
• ES can be expressed either in the real measurement units or as a standardized (nonmetric) quantity.
68
• By analyzing samples we draw conclusions about the distributions from which they are drawn.
• When comparing two independent distributions, the simplest and most useful measure of the effect size is the AUC (or AUROC), the Area Under the (ROC) Curve, which is related to the Mann-Whitney U-statistic.
• One of its representations is the so-called dominance diagram.
69
70
[Figure: Dominance diagram. Healthy axis: 170, 159, 115, 103, 101, 94, 93, 92, 86, 81, 71, 70, 59, 58, 42, 38; IUGR axis: 104, 121, 123, 123, 127, 130, 132, 134, 134, 140, 144, 146, 147, 149, 151, 153, 162, 168, 171, 173]
Dominance diagram. Program XLSTAT http://www.xlstat.com
71
Umin = 35 is a number of “plus” signs, and Umax = 285 is a number of “minus” signs, and obviously: Umin + Umax = 35 + 285 = n1 × n2 = 16 × 20 = 320
• For two independent random variables X and Y ,
• Θ = P(Y > X) + 1/2 P(Y = X)
• is advocated as a general measure of effect size to characterize the degree of separation (or, conversely, overlap) of their distributions.
• It is estimated by statistic
• θ AUC = Umax / (n1 × n2),
• derived by dividing the larger observed value Umax of the Mann–Whitney statistic by the product of the two sample sizes.
• It is equivalent to the observed value of AUC - area under the receiver operating characteristic (ROC) curve.
• It has been termed the ‘probability of concordance’, ‘common language effect size’ and ‘measure of stochastic superiority’.
72
AUC - area under (ROC-) curve
• In the given rectangular matrix, the total number of cells is the product of the two sample sizes:
• n1 × n2 = 20 × 16 = 320
• The observed maximum value of the two additive components of the Mann-Whitney U-statistic is the number of yellow cells in the matrix:
• Umax = 285
• So the point estimate for AUC is:
• AUC = Umax / (n1 × n2) = 285/320 = 0.89
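The counts behind the dominance diagram can be reproduced by comparing every healthy value with every IUGR value. This is a plain pairwise-counting sketch rather than the XLSTAT procedure; the function name `mann_whitney_counts` is hypothetical, and ties, if any, are given half weight in line with Θ = P(Y > X) + 1/2 P(Y = X).

```python
def mann_whitney_counts(x, y):
    """Return (U_x, U_y): U_y counts pairs with y > x, U_x pairs with x > y;
    each tied pair contributes 1/2 to both."""
    u_y = sum((b > a) + 0.5 * (b == a) for a in x for b in y)
    u_x = len(x) * len(y) - u_y
    return u_x, u_y

# IFN-α/β levels (IU/ml), copied from the data table
healthy = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]
iugr = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]

u_healthy, u_iugr = mann_whitney_counts(healthy, iugr)
auc = max(u_healthy, u_iugr) / (len(healthy) * len(iugr))
print(u_healthy, u_iugr, round(auc, 2))
```

With these data there are no ties between the groups, so the two counts are whole numbers that sum to 16 × 20 = 320.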
73
Interval estimation
Researchers should, wherever possible, base discussion and interpretation of results on point and interval estimates
74
What is Confidence Interval?
• A frequentist Confidence Interval is a random interval that covers the unknown value of the estimated Parameter with a specified probability.
• This probability is called the confidence level (or confidence coefficient).
75
CI
• If the experiment is repeated several times, the observed limits of the Confidence Interval calculated from the observations will vary from sample to sample.
• Frequently, with probability (1 − α), the interval will include (cover) the unknown value of the estimated parameter, but with probability α it will inevitably miss it.
• How frequently the observed interval contains the parameter is determined by the confidence level (or confidence coefficient).
• The confidence level is chosen by the researcher in accordance with his intuition.
76 76
Frequentist’s Confidence Interval (CI)
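$P(\tilde{\theta}_{lower} \le \theta_{unkn} \le \tilde{\theta}_{upper}) = 1 - \alpha$
$P(\theta_{unkn} < \tilde{\theta}_{lower}) = \alpha / 2$
$P(\theta_{unkn} > \tilde{\theta}_{upper}) = \alpha / 2$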
77 77
The meaning of the Confidence Level
• The meaning of the term “confidence level” is that, if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will approximately match the confidence level.
• So, e.g., the 95% does not attach to any one frequentist CI; it attaches to “the proportion of such intervals”.
• When only a single CI is obtained, it is unknown whether or not it covers the true value.
• Again, we come to the conclusion that the experiment needs to be repeated many times.
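The frequentist meaning of the confidence level can be illustrated with a small simulation. This is only a sketch under stated assumptions: normally distributed data with a known standard deviation (so a z-interval is used), and hypothetical values for the true mean, the standard deviation and the sample size.

```python
import random
from statistics import NormalDist


def coverage(n_experiments=10_000, n=20, mu=50.0, sigma=10.0, level=0.95, seed=1):
    """Share of simulated experiments whose CI covers the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5           # known-sigma (z) interval
    hits = 0
    for _ in range(n_experiments):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / n_experiments


observed = coverage()
print(f"Share of intervals covering the true mean: {observed:.3f}")
```

The printed share should be close to the chosen confidence level, while any single interval either covers the true value or misses it.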
78
Bayesian confidence (credible) interval
79 79
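P(L ≤ θ ≤ U) = 1 – α
P(θ < L) = α/2
P(θ > U) = α/2
Here the parameter θ is treated as random, while the limits L and U are fixed.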
Significance Level α and Confidence Level (1 – α)
Significance level, α | Confidence level, (1 – α) | Reliability
0.05 | 95% | Low
0.01 | 99% | Medium
0.001 | 99.9% | High
80 80
Confidence interval and statistical significance
The unknown value θunkn estimated by the given interval does not differ statistically significantly from the expected value θ.
The unknown value θunkn estimated by the given interval is statistically significantly smaller than the expected value θ at the significance level α.
The unknown value θunkn estimated by the given interval is statistically significantly larger than the expected value θ at the significance level α.
Legend: expected value of θ; 100(1 – α)% CI for the unknown value θunkn.
81
Statistical significance and practical (clinical) importance
Estimated unknown difference is statistically nonsignificant and clinically unimportant
Estimated unknown difference is statistically significant, but clinically unimportant
CI is too wide; perhaps sample size is too small
Estimated unknown difference is statistically significant and clinically important
Expected “null” value CI
82 82
Clinically indifferent zone or reference interval
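Compact form for the joint presentation of point and interval estimates
• Example:
– AUC point estimate: 0.89
– Lower limit of the 95% CI: 0.72
– Upper limit of the 95% CI: 0.96
• Compact record (lower limit, point estimate, upper limit; the limits are set as subscripts on either side of the point estimate):
• AUC θ = 0.72 0.89 0.96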
• Louis T.A., Zeger S.L. Effective communication of standard errors and confidence intervals. Biostatistics, 2009; 10(1): 1–2.
• Newcombe’s spreadsheet: GENERALISEDMW.XLS http://medicine.cf.ac.uk/primary-care-public-health/resources/
83 83
Statistical inference using confidence interval
• The obtained 95% confidence interval (CI) does not cover the indifferent value AUCindiff = 0.5.
• This means that the unknown value AUCunkn estimated by this interval differs statistically significantly from the indifferent value AUCindiff = 0.5 (at the significance level α = 0.05).
• Consequently, we can conclude that one of the two compared random variables stochastically dominates the other.
• When the shapes of both distributions are similar, we can interpret this result as a statistically significant deviation of the estimated Hodges-Lehmann shift parameter from its indifferent value ΔHLindiff = 0.
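The decision rule used here (does the interval cover the indifferent value?) can be written as a small helper. The function name `ci_verdict` and the wording of the returned labels are illustrative assumptions; the limits 0.72 and 0.96 are the 95% limits for AUC reported in this deck.

```python
def ci_verdict(lower, upper, indifferent_value):
    """Compare a confidence interval with an indifferent (null) value."""
    if upper < indifferent_value:
        return "statistically significantly smaller"
    if lower > indifferent_value:
        return "statistically significantly larger"
    return "no statistically significant difference"


# 95% CI for AUC from this deck against the indifferent value 0.5
auc_verdict = ci_verdict(0.72, 0.96, 0.5)
print(auc_verdict)
```

The significance level of the verdict is α when the interval is a 100(1 – α)% confidence interval.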
84
• Strictly speaking, the widespread interpretation of the Mann-Whitney U-statistic as a measure of the difference between the medians of the two compared distributions is incorrect.
• The Mann-Whitney statistic measures the stochastic dominance of one of two independent distributions over the other (not the difference of their medians).
• When the shapes of both distributions are similar, the Mann-Whitney statistic becomes the basis for estimating the Hodges-Lehmann shift parameter.
85
86
170 159 115 103 101 94 93 92 86 81 71 70 59 58 42 38
104 -66 -55 -11 1 3 10 11 12 18 23 33 34 45 46 62 66
121 -49 -38 6 18 20 27 28 29 35 40 50 51 62 63 79 83
123 -47 -36 8 20 22 29 30 31 37 42 52 53 64 65 81 85
123 -47 -36 8 20 l999=22 29 30 31 37 42 52 53 64 65 81 85
127 -43 -32 12 24 26 33 34 35 41 46 56 57 68 69 85 89
130 -40 -29 15 27 29 36 37 38 44 49 59 60 71 72 88 92
132 -38 -27 17 29 l99=31 l95=38 39 40 46 51 61 62 73 u95=74 90 94
134 -36 -25 19 31 33 40 41 42 48 53 63 64 75 76 92 96
134 -36 -25 19 31 33 40 41 42 48 53 63 64 75 76 92 96
140 -30 -19 25 37 39 46 47 48 54 59 69 70 81 82 98 102
144 -26 -15 29 41 43 50 51 52 58 63 73 74 85 86 102 106
146 -24 -13 31 43 45 52 53 54 60 65 75 76 u999=87 88 104 108
147 -23 -12 32 44 46 53 54 55 61 66 76 77 88 89 105 109
149 -21 -10 34 46 48 55 HL=56 57 63 68 78 79 90 91 107 111
151 -19 -8 36 48 50 57 58 59 65 70 80 81 92 93 109 113
153 -17 -6 38 50 52 59 60 61 67 72 82 83 94 95 111 115
162 -8 3 47 59 61 68 69 70 76 81 91 92 103 104 120 124
168 -2 9 53 65 67 74 75 76 82 87 97 98 109 110 126 130
171 1 12 56 68 70 77 78 u99=79 85 90 100 101 112 113 129 133
173 3 14 58 70 72 79 80 81 87 92 102 103 114 115 131 135
Applying the nonparametric confidence interval for the shift parameter to the comparison of the induced production of IFN-α/β in the healthy group and the group with IUGR. Program StatXact http://www.cytel.com/software-solutions/statxact
• The resulting nonparametric Hodges-Lehmann point and interval estimates of the shift parameter (lower 95% limit, point estimate, upper 95% limit) are:
• ΔHL = 38 56 74 IU/mL
• This 95% confidence interval does not cover the indifferent value of the shift, Δindiff = 0.
• So the unknown value of the shift Δunkn estimated by this interval differs statistically significantly from 0 at the significance level α = 0.05.
• Therefore the induced production of IFN-α/β in the IUGR group is statistically significantly higher than in the healthy group.
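The Hodges-Lehmann estimate is the median of all pairwise differences, so it can be recomputed directly from the matrix shown in this deck. In the sketch below, the rank k = 99 used for the interval limits is an assumption chosen to match the l95 and u95 markers in that matrix; an exact program such as StatXact derives this rank from the Mann-Whitney distribution.

```python
from statistics import median

# Row and column headers of the pairwise-difference matrix in this deck.
rows = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]
cols = [170, 159, 115, 103, 101, 94, 93, 92,
        86, 81, 71, 70, 59, 58, 42, 38]

# All n1 * n2 = 320 pairwise differences (row value minus column value), sorted
diffs = sorted(a - b for a in rows for b in cols)

hl = median(diffs)          # Hodges-Lehmann point estimate of the shift

k = 99                      # assumed rank, matching the l95 / u95 table markers
lower = diffs[k - 1]        # k-th smallest difference
upper = diffs[-k]           # k-th largest difference

print(f"HL shift = {hl:.0f}, 95% limits = {lower} and {upper}")
```

With these data the sketch reproduces the values 38, 56 and 74 reported on the slide.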
87
Applying the parametric confidence interval for the mean difference to the comparison of the induced production of IFN-α/β in the healthy group and the group with IUGR.
Free Program ESCI JSMS.xls http://www.latrobe.edu.au/psy/esci/
• The parametric point and interval estimates of the difference of two means (lower 95% limit, point estimate, upper 95% limit) are:
• Δ = 33 52 71 IU/mL
• This 95% confidence interval does not cover the indifferent value Δindiff = 0.
• So the unknown value of the difference Δunkn estimated by this interval differs statistically significantly from 0 at the significance level α = 0.05.
• Therefore the induced production of IFN-α/β in the IUGR group is statistically significantly higher than in the healthy group.
88
ES Δ = 33.1 52.1 71.0 IU/mL (lower 95% limit, point estimate, upper 95% limit); dC = 1.87; Student t = 5.58
Visualization of the comparison of two means using the confidence interval for the mean difference. Free Program ESCI JSMS.xls
http://www.latrobe.edu.au/psy/esci/
• The presented 95% confidence interval (rose triangle and vertical segment) for the mean difference does not cover the indifferent value Δindiff = 0.
• So the unknown value of the difference Δunkn estimated by this interval differs statistically significantly from 0 at the significance level α = 0.05.
• Therefore the induced production of IFN-α/β in the IUGR group is statistically significantly higher than in the healthy group.
89
Blue circles are observed values. Black dots and vertical segments are point and interval estimates of the unknown means. Rose triangle and vertical segment are estimates of their unknown difference.
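The reported mean difference, Cohen's dC and Student t can be recomputed from the two samples given as the row and column headers of the pairwise-difference matrix in this deck. This is a minimal sketch of the classical equal-variance (pooled) calculation; whether ESCI uses exactly these steps is an assumption.

```python
from math import sqrt
from statistics import mean, variance

# Row and column headers of the pairwise-difference matrix in this deck.
rows = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]
cols = [170, 159, 115, 103, 101, 94, 93, 92,
        86, 81, 71, 70, 59, 58, 42, 38]

n1, n2 = len(rows), len(cols)
diff = mean(rows) - mean(cols)                 # effect size in real units

# Pooled standard deviation (assumes equal variances in both groups)
pooled_var = ((n1 - 1) * variance(rows) + (n2 - 1) * variance(cols)) / (n1 + n2 - 2)
s_pooled = sqrt(pooled_var)

d_c = diff / s_pooled                          # Cohen's standardized effect size
t = diff / (s_pooled * sqrt(1 / n1 + 1 / n2))  # Student t statistic
df = n1 + n2 - 2

print(f"diff = {diff:.1f}, dC = {d_c:.2f}, t = {t:.2f}, df = {df}")
```

The output agrees with the values on the slide: a difference of about 52.1 IU/mL, dC = 1.87 and t = 5.58 with df = 34.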
Newcombe’s standardized effect size: δN or StAUC
• When σ1 = σ2 = σ, θ reduces to
• Φ(δN /√2),
• where δN is the difference of the means expressed in units of the common standard deviation σ.
• Here Φ is the common notation for the CDF (cumulative distribution function) of the standard Gaussian (normal) distribution.
• θ is preferable to δN because it depends less on distributional assumptions and is therefore more satisfactory than the standardized difference.
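The relation θ = Φ(δN/√2) and its inverse can be checked numerically with the Python standard library. The function names below are illustrative; the relation itself assumes two normal distributions with equal standard deviations, as stated above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard Gaussian distribution, mean 0 and sd 1


def auc_from_delta(delta_n):
    """theta = Phi(delta_N / sqrt(2))."""
    return STD_NORMAL.cdf(delta_n / sqrt(2))


def delta_from_auc(theta):
    """Inverse relation: delta_N = sqrt(2) * Phi^-1(theta)."""
    return sqrt(2) * STD_NORMAL.inv_cdf(theta)


for delta in (0.5, 1.0, 2.0):
    print(f"delta_N = {delta:.1f} -> AUC = {auc_from_delta(delta):.2f}")

print(f"AUC = 0.89 -> delta_N = {delta_from_auc(0.89):.1f}")
```

For example, δN = 1 corresponds to θ = 0.76, and the observed AUC of 0.89 corresponds to δN of about 1.7.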
90
Interrelationship between AUC and StAUC
AUC θ | StAUC δN | Size | StAUC δN | AUC θ
0.5 | 0 | | 0 | 0.50
0.55 | 0.18 | XS extra-small | |
 | | | 0.25 | 0.57
0.6 | 0.36 | | 0.5 | 0.64
0.65 | 0.55 | S small | |
 | | | 0.75 | 0.70
0.7 | 0.74 | | 1 | 0.76
0.75 | 0.95 | M medium | |
 | | | 1.25 | 0.81
0.8 | 1.2 | | 1.5 | 0.86
0.85 | 1.5 | L large | |
 | | | 1.75 | 0.89
0.9 | 1.8 | | 2 | 0.92
0.95 | 2.3 | XL extra-large | |
 | | | 2.5 | 0.96
0.99 | 3.3 | | 3 | 0.98
0.999 | 4.4 | XXL extra-extra-large | |
 | | | 3.5 | 0.993
 | | | 4 | 0.998
91
Standardized Cohen’s effect size, StES dC
dC = (M1 – M2) / spooled
92
Standardized effect size (mean difference), StES dC: what it looks like
93
Verbal scale for the interpretation of the standardized Cohen’s effect size
Standardized Cohen’s effect size, dC | Interpretation
0 – 0.5 | Negligibly small (worthless)
0.5 – 1.0 | Small (weak)
1.0 – 1.5 | Moderate
1.5 – 2.0 | Large (strong)
2.0 – 3.0 | Very large (very strong)
> 3.0 | Extremely large
94
Once more: Statistical significance and the Effect size
• An effect (difference, association, correlation, risk, benefit, etc.) can be statistically significant and yet turn out to be of no practical (e.g., clinical) importance.
• “Statistically significant” does not imply “substantial”, “practically important”, “valuable”.
• Effects can be real, nonrandom, but nonetheless, negligibly small.
95
Confidence interval for the Standardized Cohen’s Effect Size dC. Free Program LePrep
http://www.univ-rouen.fr/LMRS/Persopage/Lecoutre/PAC.htm
96
Results: point estimates and 95% confidence intervals for the three main effect sizes (lower limit, point estimate, upper limit)
• AUC – area under the ROC-curve:
• AUC = 0.72 0.89 0.96
• StAUC – Newcombe’s standardized AUC:
• StAUC = δN = 0.8 1.7 2.5
• StES – Cohen’s standardized difference of means:
• StES = dC = 1.1 1.9 2.7
• Verbal interpretation:
• With 95% confidence, the estimated unknown effect sizes can be interpreted as ranging from medium to very large (strong).
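The verbal interpretation can be made reproducible by coding the verbal scale for dC given earlier in this deck. In this sketch, the function name `interpret_d` and the convention that each band includes its lower boundary are assumptions.

```python
# Upper boundaries of the bands of the verbal scale for Cohen's dC in this deck
SCALE = [
    (0.5, "Negligibly small (worthless)"),
    (1.0, "Small (weak)"),
    (1.5, "Moderate"),
    (2.0, "Large (strong)"),
    (3.0, "Very large (very strong)"),
]


def interpret_d(d):
    """Return the verbal label for a standardized effect size d."""
    size = abs(d)                     # the sign only indicates direction
    for upper_boundary, label in SCALE:
        if size < upper_boundary:
            return label
    return "Extremely large"


for name, value in [("lower limit", 1.1), ("point estimate", 1.9), ("upper limit", 2.7)]:
    print(f"{name}: dC = {value} -> {interpret_d(value)}")
```

Applied to dC = 1.1 1.9 2.7, the labels run from "Moderate" at the lower limit to "Very large" at the upper limit.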
97
Statistical predictions and reproducibility
“Prediction is very difficult, especially about the future”
98
Repeat!
• Often it is believed that if the “statistically significant” result is obtained, this excludes the need of repeating the experiment.
• Testing the significance of statistical hypotheses is a method to detect rare events which deserve further investigation.
• Fisher
99
Cumming G. The New Statistics: Why and How. Psychological Science, 2014; 25(1): 7 –29.
• Three problems are central:
• Published research is a biased selection of all research;
• data analysis and reporting are often selective and biased; and
• in many research fields, studies are rarely replicated, so false conclusions persist.
100
Replication
• A single study is rarely, if ever, definitive; additional related evidence is required.
• Such evidence may come from a close replication, which, with meta-analysis, should give more reliable estimates than the original study.
• A more general replication may increase reliability and also provide evidence of generality or robustness of the original finding.
• We need increased recognition of the value of both close and more general replications, and greater opportunities to report them.
101
Reproducibility and predictive ability of P-values and confidence intervals (n = 32). CI dance.
Free program “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.
102
• Thus, it is risky to reach a definite conclusion from a single experiment only.
• Any scientific investigation should be repeated many times.
• And the reproducibility of the results must be studied.
103
Gigerenzer G. We need statistical thinking, not rituals. Behavioral and Brain Sciences, 1998; 21(2): 199-200
• A researcher cannot be unconcerned about:
• “what would happen if additional subjects were to be included into the
experiment?”,
• “what would be the conclusion for the data of these future subjects?”,
• “what would be the conclusion for the whole data?”, or
• “what would happen if this experiment were to be repeated?”
• Asking and answering such questions goes beyond the ritualized
statistical procedures, and is likely to influence the way the authors of
scientific papers interpret experimental findings and conduct their
experiments.
• Prediction probabilities are an unavoidable part of statistical thinking
and the time is come to take them seriously.
104
Prediction and confidence intervals. Program Instat+ http://www.reading.ac.uk/ssc/n/n_instat.htm
105
Reproducibility of the absolute effect size ES for the healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95
106
95% confidence interval for ES Δ is from 33 to 71 IU/mL; 95% prediction interval for it is wider: from 25 to 78 IU/mL.
10-fold increase in sample size
107
If we repeat the experiment 10 times independently, the prediction interval becomes narrower and closer to the confidence interval.
Prediction interval versus confidence interval
• Note that under 10-fold repetition of the experiment the 95% prediction interval comes closer to the observed 95% confidence interval.
• This demonstrates the meaning of the confidence interval as the interval that covers the estimated effect size under manifold (infinite) repetitions of the experiment.
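The widening of a prediction interval relative to a confidence interval can be sketched with a common normal-theory approximation: for a replication whose sample is k times the original, the half-width of the interval is multiplied by √(1 + 1/k). This approximation, the reading of "10-fold" as k = 10, and the function name are assumptions; the programs used for these slides may compute the intervals differently.

```python
from math import sqrt


def prediction_interval(ci_lower, ci_upper, k=1.0):
    """Approximate prediction interval for the estimate in a replication.

    k is the replication sample size relative to the original one
    (k = 1: same size; k = 10: ten times larger). Assumed approximation:
    half-width of the PI = half-width of the CI * sqrt(1 + 1/k).
    """
    center = (ci_lower + ci_upper) / 2
    half_width = (ci_upper - ci_lower) / 2
    widened = half_width * sqrt(1 + 1 / k)
    return center - widened, center + widened


ci = (33.1, 71.0)                      # 95% CI for the mean difference, IU/mL
pi_same = prediction_interval(*ci, k=1)
pi_10x = prediction_interval(*ci, k=10)

print(f"CI: {ci[0]:.1f} to {ci[1]:.1f}")
print(f"PI, same-size replication: {pi_same[0]:.1f} to {pi_same[1]:.1f}")
print(f"PI, 10-fold replication:   {pi_10x[0]:.1f} to {pi_10x[1]:.1f}")
```

For a same-size replication the sketch gives roughly 25 to 79 IU/mL, close to the prediction interval reported in this deck, and for k = 10 the interval shrinks toward the confidence interval.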
108
Reproducibility of the standardized Cohen’s effect size dC for the healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95
109
95% confidence interval for StES dC is from 1.1 to 2.7; the 95% prediction interval for it is wider: from 0.8 to 3.1 (dC is a standardized, unitless quantity).
10-fold increase in sample size
110
If we repeat the experiment 10 times independently, the prediction interval becomes narrower and closer to the confidence interval.
Prediction probabilities, Prep, Psrep and Preprep
111
Probability of a same-sign effect is Prep = 1.0; of a same-sign and significant at α = 0.05 is Psrep = 0.99 and of a same-sign effect with Prep = 0.99 is Preprep = 0.98.
Reproducibility of the P-value when comparing healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95
112
Observed Pval = 3∙10⁻⁶. The 95% prediction interval for it ranges from the extremely small 3∙10⁻¹¹ to the moderate 0.01.
Probabilities of replication and prediction intervals
• Thus, it is predicted that when our experiment is repeated, the probability of obtaining the same sign for the mean difference (expressed as the absolute effect size ES or as Cohen’s standardized effect size dC) will be
• Prep = 1.00.
• And the probability of obtaining a difference of the same sign that is statistically significant at the level α = 0.05 will be
• Psrep = 0.99.
• Moreover, it is predicted that in a future repetition of the experiment the P-value could lie anywhere in a very wide 95% prediction interval, from very low to moderate:
• Pval = 3∙10⁻¹¹ to Pval = 0.01.
113
Main statistical tools and their destination
• Bayes Factor (BF) → comparing statistical models and/or hypotheses
• P-value → statistical hypothesis testing
• Effect Size (ES) → practical (clinical) importance
• Confidence intervals (CI) → visualization of both, the estimates and the hypotheses testing
• Prediction Intervals (PI) → prediction of future repetitions
114
Bayes theorem in action: connecting prior and posterior
probabilities
115
Reverend Thomas Bayes (c. 1702 – April 17, 1761)
116
117
Bayes Factor
• The Bayes factor differs fundamentally from the P-value (Pval).
• The Bayes factor is not a probability in itself but a ratio of probabilities, and it can vary from zero to infinity:
• BF01 = P(Dobs|H0) / P(Dobs|H1)
• BF10 = P(Dobs|H1) / P(Dobs|H0)
• This means that the Bayes factor does more than test the significance of the null hypothesis: it compares the probabilities of obtaining the observed data under both hypotheses.
• However, for this we need a clearer idea of the alternative hypothesis.
Amazing property of Bayes factor in terms of “odds”
118
What are the odds?
• The odds (in favor) of an event A is the ratio of the probability that the event will happen P(A) to the probability that the event will not happen P(Ā):
• O(A) = P(A) : P(Ā) = P(A) : [1 – P(A)]
• Conversely, the odds against an event A is the opposite ratio.
• Such a representation of the probability is familiar to geneticists.
• Famous Mendel’s ratio of 3 : 1 is a representation of the probabilities 3/4 and 1/4 in terms of odds.
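The conversion between probabilities and odds is a one-line formula in each direction. A minimal sketch (the function names are illustrative):

```python
def prob_to_odds(p):
    """Odds in favor of an event with probability p: O = p / (1 - p)."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Probability of an event with odds O in favor: p = O / (1 + O)."""
    return odds / (1 + odds)


# Mendel's 3 : 1 ratio corresponds to the probabilities 3/4 and 1/4
mendel_odds = prob_to_odds(0.75)
print(f"P = 0.75 -> odds = {mendel_odds:.0f} : 1")
print(f"odds 3 : 1 -> P = {odds_to_prob(3):.2f}")
```

The odds against an event are the reciprocal of the odds in favor, here 1 : 3.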
119
Bayes factor BF in terms of odds
• The Bayes factor not only shows how many times the probability P(Dobs|H0) differs from the probability P(Dobs|H1).
• It also shows how many times the posterior odds in favor of one hypothesis against the other (alternative) differ from their prior odds.
• Conversely,
• BF01 = 1/BF10
• Thus, we observe an amazing property of the Bayes factor:
• without knowing the prior and posterior probabilities of either hypothesis, we can quantitatively compare their odds.
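The odds form of Bayes' theorem (posterior odds = Bayes factor × prior odds) can be sketched as follows. The value BF10 = 5555 is the one reported elsewhere in this deck; the two prior probabilities are hypothetical and serve only to show that the same Bayes factor updates any prior by the same multiplicative factor.

```python
def posterior_odds(bf10, prior_odds):
    """Posterior odds of H1 against H0 = BF10 * prior odds of H1 against H0."""
    return bf10 * prior_odds


def odds_to_prob(odds):
    """Convert odds in favor into a probability."""
    return odds / (1 + odds)


bf10 = 5555.0                      # Bayes factor reported in this deck

for prior_p_h1 in (0.5, 0.01):     # hypothetical prior probabilities of H1
    prior = prior_p_h1 / (1 - prior_p_h1)
    post = posterior_odds(bf10, prior)
    print(f"prior P(H1) = {prior_p_h1:.2f} -> posterior odds = {post:.1f}, "
          f"posterior P(H1) = {odds_to_prob(post):.4f}")
```

The ratio of posterior odds to prior odds equals BF10 in both cases, which is the property described above.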
120
BF10 = P(Dobs|H1) / P(Dobs|H0) = [P(H1|Dobs) / P(H0|Dobs)] : [P(H1) / P(H0)]
Interpretation of credibility of Bayes factors BF10 and BF01
121
Bayes factor | Evidence
> 10 000 | Convincing
100 – 1 000 | Very strong
30 – 100 | Strong
10 – 30 | Moderate
3 – 10 | Weak
1 – 3 | Negligible
Read the first column as BF01 for evidence in favor of hypothesis H0 against hypothesis H1, and as BF10 for evidence in favor of hypothesis H1 against hypothesis H0.
John Arbuthnot 29.04.1667 – 27.02.1735
122
Number of Christened in London during 82 years
Year Boys Girls Year Boys Girls
1629 5218 > 4683 1650 2890 > 2722
1630 4858 > 4457 3231 > 2840
4422 > 4102 3220 > 2908
4994 > 4590 3196 > 2959
5158 > 4839 3441 > 3179
5035 > 4820 3655 > 3349
5106 > 4928 3668 > 3382
4917 > 4605 3396 > 3289
4703 > 4457 3157 > 3013
5359 > 4952 3209 > 2781
5366 > 4784 1660 3724 > 3247
1640 5518 > 5332 4748 > 4107
5470 > 5200 5216 > 4803
5460 > 4910 5411 > 4881
4793 > 4617 6041 > 5881
4107 > 3997 5114 > 4858
4047 > 3919 4678 > 4319
3768 > 3395 5616 > 5322
3796 > 3536 6073 > 5560
3363 > 3181 1669 6506 > 5829
1649 3079 > 2746
Year Boys Girls Year Boys Girls
1670 6278 > 5719 1691 7662 > 7392
6449 > 6061 7602 > 7316
6443 > 6120 7676 > 7483
6073 > 5822 6985 > 6647
6113 > 5738 7263 > 6713
6058 > 5717 7632 > 7229
6552 > 5847 8062 > 7767
6423 > 6203 8426 > 7626
6568 > 6033 7911 > 7452
6247 > 6041 1700 7578 > 7061
1680 6548 > 6299 8102 > 7514
6822 > 6533 8031 > 7656
6909 > 6744 7765 > 7683
7577 > 7158 6113 > 5738
7575 > 7127 8366 > 7779
7484 > 7246 7952 > 7417
7575 > 7119 8379 > 7687
7737 > 7214 8239 > 7623
7487 > 7101 7840 > 7380
7604 > 7167 1710 7640 > 7288
1690 7909 > 7302
• Total 484 382 > 454 041
• Total sum 938 423
123
Comparison of the frequentist and Bayesian results
• Testing the homogeneity (independence) of the Arbuthnot data results in:
• Pval ≈ 10⁻⁸
• BF01 = 8∙10¹¹⁷
• From the frequentist point of view, the heterogeneity of the Arbuthnot data is statistically highly significant.
• From the Bayesian point of view, the conclusion is diametrically opposite:
• Obtaining such data is 8∙10¹¹⁷ times more likely under the hypothesis H0 of their homogeneity than under the alternative hypothesis H1 of their heterogeneity.
• Or:
• The posterior odds in favor of the null hypothesis against the alternative hypothesis are 8∙10¹¹⁷ times higher than their prior odds.
124
Bayes Factor, online program Bayes Factor Calculators http://pcl.missouri.edu/bayesfactor
125
Output
• BF01 = 0.00018 and
• BF10 = 1/BF01 = 5555.5
• It is 5555 times more likely to obtain the value of the Student t-test statistic t = 5.58 with df = 34 under H1: Δ ≠ 0 than under H0: Δ = 0.
• According to the verbal scale, such a value of BF10 is interpreted as convincing evidence in favor of H1 against H0.
126
Summary
Statistical evidence (each triple: lower 95% limit, point estimate, upper 95% limit)
• AUC θ = 0.72 0.89 0.96
• StAUC δN = 0.8 1.7 2.5
• StES dC = 1.1 1.9 2.7
• ΔHL = 38 56 74 IU/mL
• Δ = 33 52 71 IU/mL
• BF10 = 5555
• Pval = 3∙10⁻⁶
Statistical predictions
• 95% prediction intervals:
• dC: from 0.8 to 3.1
• Δ: from 25 to 79 IU/mL
• Pval: from 3∙10⁻¹¹ to 0.010
• Probability of replication:
• Psrep = 0.99
127
Example 2
TGT – Thrombin Generation Test
128
Castoldi E., Rosing J. Thrombin generation tests. Thrombosis Research, 2011; 127(Suppl. 3): S21–S25
• Parameters of the thrombin generation curve:
• LT – lag time, min
• TTP – time to peak, min
• PT – peak thrombin, nM
• ETP – endogenous thrombin potential, nM∙min
• V – maximum velocity of thrombin generation, V = PT / (TTP – LT), nM/min
129
Estimation of parameters of TGT, results of traditional NHST and effect sizes. n1 = 40, n2 = 53
Each triple is: lower confidence limit, point estimate, upper confidence limit.
 | LT, min | ETP, nM∙min | TTP, min | PT, nM | V, nM/min
RI | 8.0 – 27.4 | 1290 – 2480 | 17 – 41 | 85 – 192 | 5.3 – 25.4
M1 | 14 16 17 | 1820 1900 1990 | 25 27 28 | 125 134 144 | 11 13 15
M2 | 15 17 19 | 1640 1740 1830 | 29 31 33 | 100 106 113 | 7.1 7.9 8.7
Pval | 0.37 | 0.015 | 0.0012 | 3∙10⁻⁶ | 10⁻⁸
Effect sizes
ΔHL | -3.3 -1.0 1.2 | 52 188 323 | -7.3 -4.6 -1.8 | 14 28 40 | 3.3 4.6 6.0
ES Δ | -3.4 -1.3 0.7 | 43 167 294 | -7.1 -4.5 -2.1 | 17 28 39 | 3.4 5.1 6.7
AUC θ | 0.44 0.55 0.67 | 0.55 0.67 0.77 | 0.68 0.70 0.79 | 0.66 0.77 0.85 | 0.73 0.83 0.90
StAUC δN | -0.61 -0.20 0.22 | 0.19 0.63 1.04 | -1.13 -0.72 -0.28 | 0.53 1.06 1.48 | 0.89 1.36 1.80
StES dC | -0.66 -0.25 0.16 | 0.10 0.52 0.94 | -1.15 -0.73 -0.30 | 0.65 1.09 1.53 | 0.89 1.35 1.80
n1 and n2 – sample sizes of the control and CAD groups; RI – nonparametric reference interval; M1 and M2 – sample means; Pval – P-value; ΔHL – Hodges-Lehmann shift estimate; Δ = M1 – M2 – effect size in real units; θ – area under ROC-curve; δN and dC – Newcombe’s and Cohen’s standardized effect sizes. Programs: Reference Value Advisor, PAST, StatXact, GENERALIZED.xls, ESCI-JSMS.xls, LePrep.
130
Informativeness of TGT parameters 53 CHD patients and 40 people without clinical manifestations of
coronary heart disease (data by Berezovskaya G.A.)
131
dC – standardized Cohen’s effect size, Pval – Р-value, BF10 – Bayes factor for comparison of odds in favor of H1 versus H0, Psrep – probability of statistically significant effect of the same sign (direction) in a replication, Power – “achieved” power, n1 = n2 – minimum sample sizes for replication. Programs: ESCI-JSMS.xls, Online BF Calculator (http://pcl.missouri.edu/bayesfactor), LePrep, G*Power
Syndrome of statistical leniency and credulity
Fallacies and Confusions of Null Hypothesis Significance Testing
(NHST) and P-value
“What does a statistician call it when the heads of 10 rats are cut off and 1 survives?
- Nonsignificant.”
132
P-value
• P-value is the most controversial concept in statistics.
• Many textbook authors and the majority of experimenters do not understand what the final product of significance testing – a P-value – actually means (Gigerenzer, 1998).
• The concept of a P-value lies so far from the intuitive understanding that no ordinary person can hold it in memory.
• ‘‘We rely too much on P values, and most of us really don’t have a clue what they mean.’’
• Lai J., Fidler F., Cumming G. Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 2012; 8: 51-62.
133
What is P-value? What is null hypothesis H0?
• A P-value is the probability of observing data as extreme as, or more extreme than, the actual outcome when the null hypothesis is true.
• When testing null hypothesis we transform data into a test statistic.
• Then the P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
• Usually the null hypothesis is a statement of 'no effect' or 'no difference'.
• The Null Hypothesis is often denoted H0 (read “H-nought”)
134
Null Hypothesis Significance Testing Waltz
• The P value is at the heart of the most common approach to data analysis – Null Hypothesis Significance Testing (NHST).
• Think of NHST as a waltz with three steps:
• (i) State a null hypothesis: that is, there is no effect.
• (ii) Calculate the p value, which is the probability of getting results like ours and more extreme – if the null hypothesis is true.
• (iii) If Pval is sufficiently small, reject the null hypothesis and sound the trumpets:
• our effect is not zero, it's statistically significant!
• Generations of students have been inducted into the rituals of .05 meaning "significant", and .01 "highly significant".
135
Р-value, Рval
• Thus, by definition, the P-value (Pval) is the conditional probability of obtaining the observed value of the difference (dobs) or any larger, less probable value, given that the null hypothesis is true:
• Pval = P(D ≥ dobs|H0).
• In terms of statistical hypothesis testing, the P-value is
• the probability of obtaining the absolute value |tobs| of the test statistic T or any larger, less probable value (i.e., values deviating even more from the expected one)
• under the assumption that the null hypothesis H0 is true:
• Pval = P(|T| ≥ |tobs| | H0).
• Note that the “less probable values” are not observed.
• We infer them out of all possible values in the frame of the chosen (null) model.
136
• A P-value is usually interpreted as a measure of how much evidence we have against the null hypothesis, that is, of how strongly the observed data contradict the null hypothesis.
• The null hypothesis, traditionally represented by the symbol H0, represents the hypothesis of no change or no effect.
• The smaller the P-value, the more (stronger) evidence we have against H0.
137
What is Test Statistic?
• A test statistic is a statistic used for testing the given null hypothesis.
• Example: Student t-test statistic:
• t = (M1 – M2) / s(M1 – M2), df = n1 + n2 – 2,
• where s(M1 – M2) is the standard error of the difference of the two sample means.
• In such a case, testing the null hypothesis H0 on the equality of two independent means (H0: M1 – M2 = 0) reduces to testing the null hypothesis t = 0.
• When this hypothesis is true, the distribution of the t-statistic is known.
• Namely, it is the Student t-distribution.
• This distribution has a single parameter called degrees of freedom, df.
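Given a value of t and its degrees of freedom, the two-sided P-value is the area in both tails of the Student t-distribution. The sketch below computes it with the standard library only, integrating the t density numerically by Simpson's rule; the function names and the number of integration steps are illustrative choices, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20_000):
    """P(|T| >= |t|) under H0, using Simpson's rule on [0, |t|] (steps is even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3      # P(-|t| < T < |t|)
    return 1 - central_area


# The t values and degrees of freedom below are those quoted in this deck
print(f"t = 1.5,  df = 10: P = {two_sided_p(1.5, 10):.2f}")
print(f"t = 3.0,  df = 10: P = {two_sided_p(3.0, 10):.3f}")
print(f"t = 5.58, df = 34: P = {two_sided_p(5.58, 34):.1e}")
```

The first two calls reproduce the P-values 0.16 and 0.013 quoted for df = 10, and the third gives a value of the order of 10⁻⁶ for the healthy versus IUGR comparison.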
138
William Sealy Gosset (June 13, 1876–October 16, 1937) is famous as a statistician, best known by his pen name Student and for his work
on Student's t-distribution.
139
n1 = 5, n2 = 7, df = 10, t = 1.5, P = 0.16 – the difference is statistically nonsignificant
140
http://ftparmy.com/103097-decision-visualizer.html
n1 = 5, n2 = 7, df = 10, t = 3.0, P = 0.013 – the difference is statistically significant at the significance level α = 0.05, but not at 0.01
141
Searching the threshold for the P-value: is it possible?
• When a small P-value is observed, there is an intuitive (extrastatistical) temptation to reject the null hypothesis H0.
• However, there is no statistical reason for regarding any particular P-value as small enough to reject H0 safely.
• Once again, such a decision is extrastatistical.
• In practice, the decision to reject or accept H0 must depend on circumstances.
• In each specific situation the researcher must make this choice for herself or himself.
142
143
Traditional interpretation of the P-values (Pval)
(and their Michelin star scale)
P-value (Pval) | Statistical significance | Michelin stars
> 0.05 | Nonsignificant |
0.05 – 0.01 | Moderately significant | *
0.01 – 0.001 | Significant | **
0.001 – 0.0001 | Highly significant | ***
< 0.0001 | Extremely significant | ****
The four-star value 0.0001 was introduced recently by Harvey J. Motulsky: http://www.graphpad.com/guides/prism/6/statistics/index.htm?interpreting_a_small_p_value_from_an_unpaired_t_test.htm
Tyranny and/or hypnosis of the figures 0.05 and 95%
• Unfortunately, the significance level α = 0.05 is the most commonly used threshold.
• Too often, crossing this threshold (Pval < 0.05) in just a single experiment is regarded as sufficient to reject the null hypothesis and to conclude that the observed effect is statistically significant.
144
Andrey Nikolaevich Kolmogorov (25 April 1903 – 20 October 1987)
• In statistics, the recommended significance level varies from 0.05 for preliminary orientation experiments to 0.001 for important ultimate conclusions, but the attainable reliability of probability conclusions is often much higher.
• Thus, the principal conclusions of statistical physics are based on the neglect of probabilities of an order less than 10⁻¹⁰.
• (1951)
145
http://www.encyclopediaofmath.org/index.php/Probability
Sterne J.A.C., Davey Smith G. Sifting the evidence – what’s wrong with significance tests? BMJ, 2001; 322: 227-231.
• Presently, several other authors echo Kolmogorov:
• A P-value close to 0.05 is not strong evidence against the null hypothesis.
• Only Pval < 0.001 should be regarded as strong evidence against H0.
• In addition to P-values, it is strongly recommended to present confidence intervals for the effect size.
146
“Flexible” P-values
• In fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses;
• he rather gives his mind to each particular case in the light of his evidence and his ideas.
• Fisher R. A. Statistical Methods and Scientific Inference, 1956, pages 41-42.
147
Sir Ronald Aylmer Fisher 17 Feb 1890 - 29 July 1962
148
Warning
• Usually the P-value is interpreted as a measure of the evidence that the available data provide against the null hypothesis.
• Strictly speaking, however, it is not a measure in the mathematical sense.
• It does not possess the additivity property, and moreover,
• it does not satisfy two further important principles of statistical theory: the Likelihood Principle and the P-postulate.
149
Likelihood Principle
• Put in words, the Likelihood Principle states that statistical analysis must operate with those, and only those, data that were actually obtained in the experiment.
• However, the calculation of the P-value (as follows from its definition) uses not only the observed experimental data but also all the other, less probable data that were not in fact observed.
150
Р-postulate
• To serve as a real and adequate measure of statistical evidence, the P-value should satisfy a simple rule (postulate): equal P-values must represent equal evidence against the null hypothesis.
• This rule is called the «P-postulate».
• Obviously, this minimal requirement is not met.
• Wagenmakers E.-J. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 2007; 14(5): 779-804.
151
Р-postulate
• Intuitively one can recognize that Рval = 0.01 in the experiment with 10 observations will not demonstrate the same evidential strength as Рval = 0.01 in the experiment with 300 observations.
• Equally, Рval = 0.001, obtained in one experiment and Рval = 0.01 in another does not imply that the effect observed in the first experiment is 10 times more evidential than in the second.
152
P-value is the realization of corresponding random variable P*
• The P-value is an observed value of the corresponding random variable P*.
• When the null hypothesis H0 is true, P* has the so-called (continuous) standard uniform distribution, that is, the uniform distribution on the interval [0; 1]:
• P* ~ Uni[0; 1].
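The uniform distribution of P-values under a true null hypothesis can be checked by simulation. This is a sketch under stated assumptions: a two-sided one-sample z-test with known standard deviation is used for simplicity, and the sample size, number of simulations and seed are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_p_values(n_sim=20_000, n=10, seed=2):
    """Two-sided z-test P-values for samples drawn under a true H0 (mean 0, sd 1)."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    p_values = []
    for _ in range(n_sim):
        z = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n ** 0.5
        p_values.append(2.0 * (1.0 - phi(abs(z))))
    return p_values


p_values = simulate_p_values()
share_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)

# Shares of P-values falling into ten equal bins of width 0.1
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
shares = [count / len(p_values) for count in bins]

print(f"Share of P-values below 0.05: {share_below_005:.3f}")
print("Shares per bin of width 0.1:", [round(s, 3) for s in shares])
```

Each bin should contain about 10% of the simulated P-values, and about 5% of them fall below 0.05 even though H0 is true in every simulated experiment.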
153
P-value distributions Pike N. free spreadsheet: FDR.xls http://www.webcitation.org/5rxSzU7qL
Δ = μ1 – μ2 = 0; χ2 = 390.6; df = 400; Pval = 0.62
Δ = μ1 – μ2 = 10; χ2 = 1348.8; df = 400; Pval = 4∙10⁻¹⁰¹
154
[Figure: two histograms titled “Frequency distribution of p-values”. Horizontal axis: p-value defining upper limit of range (0.05 to 1.00 in steps of 0.05); vertical axis: frequency of values in range; series: observed frequency and expected frequency.]
These are histograms obtained with 200 simulations.
Reproducibility and predictive ability of P-values and 95% confidence intervals (n = 32). Dance of Pval
Free program “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.
155
156
Reproducibility of the P-value when comparing healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95
157
Observed Pval = 3∙10⁻⁶. The 95% prediction interval for it ranges from the extremely small 3∙10⁻¹¹ to the moderate 0.01.
Popular temptation
• The quintessence of traditional (frequentist) conclusions from statistical hypothesis testing is conventionally interpreted as:
• The smaller the P-value, the stronger the evidence (presented by the data) against the null hypothesis H0 and the greater the reason to doubt H0.
• Hence, whether intentionally or not (and seemingly rather naturally), the temptation arises to interpret the P-value as the probability of the null hypothesis.
158
Popular delusion
• The P-value is not the probability of the null hypothesis!
• The P-value is calculated under the assumption that the null hypothesis H0 is true:
• Pval = P(|D| ≥ |dobs| | H0).
• Hence, the P-value cannot be the probability of the null hypothesis:
• P{D|H0} ≠ P{H0|D}
• For a collection of other fallacies about the P-value see, e.g.:
• http://en.wikipedia.org/wiki/P-value
• Goodman S. A dirty dozen: Twelve P-value misconceptions. Semin. Hematol., 2008; 45: 135-140
159
Calibration of P-values
• Vovk V.G. A logic of probability, with application to the foundations of statistics. Journal of the Royal Statistical Society. Series B (Methodological), 1993; 55(2): 317-351.
• Sellke T., Bayarri M.J., Berger J.O. Calibration of p values for testing precise null hypotheses. The American Statistician, 2001; 55(1): 62-71. Cited by 321
• When $P_{val} < 1/e$: $BF_{01} \ge -e \, P_{val} \ln P_{val}$
• $P(H_0 \mid D) \ge \dfrac{BF_{01}}{1 + BF_{01}}$ - lower bound for the probability of the null hypothesis H0
160
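The calibration of Sellke, Bayarri and Berger can be checked numerically. The following is a minimal sketch, assuming equal prior probabilities for H0 and H1 (an assumption not stated on the slide); the function names are hypothetical. It evaluates the bound −e·Pval·ln(Pval) on BF01 and the resulting lower bound on P(H0|D).

```python
import math


def bf01_lower_bound(p_value):
    """Lower bound on the Bayes factor BF01: -e * p * ln(p), valid for p < 1/e."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the calibration applies only for 0 < p < 1/e")
    return -math.e * p_value * math.log(p_value)


def prob_h0_lower_bound(p_value):
    """Lower bound on P(H0 | D), assuming prior P(H0) = P(H1) = 0.5."""
    bf01 = bf01_lower_bound(p_value)
    return bf01 / (1.0 + bf01)


for p in (0.05, 0.01, 0.001):
    print(f"Pval = {p}: BF01 >= {bf01_lower_bound(p):.3f}, "
          f"P(H0|D) >= {prob_h0_lower_bound(p):.3f}")
```

For Pval = 0.05, 0.01 and 0.001 the bound on P(H0|D) comes out near 0.29, 0.11 and 0.018.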
The “price” of P-values
Observed P-value | Upper limit of 80% interval for Pval | Lower limit for the probability of the null hypothesis P(H0) | Upper limit for the probability of replication Prepr
0.05 | 0.44 | ≥ 29% | < 50%
0.01 | 0.22 | ≥ 11% | < 73%
0.001 | 0.07 | ≥ 1.8% | < 90%
Sellke T., Bayarri M.J., Berger J.O. Calibration of p values for testing precise null hypotheses. The American Statistician, 2001; 55(1): 62-71.
Goodman S.N. A comment on replication, p-values and evidence. Statistics in Medicine, 1992; 11: 875-879.
Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 2008; 3(4): 286-300.
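The upper limits of the 80% interval for Pval can be approximated with a few lines of code. This is a sketch under stated assumptions, not Cumming's spreadsheet: it assumes a z test (normal approximation), a replication of the same size as the original study, a two-tailed observed P-value and a one-tailed replication P-value. The function name is hypothetical; `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_interval_upper(p_obs, coverage=0.80):
    """Upper limit of the central `coverage` interval for a replication p-value."""
    # z score that corresponds to the observed two-tailed p-value
    z_obs = STD_NORMAL.inv_cdf(1.0 - p_obs / 2.0)
    # normal quantile for the central interval (about 1.28 for 80%)
    z_crit = STD_NORMAL.inv_cdf(0.5 + coverage / 2.0)
    # the replication z, given the observed z, is taken to have sd sqrt(2)
    z_low = z_obs - z_crit * 2.0 ** 0.5
    # one-tailed p-value of the least favourable replication within the interval
    return 1.0 - STD_NORMAL.cdf(z_low)


for p in (0.05, 0.01, 0.001):
    print(f"observed Pval = {p}: upper limit of 80% interval = {p_interval_upper(p):.2f}")
```

Under these assumptions the upper limits round to 0.44, 0.22 and 0.07 for observed P-values of 0.05, 0.01 and 0.001.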
161
The problem with p values: how significant are they, really? November 12th, 2013 Geoff Cumming
http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html
A p value of 0.05 has been the default ‘significance’ threshold for nearly 90 years … but is that standard too weak? (Image credit: Martin_Heigan)
162
Funny metaphor
• “Perhaps p values are like mosquitos.
• They have an evolutionary niche somewhere and no amount of scratching, swatting, or spraying will dislodge them”.
• Campbell J.P. Editorial: Some remarks from the outgoing editor. Journal of Applied Psychology, 1982; 67: 691-700
163
• The usefulness of P-values is quite limited, and we continue to suggest that these procedures be euthanized.
• Anderson D.R., Burnham K.P. Avoiding pitfalls when using information-theoretic methods. The Journal of Wildlife Management, 2002; 66(3): 912-918.
164
On seduction:
• Yes, the P-value can seduce.
• It is sexy and we can be blinded.
• A significant P-value can cloud our thinking: we simply get too excited and forget to look at the actual effect size.
• Does that < 0.05 really matter when the effect size is small?
• The study which concluded that the “internet is changing the dynamics and outcomes of marriage itself” is an example.
• This study showed that those who meet their spouses online are less likely to divorce and more likely to have high marital satisfaction (with very significant P-values, of course).
• However, the effect size was very small: happiness, for example, barely moved from 5.48 to 5.64.
• So, do not sign up for match.com thinking that you may be happier with your spouse.
165
Meaning of the P-value: Publish or Perish
166
Pee-value (http://wmbriggs.com/blog/?p=9338)
167
Statistics is the only field in which men boast of their wee p-values
• Johnson V.E. Revised standards for statistical evidence. PNAS, 2013; 110(48): 19313–19317.
• Supporting Information: doi 10.1073/pnas.1313476110
168
Evidence thresholds γ and size of corresponding significance tests α
169
Revised standards for statistical evidence
• A simple strategy for improving the replicability of scientific research includes the following steps:
• (i) Associate statistically significant test results with P values that are less than 0.005.
• (ii) Associate highly significant test results with P values that are less than 0.001 (cf. Kolmogorov) and even 0.0001.
• (iii) Report the Bayes factor in favor of the alternative hypothesis and the default alternative hypothesis that was tested.
170
Revised standards for statistical evidence
• (iv) BF10 > 30 or even > 100 should be considered strong and convincing evidence in favor of the alternative hypothesis H1.
• The proposed modifications of common standards of evidence are intended to reduce the rate of nonreproducibility of scientific results by a factor of 5 or greater.
• Naturally, larger sample sizes are required.
171
Minimum sizes for two independent samples with non-overlapping values required to achieve the lower confidence limits for two measures of the effect size: AUCL and SESL
Lower confidence limits for the effect size (AUCL, StAUCL) | Minimum sample size at confidence levels 0.95, 0.99, 0.999
AUCL | StAUCL | 0.95 | 0.99 | 0.999
0.80 | 1.2 | 10 | 17 | 27
0.90 | 1.8 | 21 | 35 | 56
0.95 | 2.3 | 40 | 69 | 111
0.99 | 3.3 | 194 | 334 | 545
0.999 | 4.4 | 1923 | 3320 | 5418
Extrapolated using Newcombe’s free spreadsheet VISUALISETHETA.xls http://medicine.cf.ac.uk/primary-care-public-health/resources/
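The second column of the table (labelled StAUCL, and SESL in the title) appears to be consistent with the usual conversion from AUC to a standardized effect size, d = √2·Φ⁻¹(AUC). This is an inference from the numbers, not something stated on the slide, and the conversion assumes two normal distributions with equal variances. A minimal sketch with a hypothetical function name:

```python
from statistics import NormalDist


def auc_to_standardized_effect(auc):
    """Standardized effect size d = sqrt(2) * inverse-normal-CDF(AUC).

    Assumes two normal populations with equal variances.
    """
    return 2.0 ** 0.5 * NormalDist().inv_cdf(auc)


for auc in (0.80, 0.90, 0.95, 0.99, 0.999):
    print(f"AUC = {auc}: standardized effect size = {auc_to_standardized_effect(auc):.1f}")
```

Rounded to one decimal, the conversion gives 1.2, 1.8, 2.3, 3.3 and 4.4 for the five AUC values in the table. The sample sizes themselves come from Newcombe's spreadsheet and are not reproduced here.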
172
John Wilder Tukey (June 16, 1915 – July 26, 2000)
• Any research should have at least two stages.
• First stage – exploratory (preliminary, pilot, hypothesis-generating) study.
• Second stage – confirmatory study.
• The second stage is designed on the basis of the results obtained at the first stage.
173
Conclusions
• Poor reproducibility of experimental results has become a systemic problem in biomedicine.
• One of the main reasons for this is inadequate statistical analysis.
• Statistical analysis should be comprehensive, harmonizing statistical evidence and predictions as well as frequentist and Bayesian approaches.
• It is not enough to carry out null hypothesis significance testing (NHST) and report P-values.
174
Conclusions (continued)
• Statistical significance doesn’t mean clinical importance.
• Effect size with confidence and prediction intervals should be reported.
• Experiments and/or observations should be repeated many times and their agreement should be investigated.
• The best way is to repeat the experiments independently in different laboratories (in different countries).
175
Editorial politics
• Journal editors and reviewers should not accept papers for publication if they report the results of a single experiment without independent replication.
• Statistical experts should be included on editorial boards.
• Reviewers should be obliged to re-examine all the calculations.
• To make this possible, free access to the initial (“raw”) data should be ensured.
• Transparency and openness are cornerstones of the scientific method.
176
Francis Galton, 1901
• “I have begun to think that no one ought to publish biometric results, without lodging a well-arranged and well-bound copy of his data in some place where it should be accessible, under reasonable restrictions, to those who desire to verify his work.”
• Galton F. Biometry. Biometrika, 1901; 1(1): 7-10.
• Galton’s suggestion of a data store was revived by Professor Julian Huxley, and a suggestion was made to store measurements in the British Museum of Natural History.
177
• “One of the most common temptations, and one that leads to the greatest disasters, is the temptation of the words: ‘Everybody does it.’”
• Leo Tolstoy
178
Books on Bayesian biostatistics
179
180
Lesaffre E., Lawson A. Bayesian Biostatistics. 2012. Wiley. 534 p.
Broemeling L.D. Bayesian Biostatistics and Diagnostic Medicine. 2007. CRC Press, 216 p.
181
Kruschke J. Doing Bayesian Data Analysis. 2010. Academic Press, 672 p.
182
Downey A.B. Think Bayes: Bayesian Statistics Made Simple. Version 1.0.1, 2012. Green Tea Press: Needham, Massachusetts, 195 p.
Albert J. Bayesian Computation with R. Series: Use R! 2nd ed. 2009, Springer, 299 p.
Free Software
• Educational: SUStats,
http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html
• WinStat http://math.exeter.edu/rparris/winstats.html
• SOCR http://www.socr.ucla.edu/
• Research: R http://cran.r-project.org/
• PAST http://folk.uio.no/ohammer/past/
• Instat+ http://www.reading.ac.uk/ssc/n/software/instat/337/Instat+_v3.37.msi
• Online Bayes Factor Calculator http://pcl.missouri.edu/bayesfactor
• LePAC and LePrep http://www.univ-rouen.fr/LMRS/Persopage/Lecoutre/PAC.htm
• G*Power http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/
• Reference Value Advisor http://www.biostat.envt.fr/spip/spip.php?article63
• Newcombe’s spreadsheets http://medicine.cf.ac.uk/primary-care-public-health/resources/
• Cumming’s spreadsheets ESCI http://www.latrobe.edu.au/psy/esci/
• Harold Kaplan statistical pages http://printmacroj.com/statistics.htm
• Commercial:
• StatXact http://www.cytel.com/software-solutions/statxact
• XLStat http://www.xlstat.com
183
Commercial Software
• StatXact http://www.cytel.com/software-solutions/statxact
• XLStat http://www.xlstat.com
• MedCalc https://www.medcalc.org/
• GraphPad Prism http://www.graphpad.com/
• StatsDirect http://www.statsdirect.com/
• Expensive monsters:
• SAS http://www.sas.com/en_us/home.html
• IBM SPSS http://www-01.ibm.com/software/analytics/spss/
• STATISTICA http://www.statsoft.com/
• John C. Pezzullo’s comprehensive list of statistical software: http://statpages.org/
184
Thank you for your attention
185