harmonizing statistical evidences and predictions

International Life Sciences Workshop “Decision-Making in Biomedical Science – Meet Experts”

September 12 – 16 | 2014 Potsdam | Germany

Harmonizing statistical evidences and predictions

Nikita N. Khromov-Borisov

Pavlov First Saint Petersburg State Medical University Saint Petersburg, Russia

[email protected] +7 952-204-89-49; +7 921-449-29-05

http://independent.academia.edu/NikitaKhromovBorisov https://www.researchgate.net/profile/Nikita_Khromov-Borisov?ev=hdr_xprf

1

Slides are freely available to all

Nikita N. Khromov-Borisov Department of Physics, Mathematics and Informatics

Pavlov First Saint Petersburg State Medical University

[email protected]

+7-952-204-89-49; +7-921-449-29-05 http://independent.academia.edu/NikitaKhromovBorisov

2

The best way to discuss scientific issues is to discuss them in a foreign language

Max Ludwig Henning Delbrück (September 4, 1906 – March 9, 1981)

Piotr Slonimski (November 9, 1922 – April 25, 2009)

3

Second hand teaching

• The History of Science has suffered greatly from the use by teachers of second-hand material, and the consequent obliteration of the circumstances and the intellectual atmosphere in which the great discoveries of the past were made.

• A first-hand study is always instructive, and often . . . full of surprises.

• Ronald A. Fisher, 1955

• Cited by: Ziliak S.T., McCloskey D.N. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, Ann Arbor, 2008, 321 pp.

• http://stephentziliak.com/

4

Crisis of reproducibility of the results in biomedicine

5

The essences of science are replication and reproducibility

• The essence of science is replication: a scientist should always be concerned about what would happen if he or another scientist were to repeat his experiment.

• Guttman L. What is not what in statistics. The Statistician, 1977; 26(2): 81-107.

• Scientists have elaborated methods of determining the validity of their results.

• They learned to ask the question: are they reproducible?

• Scherr G.H. Irreproducible Science: Editor’s Introduction. In The Best of the Journal of Irreproducible Results, Workman Publishing, New York, 1983.

• Reproducibility is like the ghost that will always come back to haunt you.

• http://datapede.blogspot.ru/2014/03/part-1z-p-value-surviving-mosquito.html

6

Loscalzo J. Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation, 2012; 125: 1211-1214.

• In Science what is relevant is reproducible results.

• If an initial observation is found to be reproducible, then it must be true.

• If an initial observation is found not to be reproducible, then it must be false.

• Many readers of scientific journals—especially of higher-impact journals—assume that if a study is of sufficient quality to pass the scrutiny of rigorous reviewers, it must be true.

• This assumption is based on the inferred equivalence of reproducibility and truth.

7

• Long ago Fisher . . . recognised that . . . solid knowledge came from a demonstrated ability to repeat experiments . . .

• This is unhappy for the investigator who would like to settle things once and for all, but consistent with the best accounts . . . of the scientific method . . .

• Tukey J.W. The philosophy of multiple comparisons. Statistical Science, 1991; 6: 100-116.

8

Tukey J.W. Analyzing data: Sanctification or detective work? American Psychologist, 1969; 24: 83–91.

• Nothing learned is certain.

• We learn by taking chances.

• Every modern learning theorist expects learning to be by trial, with some errors.

• This is as true for science as for the individual.

• Confirmation comes from repetition.

• Repetition is the basis for judging variability and significance and confidence.

• Repetition of results, each significant, is the basis, according to Fisher, of scientific truth.

• Certainty is an illusion.

• As an illusion, certainty can be wasteful, as well as misleading.

• Data analysis needs to be both exploratory and confirmatory.

9

From the history of epidemiological studies: Risk factors for cancer [Jenks S., Volkers N. Razors and Refrigerators and Reindeer — Oh My!

JNCI, 1992; 84(24):1863]

• Using an electric razor: Increased risk of developing leukemia.

• Distal forearm fractures in women: Reduction in overall cancer incidence, breast cancer incidence, and incidence of tumors.

• Fluorescent lighting: Melanoma in males but not in females.

• Allergies and cancer: At first the inverse relationship. Later several types of cancer were elevated. However, ovarian cancer risk decreased with increasing numbers of allergies.

• Breeding reindeer: in Swedish Lapps decreased risks for cancers of the colon, female breast, male genital tract, kidneys, respiratory system, and for lymphomas. However, increased risk for stomach cancer.

10

From the history of epidemiological studies: Risk factors for cancer [Jenks S., Volkers N. Razors and Refrigerators and Reindeer — Oh My! JNCI,

1992; 84(24): 1863]

• Waiters in Norway: Decreased risk of stomach cancer but excess risks of cancers of the liver, rectum, upper respiratory and digestive tracts, and lung. Higher mortality rate from lung cancer.

• Owning a pet bird: Fourfold increase in lung cancer risk among pigeon fanciers (more hazardous than living with a smoker). Owners of budgies, canaries, finches, or parrots were OK.

• Height: Lower risks for some cancers in short men, particularly colorectal cancer, and lower risks for this cancer and for breast cancer in short women. But being tall may confer some advantage for certain cancers (esophageal, endometrial and cervical), while tall men have only a slightly elevated risk for prostate, kidney and colon cancers.

• Refrigerators: Seem to protect everyone from stomach cancer.

11

• An extensive list of curious and questionable medical observations about various risk factors is given in:

• Buchanan A.V., Weiss K.M., Fullerton S.M. Dissecting complex disease: the quest for the Philosopher’s Stone? International Journal of Epidemiology, 2006; 35: 562–571.

12

Table of irreproducible results?

• Hormone replacement therapy and heart disease
• Hormone replacement therapy and cancer
• Stress and stomach ulcers
• Annual physical checkups and disease prevention
• Behavioural disorders and their cause
• Diagnostic mammography and cancer prevention
• Breast self-exam and cancer prevention
• Echinacea and colds
• Vitamin C and colds
• Baby aspirin and heart disease prevention
• Dietary salt and hypertension
• Dietary fat and heart disease
• Dietary calcium and bone strength
• Obesity and disease
• Dietary fibre and colon cancer
• The food pyramid and nutrient RDAs
• Cholesterol and heart disease
• Homocysteine and heart disease
• Inflammation and heart disease
• Olive oil and breast cancer
• Fidgeting and obesity
• Sun and cancer
• Mercury and autism
• Obstetric practice and schizophrenia
• Mothering patterns and schizophrenia
• Anything else and schizophrenia
• Red wine (but not white, and not grape juice) and heart disease
• Syphilis and genes
• Mothering patterns and autism
• Breast feeding and asthma
• Bottle feeding and asthma
• Anything and asthma
• Power transformers and leukaemia
• Nuclear power plants and leukaemia
• Cell phones and brain tumours
• Vitamin antioxidants and cancer, aging
• HMOs and reduced health care cost
• HMOs and healthier Americans
• Genes and you name it!

13

‘Blood group mythology’: myths about AB0

• The human blood group system AB0 can serve as a classic example of unconfirmed associations with various conditions.

• Several incredible phenomena were reported:

• Persons with A have more severe hangovers;

• Persons with B defecate the most;

• Persons with 0 have more healthy teeth;

• Military with 0 are spineless and with B are more impulsive;

• Persons with B are more prone to crime;

• Strong connection between AB0 and nutrition;

• Persons with A2 have the highest IQ;

• A is significantly more common among members of the higher socio-economic groups.

• None of these associations has been reproduced, and they are virtually forgotten.

14

• Large companies in Japan still use blood types when advertising for, or evaluating, job applicants.

• George Garratty

• Association of Blood Groups and Disease: Do Blood Group Antigens and Antibodies Have a Biological Role?

• History and Philosophy of the Life Sciences, 1996; Vol. 18, No. 3, The First Genetic Marker, p. 321-344.

15

• Only the associations between AB0 blood groups and malignant neoplasms, thrombosis, peptic ulcers, bleeding, and bacterial and viral infections are still regarded as statistically “proven“.

• Alas, these associations have no clinical (practical) importance because of the low values of the odds ratio (OR), which do not exceed OR = 1.5.

16

Associations between AB0 blood groups and diseases, which are still considered to be statistically “proven”

Medical condition A > 0 0 > A B/AB > A/0 OR

Malignancy X 1.2 – 1.3

Thrombosis X

Peptic ulcers X 1.2 – 1.4

Bleeding X 1.5

E. coli / Salmonella X

17

Note that here we meet the extremely important issue of the clinical (or any other practical) importance (significance) of the observed associations. Here clinical importance is expressed with one of the measures of effect size, the odds ratio (OR).

Begley C.G., Ellis L.M. Raise standards for preclinical cancer research. Nature, 2012; 483: 531-533.

• Recently Glenn Begley, former vice president of the well-known biotech company Amgen, and his colleague Lee Ellis published the results of their efforts to replicate findings from “landmark” publications in preclinical cancer research.

• The data were disturbing.

• Of 53 papers, only 6 (11%) were reproducible.

• Begley and Ellis state that the poor reproducibility of results has become a systemic problem of modern science.

• In one case, a study that had been cited more than 1900 times within a short period could later not be reproduced even by its own authors.

18

Increasing replication of un-reproducibility in science

• Gautam Naik: Scientists' Elusive Goal: Reproducing Study Results. The Wall Street Journal, December 2, 2011.

• This is one of medicine’s dirty secrets:

• Most results, including those that appear in top-flight peer-reviewed journals, can’t be reproduced.

19

Macleod M.R., Michie S., Roberts I., Dirnagl U., Chalmers I., Ioannidis J.P.A., Al-Shahi Salman R., Chan A.-W., Glasziou P. Biomedical research: increasing value, reducing waste. The Lancet, 2014; 383(9912): 101-104.

• Of 1575 reports about cancer prognostic markers published in 2005, 1509 (96%) detailed at least one significant prognostic variable.

• However, few identified biomarkers have been confirmed by subsequent research and few have entered routine clinical practice.

• This pattern — initially promising findings not leading to improvements in health care — has been recorded across biomedical research.

• So why is research that might transform health care and reduce health problems not being successfully produced?

20

Ioannidis J.P.A.

Why most published research findings are false.

PLoS Med., 2005. – Vol. 2. – No. 8. – Paper: e124.

Cited by 2174

21

Reproducibility Initiative http://validation.scienceexchange.com/#/

22

• PLOS ONE Launches Reproducibility Initiative

• http://validation.scienceexchange.com/#/

• Reproducibility Initiative receives $1.3M grant to validate 50 landmark cancer studies

• Reproducibility Project: Psychology

• https://osf.io/ezcuj/wiki/home/

• Special Section on Replicability in Psychological Science

• Perspectives on Psychological Science, 2012; 7(6): 528 –530

23

• Journal of Negative Results in BioMedicine is an open access, peer-reviewed, online journal that provides a platform for the publication and discussion of unexpected, controversial, provocative and/or negative results in the context of current tenets.

• Editor-in-Chief

• Bjorn R Olsen, Harvard Medical School

24

Challenges in irreproducible research

• No research paper can ever be considered to be the final word, and the replication and corroboration of research results is key to the scientific process.

• In studying complex entities, especially animals and human beings, the complexity of the system and of the techniques can all too easily lead to results that seem robust in the lab, and valid to editors and referees of journals, but which do not stand the test of further studies.

• http://www.nature.com/nature/focus/reproducibility/index.html

25

Statistics

“A subject which most statisticians find difficult but in which nearly all physicians are expert.”

26

• Statistical flaws are a major cause of irreproducible results in all types of biomedical experimentation.

• These include errors in trial design, data analysis, and data interpretation.

• “If experimentation is the Queen of the sciences, surely statistical methods must be regarded as the Guardian of the Royal Virtue.”

• Myron Tribus (Letter to Science)

27

Statistical Babel

• Unfortunately, statisticians speak different languages, and often do not hear and/or do not understand each other.

• There are two main approaches to statistical inference:

• Bayesian and

• Frequentist

• Frequentist inference is subdivided into two main branches:

• Fisherian and

• Neyman-Pearsonian

• Users do not always distinguish between them, which leads to serious confusion.

• Two other approaches also exist: Likelihood and Fiducial inference.

• http://en.wikipedia.org/wiki/Frequentist_inference

28

Babel

29

Fundamental statistics principles

• Random sampling is the main principle of statistics.

• Randomness and the Law of Large Numbers ensure the sample representativeness.

• A sample is called representative if it reflects correctly the distribution from which the sample is taken.

• The main objective of statistics is to analyze random samples in order to draw conclusions about the distributions from which they are drawn.

• Note that we do not need the term “population” which can be misleading.

30

Statistics with confidence

• Can we trust Statistics?

• For instance, how can we check whether a die is perfect (fair, ideal, symmetric) or not?

• The answer is provided by the Law of Large Numbers.

31

Simulation of rolling a die: program SUStats http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html

32

A die was rolled 100 times in each of four independent simulations. Please answer three questions:

1. Are the results of the rolling reproducible (i.e. are the histograms similar)?
- Yes
- No

2. What form (shape) of the histogram and of the underlying distribution do we expect for the results of rolling a fair die?
- Unimodal, bell-shaped
- Triangular
- Uniform (rectangular)

3. Can we state that the die is fair?
- Yes
- No


33

A die was rolled 1 000 times in each of four independent simulations. Please answer two questions:

1. Are the results of the rolling reproducible (are the histograms similar)?
- Yes
- No

2. Can we state that the die is certainly fair?
- Yes
- No


34

A die was rolled 10 000 times in each of four independent simulations. Please answer two questions:

1. Are the results of the rolling reproducible (are the histograms similar)?
- Yes
- No

2. Can we state that the die is certainly fair (the histograms are certainly rectangular and the entire distribution is uniform)?
- Yes
- No


35

Please keep in mind the last number, n = 10 000, which gives reliable results. Such a sample size is difficult to achieve in biomedicine, but it is really reliable.
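The three die-rolling slides can be imitated with a few lines of Python. This is only a minimal sketch, not the SUStats applet itself: the function names `roll_frequencies` and `max_deviation` and the seed are illustrative assumptions. It rolls a simulated fair die n times and reports how far the observed relative frequencies are from the expected 1/6.

```python
import random


def roll_frequencies(n_rolls, rng):
    """Relative frequency of each face (1..6) in n_rolls of a simulated fair die."""
    counts = [0] * 6
    for _ in range(n_rolls):
        counts[rng.randint(1, 6) - 1] += 1
    return [c / n_rolls for c in counts]


def max_deviation(freqs):
    """Largest absolute gap between an observed frequency and the expected 1/6."""
    return max(abs(f - 1 / 6) for f in freqs)


if __name__ == "__main__":
    rng = random.Random(2014)  # arbitrary seed, chosen only for repeatability
    for n_rolls in (100, 1_000, 10_000):
        # Four independent simulations per sample size, as on the slides.
        for run in range(1, 5):
            freqs = roll_frequencies(n_rolls, rng)
            shown = " ".join(f"{f:.3f}" for f in freqs)
            print(f"n={n_rolls:>6} run={run}: {shown}  max dev={max_deviation(freqs):.3f}")
```

With n = 100 the four runs typically look quite different from each other, while with n = 10 000 all of them should be close to a flat (uniform) histogram.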

Lyrical digression

• If one ponders it, it is the Pauli exclusion principle that provides the variety of forms of matter at all levels, from atoms to living beings, e.g., genetic and phenotypic (biochemical, physiological, morphological) variations.

36

Sample size

“She thought that a smaller sample size makes for more accurate results”

37

Sample sizes in physics, chemistry, biology and medicine

• Physicists and chemists work with samples of various substances that contain about 6∙10²³ particles (atoms or molecules) per mole of pure substance (the Avogadro constant).

• Even 1 nanomole of a given substance contains about 6∙10¹⁴ such particles.

• These particles may be regarded as practically identical.

• However, we should not forget that even at the atomic level there are several isotopes of a given chemical element.

• And some of them are radioactive.

• In medicine, researchers are limited by the size of the world population, which is less than 10¹⁰ (specifically, about 7.257∙10⁹).

• See real-time: http://www.worldometers.info/world-population/

• And the human population is extremely heterogeneous.

38

Principal contradiction

• All people are dissimilar, even monozygotic (“identical”) twins.

• Even in such twins, differences in copy number variation (CNV), immunoglobulins and fingerprints are observed.

• Surely this fact is one of the main sources of the low reproducibility and predictive ability of results in biomedicine.

• Thus, the genetic and phenotypic uniqueness of each person contradicts the statistical methodology, which requires the analysis of large numbers (thousands or at least hundreds) of identical persons to reach reliable conclusions.

39

What is the Law of Large Numbers?

• If the probability P(A) of an event A is constant in all trials, then the larger the number of trials n (experiments, sample size), the closer the observed (empirical, experimental) relative frequency f(A) of a given outcome (event) A comes to its expected (theoretical) probability P(A):

$$ f(A) \xrightarrow[n \to \infty]{P} P(A) $$

• This means that the frequencies become more and more stable and their fluctuations become smaller and smaller.

• Corollary:

• Thus, we may not know the probability of an event A, but by repeating the trial as many times as possible we can accept its observed frequency f(A) as a reliable statistical estimate of the unknown probability P(A)unkn.

• Statistics helps us to know the unknown.

• In Probability Theory probabilities are known; Statistics estimates them.

40

“Reverse side” of the Law of Large Numbers

• Along with the convergence of the frequency of an event A to its probability, the situation in which the frequency of the event coincides exactly with its probability,

$$ f(A) = P(A), $$

becomes less probable,

• i.e. the larger the number of trials, the closer the probability of such an exact match comes to zero:

$$ P[f(A) = P(A)] \xrightarrow[n \to \infty]{} 0 $$

41

Probability of the exact coincidence of the frequency f(A) with the probability P(A), e.g., fair coin tossing with P(A) = φ = 0.5

f(A)                    P[f(A)]
5/10                    0.25
50/100                  0.080
500/1 000               0.025
5 000/10 000            0.0080
50 000/100 000          0.0025
500 000/1 000 000       0.00080

42

For the sake of clarity, the probability values are rounded to two significant figures.
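The probabilities in this table are binomial: for a fair coin, the chance of exactly n/2 heads in n tosses is C(n, n/2) / 2^n. The sketch below recomputes them; the function name `prob_exact_half` is an illustrative assumption, and the log-gamma form is used only so that n = 1 000 000 does not require huge integers.

```python
from math import exp, lgamma, log


def prob_exact_half(n):
    """P(exactly n/2 heads in n tosses of a fair coin); n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    k = n // 2
    # log of C(n, k) / 2**n, using lgamma(m + 1) = log(m!)
    log_p = lgamma(n + 1) - 2 * lgamma(k + 1) - n * log(2)
    return exp(log_p)


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        # Two significant figures, as in the table on the slide.
        print(f"{n // 2}/{n}: {prob_exact_half(n):.2g}")
```

Each tenfold increase in n reduces the probability of an exact match by a factor of about √10 ≈ 3.2, which is why the values alternate between 25, 80, 25, 80 in their leading digits.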

Consequences of the Law of Large Numbers (LLN)

• According to the Law of Large Numbers, the larger the Sample Size, the “better” (more accurately, more reliably) the Sample data reflect the distribution of the Random Variable from which the Sample is drawn.

• Consequently, the larger the sample size, the more representative is the Sample.

• This is true, however, if and only if (iff) the Sample data are the realizations of the independent identically distributed (iid) Random Variables.


43

Statistical estimation

44

What are the main objectives of statistics?

• Statistical Estimation (of the parameters)

• Point and interval estimations

• Statistical Inference

– Testing Statistical Hypotheses

– Comparison of Models

• Statistical Associations

• Correlation and Regression

45

What is Estimator and what is Estimate?

• An “Estimator“ is a statistic that is used to infer the value of an unknown parameter in a statistical model.

• The parameter being estimated is sometimes called the estimand.

• In other words, an estimator is a rule for calculating an estimate of a given quantity based on observed data:

• thus the rule and its result (the estimate) are distinguished.

46

Two main kinds of Statistical Estimates

• Point Estimate – estimation by a single number.

• Interval Estimate – estimation by an interval, which covers the value of the estimated parameter with a given probability called the confidence level.

47

The main logic of Statistical Estimation: Point Estimates

• Usually the parameter φunkn is unknown.

• The objective is to estimate it on the basis of observed statistical data

• x1, x2, …, xi, …, xn.

• The above values are regarded as realizations of corresponding iid random variables:

• X1, X2, …, Xi, …, Xn.

• An appropriate function of these random variables is chosen as an Estimator for the unknown parameter.

• Any such function is called a “Statistic”, and it is also a random variable.

• Calculated values of a chosen Estimator are called Estimates.

• An Estimate is regarded as a realization of the given Estimator.

48

Compression of statistical information

• One of the most widely used statistics is the sample mean, which plays the role of the Estimate of the mean value of the underlying distribution.

• It is calculated as:

$$ M = \frac{1}{n} \sum_{i=1}^{n} x_i $$

• And it is generated by the Estimator:

$$ \tilde{M} = \frac{1}{n} \sum_{i=1}^{n} \tilde{X}_i $$

• Here the tilde “~” is the symbol of a random variable.
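The distinction between the Estimator (a rule) and the Estimate (a number) can be sketched in code. This is only an illustration under stated assumptions: `sample_mean` stands for the Estimator, and `draw_sample` is a hypothetical data source (rolls of a fair die with true mean 3.5) that is not part of the original slides.

```python
import random


def sample_mean(xs):
    """Estimator: a rule that maps any sample to a single number."""
    return sum(xs) / len(xs)


def draw_sample(n, rng):
    """Hypothetical data source: n rolls of a fair die (true mean 3.5)."""
    return [rng.randint(1, 6) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(48)  # arbitrary seed
    # The same Estimator applied to five different samples gives five
    # different Estimates: the Estimator behaves as a random variable.
    for run in range(1, 6):
        estimate = sample_mean(draw_sample(30, rng))
        print(f"sample {run}: estimate M = {estimate:.2f}")
```

The function is fixed, while its output changes from sample to sample; each printed value is one realization of the Estimator.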

49

Example 1

Intrauterine growth restriction (IUGR) and interferon IFN-α/β

50

• Let us consider one of the most common problems of statistical analysis: the comparison of two independent samples.

51

IUGR – intrauterine growth restriction (old name “intrauterine growth retardation”)

• Foetuses with a birth weight below the 10th percentile of those born at the same gestational age, or two standard deviations below the population mean, are considered growth restricted.

• Note that the definition is based on statistical terms: 10th percentile and/or standard deviations.

• More strictly IUGR should refer to foetuses that are small for gestational age and display other signs of chronic hypoxia or failure to thrive.

• Approximately 3-5% of all pregnancies.

• IUGR also known as SGA (small for gestational age).

52

A comparison between normal and IUGR babies (Dr. M.C. Bansal)

53

IUGR

54

Normal and IUGR placenta (Dr. M.C. Bansal)

55

56

Levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR (intrauterine growth restriction) (Koroleva L.I.). Data are ranked.

Healthy: IFN-α/β, IU/ml

Rank   IU/ml     Rank   IU/ml
1      38        9      92
2      42        10     93
3      58        11     94
4      59        12     101
5      70        13     103
6      71        14     115
7      81        15     159
8      86        16     170

IUGR: IFN-α/β, IU/ml

Rank   IU/ml     Rank   IU/ml
1      104       11     144
2      121       12     146
3      123       13     147
4      123       14     149
5      127       15     151
6      130       16     153
7      132       17     162
8      134       18     168
9      134       19     171
10     140       20     173

Only three values in the healthy group (115, 159 and 170 IU/ml, highlighted on the slide) overlap with the values in the IUGR group. The level of IFN-α/β in the IUGR group stochastically dominates that in the healthy group.

Exploratory and Pictorial Statistics. Visualization of the initial data and their preliminary statistical descriptions: histograms, box plots, dominance diagrams, etc.

57

58

Comparisons of histograms for the levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR. Free program PAST

http://folk.uio.no/ohammer/past

Comparisons of histograms and cumulative sample distributions for the levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR. Program XLSTAT http://www.xlstat.com

[Figure: Cumulative distributions (Healthy / IUGR). X-axis: 0 to 200; y-axis: cumulative relative frequency, 0 to 1. Series: Healthy, IUGR.]

[Figure: Histograms (IFN-α/β, IU/mL). X-axis: IFN-α/β, IU/mL, 0 to 200; y-axis: density, 0 to 0.025. Fitted curves: Healthy Normal(89.500, 36.471); IUGR Normal(141.600, 18.323).]

59

CDF – cumulative distribution functions and stochastic dominance

Program XLSTAT http://www.xlstat.com

• The level of induced IFN-α/β in IUGR patients (green line) stochastically dominates that in healthy mothers (blue line):

• X2 > X1

• Stochastic - randomly determined; having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely.

[Figure: Cumulative distributions (IUGR / Healthy). X-axis: 0 to 200; y-axis: cumulative relative frequency, 0 to 1. Series: IUGR, Healthy.]
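The stochastic dominance seen on the cumulative distribution plot can be checked directly on the ranked data. The following is a minimal sketch, not the XLSTAT procedure: the helper names `ecdf` and `dominates` are illustrative assumptions, and "dominates" is taken here in the first-order sense, i.e. the empirical CDF of the dominating sample is never above that of the other sample.

```python
from bisect import bisect_right

# IFN-alpha/beta levels (IU/ml) from the table of ranked data.
HEALTHY = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]
IUGR = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]


def ecdf(sample):
    """Return the empirical CDF of the sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def dominates(a, b):
    """True if sample a empirically (first-order) stochastically dominates b.

    Both ECDFs are step functions that change only at observed values,
    so it is enough to compare them on the pooled set of observations.
    """
    f_a, f_b = ecdf(a), ecdf(b)
    grid = sorted(set(a) | set(b))
    never_above = all(f_a(x) <= f_b(x) for x in grid)
    somewhere_below = any(f_a(x) < f_b(x) for x in grid)
    return never_above and somewhere_below


if __name__ == "__main__":
    print("IUGR dominates Healthy:", dominates(IUGR, HEALTHY))
    print("Healthy dominates IUGR:", dominates(HEALTHY, IUGR))
```

This is a descriptive check on the two samples only; it says nothing yet about statistical significance.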

60

Box-and-Whisker plot

61

Q1 – first quartile, Q3 – third quartile, IQR – interquartile range, σ – standard deviation.

Box-and-whisker plot for the levels of induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20 mothers of newborns with IUGR. Free program: Instat+ http://www.reading.ac.uk/ssc/n/n_instat.htm

62

[Figure annotations: marks for outliers; 95% confidence limits for medians; medians.]

What did the Box Plot say to the outlier? "Don't you dare get close to my whisker!!"

What is outlier?

• An outlier is an observation that is numerically distant from the rest of the data.

• Outliers are often indicative of measurement (or registration) errors.

• For example, if the value 1100 is registered for the arterial blood pressure, this could be a misprint: either a 1 or a 0 is probably redundant.

• Removing outliers is a controversial practice recommended in several textbooks and manuals.

• However, the possibility should be considered that the underlying distribution of the data is not approximately normal, having “fat (heavy) tails” or representing a mixture of two or more different distributions.

• A mixture may comprise two identical distributions that are shifted relative to each other.

• Thus, the removal of outliers has to be based on extra-statistical considerations.

• “I'm not an outlier; I just haven't found my distribution yet!”
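As an illustration of how a box plot marks outliers, the sketch below applies the common 1.5 × IQR fence rule to the two IFN-α/β samples. This is an assumption: the slides do not state which rule Instat+ uses, and the result depends on how the quartiles are computed (here the "inclusive" method of Python's `statistics.quantiles`). The function name `tukey_outliers` is illustrative.

```python
from statistics import quantiles

# IFN-alpha/beta levels (IU/ml) from the table of ranked data.
HEALTHY = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]
IUGR = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]


def tukey_outliers(sample, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in sample if x < low or x > high]


if __name__ == "__main__":
    print("Healthy outliers:", tukey_outliers(HEALTHY))
    print("IUGR outliers:", tukey_outliers(IUGR))
```

Under these assumptions the two largest values in the healthy group are flagged. Whether they are errors or a second component of a mixture cannot be decided by the rule itself.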

63

Mixture analysis. Program PAST

Component proportion    Mean, M    Standard deviation, SD
0.88                    78.8       22.5
0.12                    164.5      5.5

64

Data in the healthy group can be regarded as a mixture of two normal distributions. Their proportions are 88% and 12%. The major component has a sample mean of about M = 79 IU/mL and a standard deviation of SD = 23 IU/mL. The minor component has M = 165 IU/mL and SD = 5.5 IU/mL. However, the sample size (n1 = 16) is too small to draw a firm conclusion.

Effect size

65

• Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals. Updated December 2013.

• iii. Statistics

• Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results.

• When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals).

• Avoid relying solely on statistical hypothesis testing, such as P values, which fail to convey important information about effect size and precision of estimates.

• http://www.icmje.org/recommendations/

• Prediction probabilities and prediction intervals should be added.

66

• Over 300 medical and biomedical journals follow the ICMJE recommendations.

67

Effect Size, ES

• The question of the clinical (practical) importance of the observed Effect Size (ES) is key when interpreting the results of biomedical investigations (e.g., clinical trials).

• Effect Size is defined as a quantitative reflection of the magnitude of some phenomenon that is used for the purpose of addressing a question of interest.

• Kelley K., Preacher K.J. On Effect Size. Psychological Methods, 2012; 17(2): 137–152

• ES can be a difference between mean values, various kinds of ratios, a correlation, an association, etc.

• ES can be expressed either in the real measurement units, or

• as standardized (nonmetric) quantity.

68

• By analyzing samples we draw conclusions about the distributions from which they are drawn.

• When two independent distributions are compared, the simplest and most useful measure of effect size is the AUC (or AUROC), the Area Under the (ROC) Curve, which is related to the Mann-Whitney U-statistic.

• One of its representations is the so-called dominance diagram.

69

70

Dominance diagram. Program XLSTAT http://www.xlstat.com

[Figure: Dominance diagram. Columns (Healthy, descending): 170, 159, 115, 103, 101, 94, 93, 92, 86, 81, 71, 70, 59, 58, 42, 38. Rows (IUGR, ascending): 104, 121, 123, 123, 127, 130, 132, 134, 134, 140, 144, 146, 147, 149, 151, 153, 162, 168, 171, 173.]

71

Umin = 35 is the number of “plus” signs and Umax = 285 is the number of “minus” signs, so that Umin + Umax = 35 + 285 = n1 × n2 = 16 × 20 = 320

• For two independent random variables X and Y ,

• Θ = P(Y > X) + 1/2 P(Y = X)

• is advocated as a general measure of effect size to characterize the degree of separation (or, conversely, overlap) of their distributions.

• It is estimated by statistic

• θ AUC = Umax / (n1 × n2),

• derived by dividing the larger observed value Umax of the Mann–Whitney statistic by the product of the two sample sizes.

• It is equivalent to the observed value of AUC - area under the receiver operating characteristic (ROC) curve.

• It has been termed the ‘probability of concordance’, ‘common language effect size’ and ‘measure of stochastic superiority’.

72

AUC - area under (ROC-) curve

• In the given rectangular matrix the total number of cells is the product of the two sample sizes:

• n1 × n2 = 16 × 20 = 320

• The observed maximum of the two additive components of the Mann-Whitney U-statistic is the number of yellow cells in the matrix:

• Umax = 285

• So the point estimate for AUC is:

• AUC = Umax / (n1 × n2) = 285/320 = 0.89
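The counts in the dominance diagram can be reproduced by comparing every healthy value with every IUGR value. The sketch below is a plain pairwise count, with ties counted as one half in line with Θ = P(Y > X) + 1/2 P(Y = X); the function name `mann_whitney_u` is an illustrative assumption and no statistical package is used.

```python
# IFN-alpha/beta levels (IU/ml) from the table of ranked data.
HEALTHY = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]
IUGR = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]


def mann_whitney_u(x, y):
    """Return (U_x, U_y): numbers of pairs won by x and by y, ties count 1/2."""
    u_y = sum(1.0 if b > a else 0.5 if b == a else 0.0 for a in x for b in y)
    u_x = len(x) * len(y) - u_y
    return u_x, u_y


if __name__ == "__main__":
    u_healthy, u_iugr = mann_whitney_u(HEALTHY, IUGR)
    n_pairs = len(HEALTHY) * len(IUGR)
    auc = max(u_healthy, u_iugr) / n_pairs
    print(f"U_min = {min(u_healthy, u_iugr):.0f}, U_max = {max(u_healthy, u_iugr):.0f}")
    print(f"AUC = {auc:.2f}")
```

The AUC can be read as the estimated probability that a randomly chosen IUGR value exceeds a randomly chosen healthy value.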

73

Interval estimation

Researchers should, wherever possible, base discussion and interpretation of results on point and interval estimates

74

What is Confidence Interval?

• Frequentist’s Confidence Interval is a random interval that covers the estimated (unknown) value of a given Parameter with the specified probability.

• Such probability is called confidence level (or confidence coefficient).

75 75

CI

• If the experiment is repeated several times, the observed limits of the Confidence Interval calculated from the observations will vary from sample to sample.

• Frequently, with probability (1 − α), the interval will include (cover) the unknown value of the estimated parameter, but with probability α it will inevitably miss it.

• How frequently the observed interval contains the parameter is determined by the confidence level (or confidence coefficient).

• The confidence level is chosen by the researcher in accordance with his or her intuition.

76 76

Frequentist’s Confidence Interval (CI)

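$$ P\left(\tilde{\theta}_{lower} \le \theta_{unkn} \le \tilde{\theta}_{upper}\right) = 1 - \alpha $$

$$ P\left(\theta_{unkn} < \tilde{\theta}_{lower}\right) = \frac{\alpha}{2} $$

$$ P\left(\theta_{unkn} > \tilde{\theta}_{upper}\right) = \frac{\alpha}{2} $$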

77 77

The meaning of the Confidence Level

• The meaning of the term “confidence level” is that, if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will approximately match the confidence level.

• So, e.g., the 95% does not attach to the one frequentist CI, it attaches to “the proportion of such intervals”.

• When only a single CI is obtained, it is unknown whether it covers the true value or not.

• Again, we come to a conclusion about the need to repeat the experiment many times.
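The frequentist meaning of the confidence level can be illustrated by simulation. The sketch below makes several simplifying assumptions that are not in the slides: the data are normal, the standard deviation is known (so a z-interval for the mean is used), and all numbers (true mean 100, σ = 15, n = 20) are hypothetical. The function name `coverage` is illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(true_mean, sigma, n, level, n_experiments, rng):
    """Share of repeated experiments whose CI for the mean covers true_mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g. about 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(n_experiments):
        # One "experiment": n observations and the interval built from them.
        m = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    rng = random.Random(77)  # arbitrary seed
    share = coverage(true_mean=100.0, sigma=15.0, n=20,
                     level=0.95, n_experiments=5_000, rng=rng)
    print(f"Share of 95% intervals covering the true mean: {share:.3f}")
```

Each individual interval either covers the true value or misses it; only the long-run proportion of covering intervals is close to 95%.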

78

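Bayesian confidence (credible) interval

$$ P\left(L \le \tilde{\theta} \le U\right) = 1 - \alpha $$

$$ P\left(\tilde{\theta} < L\right) = \frac{\alpha}{2} $$

$$ P\left(\tilde{\theta} > U\right) = \frac{\alpha}{2} $$

79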

Significance Level α and Confidence Level (1 – α)

Significance level, α    Confidence level, (1 − α)    Reliability
0.05                     95%                          Low
0.01                     99%                          Medium
0.001                    99.9%                        High

80 80

Confidence interval and statistical significance

The unknown value θunkn estimated by the given interval does not differ statistically from the expected value θ.

The unknown value θunkn estimated by the given interval is statistically significantly smaller than the expected value θ at the significance level α.

The unknown value θunkn estimated by the given interval is statistically significantly larger than the expected value θ at the significance level α.

Expected value of θ

100(1 – α)% CI for the unknown value θunkn
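
The three cases on this slide amount to checking where the expected value lies relative to the interval limits. A minimal sketch follows; the function name and the returned labels are illustrative.

```python
def compare_with_expected(lower, upper, expected):
    """Read statistical significance off a 100(1 - alpha)% confidence interval."""
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    if upper < expected:
        # The whole interval lies below the expected value.
        return "significantly smaller"
    if lower > expected:
        # The whole interval lies above the expected value.
        return "significantly larger"
    # The interval covers the expected value.
    return "no significant difference"
```

For example, an interval from 0.72 to 0.96 compared with the expected value 0.5 falls into the "significantly larger" case.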

81

Statistical significance and practical (clinical) importance

Estimated unknown difference is statistically nonsignificant and clinically unimportant

Estimated unknown difference is statistically significant, but clinically unimportant

CI is too wide; perhaps sample size is too small

Estimated unknown difference is statistically significant and clinically important

Expected “null” value CI

82 82

Clinically indifferent zone or reference interval

Compact form for the joint presentation of point and interval estimates

• Example:

– AUC point estimate: 0.89

– Lower limit of the 95% CI: 0.72

– Upper limit of the 95% CI: 0.96

• Compact record (lower limit, point estimate, upper limit):

• AUC θ = 0.72 0.89 0.96

• Louis T.A., Zeger S.L. Effective communication of standard errors and confidence intervals. Biostatistics, 2009; 10(1): 1–2.

• Newcombe’s spreadsheet: GENERALISEDMW.XLS http://medicine.cf.ac.uk/primary-care-public-health/resources/

83 83

Statistical inference using confidence interval

• Obtained 95% confidence interval (CI) does not cover the indifferent value AUCindiff = 0.5.

• This means that the unknown value of AUCunkn estimated with this interval statistically significantly differs from the indifferent value AUCindiff = 0.5 (under the significance level α = 0.05).

• Consequently, we can conclude that one of the two compared random variables stochastically dominates the other.

• When the shapes of both distributions are similar we can interpret this result as the statistically significant deviation of the estimated Hodges-Lehmann shift parameter from its indifferent value ΔHLindiff = 0.

84

• Strictly speaking, the widespread interpretation of the Mann-Whitney U-statistic as a measure of the difference between the medians of the two compared distributions is incorrect.

• The Mann-Whitney statistic is a measure of the stochastic dominance of one of two independent distributions over the other (not of the difference between their medians).

• When the shapes of both distributions are similar, the Mann-Whitney statistic becomes the basis for estimating the Hodges-Lehmann shift parameter.

85

86

170 159 115 103 101 94 93 92 86 81 71 70 59 58 42 38

104 -66 -55 -11 1 3 10 11 12 18 23 33 34 45 46 62 66

121 -49 -38 6 18 20 27 28 29 35 40 50 51 62 63 79 83

123 -47 -36 8 20 22 29 30 31 37 42 52 53 64 65 81 85

123 -47 -36 8 20 l999=22 29 30 31 37 42 52 53 64 65 81 85

127 -43 -32 12 24 26 33 34 35 41 46 56 57 68 69 85 89

130 -40 -29 15 27 29 36 37 38 44 49 59 60 71 72 88 92

132 -38 -27 17 29 l99=31 l95=38 39 40 46 51 61 62 73 u95=74 90 94

134 -36 -25 19 31 33 40 41 42 48 53 63 64 75 76 92 96

134 -36 -25 19 31 33 40 41 42 48 53 63 64 75 76 92 96

140 -30 -19 25 37 39 46 47 48 54 59 69 70 81 82 98 102

144 -26 -15 29 41 43 50 51 52 58 63 73 74 85 86 102 106

146 -24 -13 31 43 45 52 53 54 60 65 75 76 u999=87 88 104 108

147 -23 -12 32 44 46 53 54 55 61 66 76 77 88 89 105 109

149 -21 -10 34 46 48 55 HL=56 57 63 68 78 79 90 91 107 111

151 -19 -8 36 48 50 57 58 59 65 70 80 81 92 93 109 113

153 -17 -6 38 50 52 59 60 61 67 72 82 83 94 95 111 115

162 -8 3 47 59 61 68 69 70 76 81 91 92 103 104 120 124

168 -2 9 53 65 67 74 75 76 82 87 97 98 109 110 126 130

171 1 12 56 68 70 77 78 u99=79 85 90 100 101 112 113 129 133

173 3 14 58 70 72 79 80 81 87 92 102 103 114 115 131 135

Applying the nonparametric confidence interval for the shift parameter to the comparison of the induced production of IFN-α/β in the healthy group and the group with IUGR.

Program StatXact http://www.cytel.com/software-solutions/statxact

• The resulting nonparametric Hodges-Lehmann point and interval estimates of the shift parameter (95% CI lower limit, point estimate, upper limit) are:

• ΔHL = 38 56 74 IU/mL

• This 95% confidence interval doesn’t cover the indifferent value of the shift Δindiff = 0.

• So the unknown value of the shift Δunkn estimated with this interval differs statistically significantly from 0 at the significance level α = 0.05.

• Therefore the induced production IFN-α/β in IUGR group is statistically significantly higher than in healthy group.
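
The point estimates can be recomputed from the table of pairwise differences. The sketch below assumes that the header row of that table holds the healthy group and the row labels hold the IUGR group; the variable names are illustrative, and the confidence limits (which StatXact derives from the ordered differences) are not reproduced here.

```python
from statistics import median

# Assumption: header row of the table = healthy group, row labels = IUGR group.
healthy = [170, 159, 115, 103, 101, 94, 93, 92, 86, 81, 71, 70, 59, 58, 42, 38]
iugr = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]

# All pairwise differences (IUGR minus healthy), as in the body of the table.
differences = [y - x for y in iugr for x in healthy]

# Hodges-Lehmann shift estimate: the median of all pairwise differences.
hl_shift = median(differences)

# AUC (Mann-Whitney U divided by n1 * n2): share of pairs in which the
# IUGR value exceeds the healthy value, counting ties as one half.
wins = sum(d > 0 for d in differences)
ties = sum(d == 0 for d in differences)
auc = (wins + 0.5 * ties) / len(differences)
```

With these data the median of the 320 differences is 56 and the AUC is about 0.89, in agreement with the point estimates reported on the slides.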

87

Applying the parametric confidence interval for the mean difference to the comparison of the induced production of IFN-α/β in the healthy group and the group with IUGR.

Free Program ESCI JSMS.xls http://www.latrobe.edu.au/psy/esci/

• Parametric point and interval estimates of the difference of two means (95% CI lower limit, point estimate, upper limit) are:

• Δ = 33 52 71 IU/mL

• This 95% confidence interval doesn’t cover the indifferent value Δindiff = 0.

• So the unknown value of the difference Δunkn estimated with this interval differs statistically significantly from 0 at the significance level α = 0.05.


88

ES Δ = 33.1 52.1 71.0 IU/mL; dC = 1.87; Student t = 5.58

Visualization of the comparison of two means using the confidence interval for the mean difference. Free Program ESCI JSMS.xls

http://www.latrobe.edu.au/psy/esci/

• The presented 95% confidence interval (rose triangle and vertical segment) for the mean difference doesn’t cover the indifferent value Δindiff = 0.

• So the unknown value of the difference Δunkn estimated with this interval differs statistically significantly from 0 at the significance level α = 0.05.


89

Blue circles are observed values. Black dots and vertical segments are point and interval estimates of the unknown means. Rose triangle and vertical segment are estimates of their unknown difference.

Newcombe’s standardized effect size: δN or StAUC

• When σ1 = σ2 = σ, θ reduces to

• θ = Φ(δN/√2),

• where δN is the difference of the means expressed in units of the standard deviation σ.

• Here Φ is the common notation for the CDF (cumulative distribution function) of the standard Gaussian (normal) distribution.

• θ is preferable to δN because it depends less on distributional assumptions and is therefore more satisfactory than the standardized difference.
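
The relation θ = Φ(δN/√2) can be evaluated in both directions with the standard library. This is a minimal sketch under the equal-variance Gaussian assumption stated above; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian, mean 0 and standard deviation 1


def auc_from_delta(delta_n):
    """theta = Phi(delta_N / sqrt(2))."""
    return _STD_NORMAL.cdf(delta_n / sqrt(2))


def delta_from_auc(theta):
    """Inverse relation: delta_N = sqrt(2) * Phi^{-1}(theta), for 0 < theta < 1."""
    return sqrt(2) * _STD_NORMAL.inv_cdf(theta)
```

For instance, `delta_from_auc(0.55)` is about 0.18 and `auc_from_delta(1.0)` is about 0.76.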

90

Interrelationship between AUC and StAUC

AUC θ    StAUC δN    Size

0.5    0

0.55    0.18    XS extra-small

0.6    0.36

0.65    0.55    S small

0.7    0.74

0.75    0.95    M medium

0.8    1.2

0.85    1.5    L large

0.9    1.8

0.95    2.3    XL extra-large

0.99    3.3

0.999    4.4    XXL extra-extra-large

StAUC δN    AUC θ

0    0.50

0.25    0.57

0.5    0.64

0.75    0.70

1    0.76

1.25    0.81

1.5    0.86

1.75    0.89

2    0.92

2.5    0.96

3    0.98

3.5    0.993

4    0.998

91

Standardized Cohen’s effect size, StES dC

dC = (M1 – M2) / spooled
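
As an illustration, Cohen's dC and the Student t statistic can be computed from the healthy and IUGR values shown in the earlier table of pairwise differences. The sketch assumes the header row of that table is the healthy group and the row labels are the IUGR group; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Assumption: same group assignment as in the table of pairwise differences.
healthy = [170, 159, 115, 103, 101, 94, 93, 92, 86, 81, 71, 70, 59, 58, 42, 38]
iugr = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]

n1, n2 = len(iugr), len(healthy)
mean_diff = mean(iugr) - mean(healthy)          # absolute effect size, IU/mL

# Pooled variance: sample variances weighted by their degrees of freedom.
pooled_var = ((n1 - 1) * variance(iugr) + (n2 - 1) * variance(healthy)) / (n1 + n2 - 2)
s_pooled = sqrt(pooled_var)

d_c = mean_diff / s_pooled                              # Cohen's standardized effect size
t_stat = mean_diff / (s_pooled * sqrt(1 / n1 + 1 / n2))  # Student t with n1 + n2 - 2 df
```

These data give a mean difference of about 52.1 IU/mL, dC of about 1.87 and t of about 5.58 with df = 34, matching the values quoted on the slides.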

92

Standardized effect size (mean difference), StES dC; how it looks like

93

Verbal scale for the interpretation of the standardized Cohen’s effect size

Standardized Cohen’s effect size, dC

Interpretation

0 – 0.5    Negligibly small (worthless)

0.5 – 1.0    Small (weak)

1.0 – 1.5    Moderate

1.5 – 2.0    Large (strong)

2.0 – 3.0    Very large (very strong)

> 3.0    Extremely large
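
The verbal scale can be written as a small lookup. The sketch below assumes that each band includes its lower bound and excludes its upper bound, which the table itself does not specify; the function name is illustrative.

```python
def interpret_cohen_d(d_c):
    """Map the absolute value of Cohen's dC to the verbal scale of the table."""
    size = abs(d_c)
    # Assumption: bands are half-open, [lower, upper).
    scale = [
        (0.5, "Negligibly small (worthless)"),
        (1.0, "Small (weak)"),
        (1.5, "Moderate"),
        (2.0, "Large (strong)"),
        (3.0, "Very large (very strong)"),
    ]
    for upper, label in scale:
        if size < upper:
            return label
    return "Extremely large"
```

With this convention dC = 1.87 is read as "Large (strong)".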

94

Once more: Statistical significance and the Effect size

• An effect (difference, association, correlation, risk, benefit, etc.) can be statistically significant and yet its practical (e.g., clinical) importance may prove to be negligible.

• “Statistically significant” does not imply “substantial”, “practically important”, “valuable”.

• Effects can be real, nonrandom, but nonetheless, negligibly small.

95

Confidence interval for the Standardized Cohen’s Effect Size dC. Free Program LePrep

http://www.univ-rouen.fr/LMRS/Persopage/Lecoutre/PAC.htm

96

Results: point estimates and 95% confidence intervals for the three main effect sizes (lower limit, point estimate, upper limit)

• AUC – area under the ROC-curve:

• AUC = 0.72 0.89 0.96

• StAUC – Newcombe’s standardized AUC:

• StAUC = δN = 0.8 1.7 2.5

• StES – Cohen’s standardized difference of means:

• StES = dC = 1.1 1.9 2.7

• Verbal interpretation:

• with 95% confidence, the estimated unknown effect sizes can be interpreted as ranging from medium to very large (strong).

97

Statistical predictions and reproducibility

“Prediction is very difficult, especially about the future”

98

Repeat!

• Often it is believed that if the “statistically significant” result is obtained, this excludes the need of repeating the experiment.

• Testing the significance of statistical hypotheses is a method to detect rare events which deserve further investigation.

• Fisher

99

Cumming G. The New Statistics: Why and How. Psychological Science, 2014; 25(1): 7 –29.

• Three problems are central:

• Published research is a biased selection of all research;

• data analysis and reporting are often selective and biased; and

• in many research fields, studies are rarely replicated, so false conclusions persist.

100

Replication

• A single study is rarely, if ever, definitive; additional related evidence is required.

• Such evidence may come from a close replication, which, with meta-analysis, should give more reliable estimates than the original study.

• A more general replication may increase reliability and also provide evidence of generality or robustness of the original finding.

• We need increased recognition of the value of both close and more general replications, and greater opportunities to report them.

101

Reproducibility and predictive ability of P-values and confidence intervals (n = 32). CI dance.

Free program “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.

102















• Thus, it is risky to reach a definite conclusion from a single experiment only.

• Any scientific investigation should be repeated many times.

• And the reproducibility of the results must be studied.

103

Gigerenzer G. We need statistical thinking, not rituals. Behavioral and Brain Sciences, 1998; 21(2): 199-200

• A researcher cannot be unconcerned about:

• “what would happen if additional subjects were to be included into the experiment?”,

• “what would be the conclusion for the data of these future subjects?”,

• “what would be the conclusion for the whole data?”, or

• “what would happen if this experiment were to be repeated?”

• Asking and answering such questions goes beyond the ritualized statistical procedures, and is likely to influence the way the authors of scientific papers interpret experimental findings and conduct their experiments.

• Prediction probabilities are an unavoidable part of statistical thinking and the time is come to take them seriously.

104

Prediction and confidence intervals. Program Instat+ http://www.reading.ac.uk/ssc/n/n_instat.htm

105

Reproducibility of the absolute effect size ES for the healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95

106

95% confidence interval for ES Δ is from 33 to 71 IU/mL; 95% prediction interval for it is wider: from 25 to 78 IU/mL.

10-fold increasing sample size

107

If we repeat the experiment 10 times independently, the prediction interval becomes narrower and closer to the confidence interval.

Prediction interval versus confidence interval

• Note that under 10-fold repetition of the experiment the 95% prediction interval comes closer to the observed 95% confidence interval.

• This demonstrates the meaning of the confidence interval as the one which covers the estimated effect size under many (in the limit, infinitely many) repetitions of the experiment.
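
A rough sketch of why the prediction interval is wider than the confidence interval, and why it shrinks toward it as the replication grows. It rests on an assumption that is not taken from the slides: the replication estimate varies around the same true value with a variance 1/k times that of the original estimate, so the half-width is inflated by √(1 + 1/k). The programs cited on the slides may compute their intervals differently.

```python
from math import sqrt


def prediction_interval(ci_lower, ci_upper, k=1.0):
    """Approximate prediction interval for the estimate in a replication.

    k is the ratio of the replication sample size to the original one
    (an illustrative parameter, not a quantity named on the slides).
    """
    point = (ci_lower + ci_upper) / 2
    half_width = (ci_upper - ci_lower) / 2
    # Original and replication errors add up: variance * (1 + 1/k).
    pi_half_width = half_width * sqrt(1 + 1 / k)
    return point - pi_half_width, point + pi_half_width


same_size = prediction_interval(33.1, 71.0)        # replication of equal size
tenfold = prediction_interval(33.1, 71.0, k=10)    # 10-fold larger replication
```

Under this assumption the 95% CI from 33.1 to 71.0 IU/mL yields a prediction interval of roughly 25 to 79 IU/mL for an equal-sized replication, and a noticeably narrower one for k = 10.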

108

Reproducibility of the standardized Cohen’s effect size dC for the healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95

109

95% confidence interval for StES dC is from 1.1 to 2.7; 95% prediction interval for it is wider: from 0.8 to 3.1.

10-fold increasing sample size

110

If we repeat the experiment 10 times independently, the prediction interval becomes narrower and closer to the confidence interval.

Prediction probabilities, Prep, Psrep and Preprep

111

Probability of a same-sign effect is Prep = 1.0; of a same-sign and significant at α = 0.05 is Psrep = 0.99 and of a same-sign effect with Prep = 0.99 is Preprep = 0.98.

Reproducibility of the P-value when comparing healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95

112

Observed Pval = 3∙10⁻⁶. The 95% prediction interval for it ranges from the extremely small 3∙10⁻¹¹ to the moderate 0.01.

Probabilities of replication and prediction intervals

• Thus, it is predicted that when our experiment is repeated, the probability of obtaining the same sign for the mean difference (expressed as the absolute effect size ES as well as Cohen’s standardized effect size dC) will be

• Prep = 1.00.

• And the probability of obtaining a difference of the same sign that is statistically significant at the level α = 0.05 will be

• Psrep = 0.99.

• Moreover, it is predicted that in a future repetition of the experiment the P-value could lie in a very wide 95% prediction interval, from very low to moderate:

• Pval = 3∙10⁻¹¹ to Pval = 0.01.

113

Main statistical tools and their destination

• Bayes Factor (BF) → comparing statistical models and/or hypotheses

• P-value → statistical hypothesis testing

• Effect Size (ES) → practical (clinical) importance

• Confidence intervals (CI) → visualization of both, the estimates and the hypotheses testing

• Prediction Intervals (PI) → prediction of future repetitions

114

Bayes theorem in action: connecting prior and posterior

probabilities

115

Reverend Thomas Bayes (c. 1702 – April 17, 1761)

116

117

Bayes Factor

• The Bayes factor differs fundamentally from the P-value (Pval).

• The Bayes factor is not a probability in itself but a ratio of probabilities, and it can vary from zero to infinity:

• BF01 = P(Dobs|H0) / P(Dobs|H1)

• BF10 = P(Dobs|H1) / P(Dobs|H0)

• This means that the Bayes factor provides not only a test of the significance of the null hypothesis but also a comparison of the probabilities of obtaining the observed data under both hypotheses.

• However, for this we should have a better idea of the alternative hypothesis.

Amazing property of Bayes factor in terms of “odds”

118

What are the odds?

• The odds (in favor) of an event A is the ratio of the probability that the event will happen P(A) to the probability that the event will not happen P(Ā):

• O(A) = P(A) : P(Ā) = P(A) : [1 – P(A)]

• Conversely, the odds against an event A is the opposite ratio.

• Such a representation of the probability is familiar to geneticists.

• Famous Mendel’s ratio of 3 : 1 is a representation of the probabilities 3/4 and 1/4 in terms of odds.

119

Bayes factor BF in terms of odds

• The Bayes factor not only shows how many times the probability P(Dobs|H0) differs from the probability P(Dobs|H1).

• It also shows how many times the posterior odds in favor of one hypothesis against the other (alternative) differ from their prior odds.

• Conversely,

• BF01 = 1/BF10

• Thus, we observe an amazing property of Bayes factor:

• without knowing prior and posterior probabilities of both hypotheses, we can quantitatively compare their odds.
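
The odds form of Bayes' theorem behind this property can be sketched in a few lines. The function names and the example prior are illustrative assumptions.

```python
def odds(p):
    """Odds in favor of an event with probability p: P(A) : [1 - P(A)]."""
    return p / (1 - p)


def probability(o):
    """Inverse of odds(): probability corresponding to odds o."""
    return o / (1 + o)


def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds = Bayes factor * prior odds (same direction for both)."""
    return bayes_factor * prior_odds


# Mendel's 3 : 1 ratio corresponds to the probability 3/4.
mendel_odds = odds(0.75)

# Illustrative assumption: equal prior probabilities of H1 and H0 (prior odds 1).
post_h1 = probability(posterior_odds(prior_odds=1.0, bayes_factor=3.0))
```

Whatever the prior odds are, the Bayes factor is the multiplier that converts them into posterior odds; only the conversion to a posterior probability requires choosing a prior.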

120

• BF01 = P(Dobs|H0) / P(Dobs|H1) = [P(H0|Dobs) / P(H1|Dobs)] : [P(H0) / P(H1)]

Interpretation of credibility of Bayes factors BF10 and BF01

121

BF01: evidence in favor of hypothesis H0 against hypothesis H1

BF10: evidence in favor of hypothesis H1 against hypothesis H0

Bayes factor    Evidence

> 10 000    Convincing

100 – 1 000    Very strong

30 – 100    Strong

10 – 30    Moderate

3 – 10    Weak

1 – 3    Negligible

John Arbuthnot 29.04.1667 – 27.02.1735

122

Number of Christened in London during 82 years

Year Boys Girls Year Boys Girls

1629 5218 > 4683 1650 2890 > 2722

1630 4858 > 4457 3231 > 2840

4422 > 4102 3220 > 2908

4994 > 4590 3196 > 2959

5158 > 4839 3441 > 3179

5035 > 4820 3655 > 3349

5106 > 4928 3668 > 3382

4917 > 4605 3396 > 3289

4703 > 4457 3157 > 3013

5359 > 4952 3209 > 2781

5366 > 4784 1660 3724 > 3247

1640 5518 > 5332 4748 > 4107

5470 > 5200 5216 > 4803

5460 > 4910 5411 > 4881

4793 > 4617 6041 > 5881

4107 > 3997 5114 > 4858

4047 > 3919 4678 > 4319

3768 > 3395 5616 > 5322

3796 > 3536 6073 > 5560

3363 > 3181 1669 6506 > 5829

1649 3079 > 2746

Year Boys Girls Year Boys Girls

1670 6278 > 5719 1691 7662 > 7392

6449 > 6061 7602 > 7316

6443 > 6120 7676 > 7483

6073 > 5822 6985 > 6647

6113 > 5738 7263 > 6713

6058 > 5717 7632 > 7229

6552 > 5847 8062 > 7767

6423 > 6203 8426 > 7626

6568 > 6033 7911 > 7452

6247 > 6041 1700 7578 > 7061

1680 6548 > 6299 8102 > 7514

6822 > 6533 8031 > 7656

6909 > 6744 7765 > 7683

7577 > 7158 6113 > 5738

7575 > 7127 8366 > 7779

7484 > 7246 7952 > 7417

7575 > 7119 8379 > 7687

7737 > 7214 8239 > 7623

7487 > 7101 7840 > 7380

7604 > 7167 1710 7640 > 7288

1690 7909 > 7302

• Total 484 382 > 454 041

• Total sum 938 423

123

Comparison of the frequentist and Bayesian results

• Testing homogeneity (independence) of the Arbuthnot data results in:

• Pval ≈ 10⁻⁸

• BF01 = 8∙10¹¹⁷

• From the frequentist point of view the heterogeneity of Arbuthnot data is statistically highly significant.

• From the Bayesian point of view the conclusion is diametrically opposite:

• To obtain such data is 8∙10¹¹⁷ times more likely under the hypothesis H0 of their homogeneity than under the alternative hypothesis H1 of their heterogeneity.

• Or:

• The posterior odds in favor of the null hypothesis against the alternative hypothesis are 8∙10¹¹⁷ times higher than their prior odds.

124

Bayes Factor, online program Bayes Factor Calculators http://pcl.missouri.edu/bayesfactor

125

Output

• BF01 = 0.00018 and

• BF10 = 1/BF01 ≈ 5555.6

• It is more than 5555 times more likely to obtain the value of the Student t-test statistic t = 5.58 with df = 34 under H1: Δ ≠ 0 than under H0: Δ = 0.

• According to the verbal scale, such a value of BF10 is interpreted as convincing evidence in favor of H1 against H0.

126

Summary

Statistical evidences

• AUC θ = 0.72 0.89 0.96

• StAUC δN = 0.8 1.7 2.5

• StES dC = 1.1 1.9 2.7

• ΔHL = 38 56 74 IU/mL

• Δ = 33 52 71 IU/mL

• BF10 = 5555

• Pval = 3∙10⁻⁶

Statistical predictions

• 95% prediction intervals:

• StES dC: from 0.8 to 3.1

• ES Δ: from 25 to 79 IU/mL

• Pval: from 3∙10⁻¹¹ to 0.010

• Probability of replication:

• Psrep = 0.99

127

Example 2

TGT – Thrombin Generation Test

128

Castoldi E., Rosing J. Thrombin generation tests. Thrombosis Research, 2011; 127(Suppl. 3): S21–S25

• Parameters of the thrombin generation curve:

• LT – lag time, min

• TTP – time to peak, min

• PT – peak thrombin, nM

• ETP – endogenous thrombin potential, nM∙min

• V – maximum velocity of thrombin generation, V = PT / (TTP – LT), nM/min

129

Estimation of parameters of TGT, results of traditional NHST and effect sizes. n1 = 40, n2 = 53

Parameter    LT, min    ETP, nM∙min    TTP, min    PT, nM    V, nM/min

RI    8.0 – 27.4    1290 – 2480    17 – 41    85 – 192    5.3 – 25.4

M1    14 16 17    1820 1900 1990    25 27 28    125 134 144    11 13 15

M2    15 17 19    1640 1740 1830    29 31 33    100 106 113    7.1 7.9 8.7

Pval    0.37    0.015    0.0012    3∙10⁻⁶    10⁻⁸

Effect sizes

ΔHL    -3.3 -1.0 1.2    52 188 323    -7.3 -4.6 -1.8    14 28 40    3.3 4.6 6.0

ES Δ    -3.4 -1.3 0.7    43 167 294    -7.1 -4.5 -2.1    17 28 39    3.4 5.1 6.7

AUC θ    0.44 0.55 0.67    0.55 0.67 0.77    0.68 0.70 0.79    0.66 0.77 0.85    0.73 0.83 0.90

StAUC δN    -0.61 -0.20 0.22    0.19 0.63 1.04    -1.13 -0.72 -0.28    0.53 1.06 1.48    0.89 1.36 1.80

StES dC    -0.66 -0.25 0.16    0.10 0.52 0.94    -1.15 -0.73 -0.30    0.65 1.09 1.53    0.89 1.35 1.80

Each triple gives the lower confidence limit, the point estimate, and the upper confidence limit.

n1 and n2 – sample sizes of the control and CAD groups; RI – nonparametric reference interval; М1 and М2 – sample means; Pval – P-value; ΔHL – Hodges-Lehmann shift estimate; Δ = М1 – М2 – effect size in real units; θ – area under ROC-curve; δN and dC – Newcombe’s and Cohen’s standardized effect sizes. Programs: Reference Value Advisor, PAST, StatXact, GENERALIZED.xls, ESCI-JSMS.xls, LePrep.

130

Informativeness of TGT parameters: 53 CHD patients and 40 people without clinical manifestations of coronary heart disease (data by Berezovskaya G.A.)

131

dC – standardized Cohen’s effect size, Pval – Р-value, BF10 – Bayes factor for comparison of odds in favor of H1 versus H0, Psrep – probability of statistically significant effect of the same sign (direction) in a replication, Power – “achieved” power, n1 = n2 – minimum sample sizes for replication. Programs: ESCI-JSMS.xls, Online BF Calculator (http://pcl.missouri.edu/bayesfactor), LePrep, G*Power

Syndrome of statistical leniency and credulity

Fallacies and Confusions of Null Hypothesis Significance Testing (NHST) and P-value

“What does a statistician call it when the heads of 10 rats are cut off and 1 survives?

- Nonsignificant.”

132

P-value

• P-value is the most controversial concept in statistics.

• Many textbook authors and the majority of experimenters do not understand what its final product – a P-value – actually means (Gigerenzer, 1988).

• The concept of a P-value lies so far from the intuitive understanding that no ordinary person can hold it in memory.

• ‘‘We rely too much on P values, and most of us really don’t have a clue what they mean.’’

• Lai J., Fidler F., Cumming G. Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 2012; 8: 51-62.

133

What is P-value? What is null hypothesis H0?

• A P-value is the probability of observing data as extreme as or more extreme than the actual outcome when the null hypothesis is true.

• When testing null hypothesis we transform data into a test statistic.

• Then the P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

• Usually the null hypothesis is a statement of 'no effect' or 'no difference'.

• The Null Hypothesis is often denoted H0 (read “H-nought”)

134

Null Hypothesis Significance Testing Waltz

• The P value is at the heart of the most common approach to data analysis – Null Hypothesis Significance Testing (NHST).

• Think of NHST as a waltz with three steps:

• (i) State a null hypothesis: that is, there is no effect.

• (ii) Calculate the p value, which is the probability of getting results like ours and more extreme – if the null hypothesis is true.

• (iii) If Pval is sufficiently small, reject the null hypothesis and sound the trumpets:

• our effect is not zero, it's statistically significant!

• Generations of students have been inducted into the rituals of .05 meaning "significant", and .01 "highly significant".

135

Р-value, Рval

• Thus, by definition, the P-value (Pval) is the conditional probability of obtaining the observed value of the difference (dobs) or any larger, that is less probable, value (D ≥ dobs), given that the null hypothesis is true:

• Pval = P(D ≥ dobs|H0).

• In terms of statistical hypothesis testing, the P-value is:

• the probability of obtaining the absolute value |tobs| of the test statistic T or any larger, less probable value (i.e., values deviating even more from the expected one)

• under the assumption that the null hypothesis H0 is true:

• Pval = P(|T| ≥ |tobs| | H0).

• Note that the “less probable values” are not observed.

• We infer them out of all possible values in the frame of the chosen (null) model.

136

• A P-value is usually interpreted as a measure of how much evidence we have against the null hypothesis, that is, of how strongly the observed data contradict the null hypothesis.

• The null hypothesis, traditionally represented by the symbol H0, represents the hypothesis of no change or no effect.

• The smaller the P-value, the more (stronger) evidence we have against H0.

137

What is Test Statistic?

• A test statistic is a statistic used for testing the given null hypothesis.

• Example: Student t-test statistic:

• In this case, testing the null hypothesis H0 of the equality of two independent means (H0: M1 – M2 = 0) reduces to testing the null hypothesis that t = 0.

• When this hypothesis is true, the distribution of the t-statistic is known.

• Namely, it is the Student t-distribution.

• This distribution has a single parameter called degrees of freedom, df.

• t̃ = (M̃1 – M̃2) / s̃M1–M2,  df = n1 + n2 – 2

138

William Sealy Gosset (June 13, 1876–October 16, 1937) is famous as a statistician, best known by his pen name Student and for his work

on Student's t-distribution.

139

n1 = 5, n2 = 7, df = 10, t = 1.5, P = 0.16 – the difference is statistically nonsignificant

140

http://ftparmy.com/103097-decision-visualizer.html

n1 = 5, n2 = 7, df = 10, t = 3.0, P = 0.013 – the difference is statistically significant at the significance level α = 0.05, but not at 0.01
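
The two-sided P-values quoted for these two examples can be reproduced by integrating the Student t density numerically. This is a minimal sketch using Simpson's rule from the standard library only; the function names and the number of integration steps are illustrative choices, and a statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_obs, df, steps=2000):
    """Pval = P(|T| >= |t_obs| | H0), via Simpson's rule on [0, |t_obs|]."""
    b = abs(t_obs)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3       # P(0 <= T <= |t_obs|)
    return 1 - 2 * central_area        # both tails beyond |t_obs|
```

With df = 10, `two_sided_p(1.5, 10)` is about 0.16 and `two_sided_p(3.0, 10)` is about 0.013, as on the two slides.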

141

Searching the threshold for the P-value: is it possible?

• When a small P-value is observed, there is an intuitive (extrastatistical) temptation to reject the null hypothesis H0.

• However, there is no statistical reason for regarding any particular P-value as sufficiently small to reject H0 safely.

• Once again, such a decision is extrastatistical.

• In practice, the decision to reject or accept H0 must depend on circumstances.

• In each specific situation the researcher should make the choice himself or herself.

142

143

Traditional interpretation of the P-values (Pval)

(and their Michelin star scale)

143

P-value (Pval)    Statistical significance    Michelin stars

> 0.05    Nonsignificant

0.05 – 0.01    Moderately significant    *

0.01 – 0.001    Significant    **

0.001 – 0.0001    Highly significant    ***

< 0.0001    Extremely significant    ****

Four stars value 0,0001 was introduced recently by Harvey J. Motulsky: http://www.graphpad.com/guides/prism/6/statistics/index.htm?interpreting_a_small_p_value_from_an_unpaired_t_test.htm

Tyranny and/or hypnosis of the figures 0.05 and 95%

• Unfortunately, as a threshold the significance level α = 0.05 is most commonly used.

• Too often, crossing this threshold (Pval < 0.05) in just a single experiment is regarded as sufficient for the decision to reject the null hypothesis and to conclude that the observed effect is statistically significant.

144

Andrey Nikolaevich Kolmogorov (25 April 1903 – 20 October 1987)

• In statistics, the recommended significance level varies from 0.05 for preliminary orientation experiments to 0.001 for important ultimate conclusions, but the attainable reliability of probability conclusions is often much higher.

• Thus, the principal conclusions of statistical physics are based on the neglect of probabilities of an order less than 10⁻¹⁰.

• (1951)

145

http://www.encyclopediaofmath.org/index.php/Probability

Sterne J.A.C., Davey Smith G. Sifting the evidence –

what’s wrong with significance tests? BMJ, 2001; 322: 227-231. Cited by 763

• Presently, several other authors echo Kolmogorov:

• A P-value close to 0.05 is not strong evidence against the null hypothesis.

• Only Pval < 0.001 should be regarded as strong evidence against H0.

• In addition to P-values it is strongly recommended to present confidence intervals for the effect size.

146

“Flexible” P-values

• In fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses;

• he rather gives his mind to each particular case in the light of his evidence and his ideas.

•

• Fisher R. A. Statistical Methods and Scientific Inference, 1956, pages 41-42.

147

Sir Ronald Aylmer Fisher 17 Feb 1890 - 29 July 1962

148

Warning

• Usually the P-value is interpreted as a measure of the evidence given by the available data against the null hypothesis.

• Strictly speaking, however, it is not a measure in the mathematical sense.

• It does not possess the additivity property, and moreover,

• it does not satisfy two important principles of statistical theory: the Likelihood Principle and the P-postulate.

149

Likelihood Principle

• Put in words, the Likelihood Principle states that statistical analysis must operate with those, and only those, data which were actually obtained in the experiment.

• However, the calculation of the P-value (as follows from its definition) uses not only the observed experimental data but also all other, less probable, data which were not in fact observed.

150

Р-postulate

• To serve as real and adequate measure of the statistical evidence, Р-value should satisfy the simple rule (postulate) according to which the same Р-values have to present equal evidences against the null hypothesis.

• This rule is called «Р-postulate».

• Obviously, this minimal requirement is not met.

•

• Wagenmakers E.-J. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 2007; 14(5): 779-804.

151

Р-postulate

• Intuitively one can recognize that Рval = 0.01 in the experiment with 10 observations will not demonstrate the same evidential strength as Рval = 0.01 in the experiment with 300 observations.

• Equally, Рval = 0.001, obtained in one experiment and Рval = 0.01 in another does not imply that the effect observed in the first experiment is 10 times more evidential than in the second.

152

P-value is the realization of corresponding random variable P*

• P-value is an observed value of the corresponding random variable

• P*

• When the null hypothesis H0 is true, P* has the so-called (continuous) standard uniform distribution, that is, the uniform distribution on the interval [0; 1]:

• P* ~ Uni[0; 1].
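
The uniformity of P* under a true null hypothesis can be illustrated by simulation. The sketch below uses a two-sided z-test of a zero mean with known unit variance, chosen only because it needs nothing beyond the standard library; the function name and parameter values are illustrative.

```python
import random
from statistics import NormalDist, mean


def simulate_p_values(reps=5000, n=20, seed=2):
    """Two-sided z-test P-values for samples drawn under a true H0 (mean 0, sd 1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(reps):
        # z statistic: sample mean divided by its standard error 1/sqrt(n).
        z = mean(rng.gauss(0.0, 1.0) for _ in range(n)) * n ** 0.5
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


p_values = simulate_p_values()
share_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_050 = sum(p < 0.50 for p in p_values) / len(p_values)
```

For a uniform distribution the share of P-values below any threshold is close to the threshold itself, so about 5% fall below 0.05 and about half fall below 0.5.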

153

P-value distributions Pike N. free spreadsheet: FDR.xls http://www.webcitation.org/5rxSzU7qL

Δ = μ1 – μ2 = 0; χ2 = 390.6; df = 400; Pval = 0.62

Δ = μ1 – μ2 = 10; χ2 = 1348.8; df = 400; Pval = 4∙10⁻¹⁰¹

154

[Figure: two histograms titled “Frequency distribution of p-values”. Horizontal axis: p-value defining upper limit of range (0.05 to 1.00). Vertical axis: frequency of values in range. Series: observed frequency and expected frequency.]

These are histograms obtained with 200 simulations.

Reproducibility and predictive ability of P-values and 95% confidence intervals (n = 32). Dance of Pval

Free program “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.

155

Reproducibility and predictive ability of P-values and 95% confidence intervals (n = 32). Dance of Pval

Free spreadsheet “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.

156

Reproducibility of the P-value when comparing healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95

157

Observed Pval = 3∙10⁻⁶. The 95% prediction interval for it ranges from the extremely small 3∙10⁻¹¹ to the moderate 0.01.

Popular temptation

• It is conventional to interpret the quintessence of traditional (frequentist) conclusions from the statistical hypotheses testing as:

• The smaller the P-value, the stronger the evidence (presented by the data) against the null hypothesis H0 and the greater the reason to doubt H0.

• Hence, whether intentionally or not (and seemingly rather naturally), the temptation arises to interpret the P-value as the probability of the null hypothesis.

158

Popular delusion

• P-value is not a probability of the null hypothesis!

• P-value is calculated under the assumption that the null hypothesis H0 is true:

• Pval = P(|D| ≥ |dobs| | H0).

• Hence, P-value cannot be a probability of the null hypothesis:

• P{D|H0} ≠ P{H0|D}

• Collection of other fallacies about P-value see, e.g.:

• http://en.wikipedia.org/wiki/P-value

• Goodman S. A dirty dozen: Twelve P-value misconceptions. Semin. Hematol., 2008; 45: 135-140

159

Calibration of P-values

• Vovk V. G. A logic of probability, with application to the foundations of statistics. Journal of the Royal Statistical Society. Series B (Methodological), 1993; 55(2): 317-351.

• Sellke T., Bayarri M.J., Berger J.O. Calibration of p values for testing precise null hypotheses. The American Statistician, 2001; 55(1): 62-71. Cited by 321

• When Pval < 1/e:

• BF01 = –e ∙ Pval ∙ ln(Pval)

• P(H0|D) ≥ BF01 / (1 + BF01) – lower bound for the probability of the null hypothesis H0

160

161

The “price” of P-values

Observed P-value | Upper limit of 80% interval for Pval | Lower limit for the probability of the null hypothesis, P(H0) | Upper limit for the probability of replication, Prepr
0.05 | 0.44 | ≥ 29% | < 50%
0.01 | 0.22 | ≥ 11% | < 73%
0.001 | 0.07 | ≥ 1.8% | < 90%

Sellke T., Bayarri M.J., Berger J.O. Calibration of p values for testing precise null hypotheses. The American Statistician, 2001; 55(1): 62-71.
Goodman S.N. A comment on replication, p-values and evidence. Statistics in Medicine, 1992; 11: 875-879.
Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 2008; 3(4): 286-300.
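The three derived columns of the table can be recomputed from the observed P-value alone. The sketch below is a reading of the cited papers, not code from the slides, and rests on these assumptions: a two-sided z-test; the Sellke-Bayarri-Berger bound BF01 ≥ −e·Pval·ln(Pval) with equal prior probabilities of H0 and H1; the upper end of Cumming's 80% p interval taken as the one-tailed P-value of a replication; and Goodman's replication probability taken as Φ(z_obs − z_0.975), that is, the true effect set equal to the observed one. All function names are hypothetical.

```python
from math import e, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_interval_upper(p_obs, level=0.80):
    """Upper end of the p interval for a replication (one-tailed P-value).

    The replication z statistic is modelled as N(z_obs, sqrt(2)).
    """
    z_obs = STD_NORMAL.inv_cdf(1.0 - p_obs / 2.0)
    margin = STD_NORMAL.inv_cdf(0.5 + level / 2.0) * sqrt(2.0)
    return 1.0 - STD_NORMAL.cdf(z_obs - margin)


def null_probability_lower_bound(p_obs, prior_null=0.5):
    """Lower bound on P(H0 | D) from the calibration -e * p * ln(p)."""
    if not 0.0 < p_obs < 1.0 / e:
        raise ValueError("the calibration requires 0 < Pval < 1/e")
    bf01 = -e * p_obs * log(p_obs)                    # lower bound on BF01
    posterior_odds = bf01 * prior_null / (1.0 - prior_null)
    return posterior_odds / (1.0 + posterior_odds)


def replication_probability(p_obs, alpha=0.05):
    """Chance that a same-size replication gives P < alpha (same direction),
    assuming the true effect equals the observed one."""
    z_obs = STD_NORMAL.inv_cdf(1.0 - p_obs / 2.0)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(z_obs - z_crit)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(p,
              round(p_interval_upper(p), 2),
              round(null_probability_lower_bound(p), 3),
              round(replication_probability(p), 3))
```

Under these assumptions the values agree with the table to within rounding (the last entry comes out near 0.91 rather than 0.90).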

161

The problem with p values: how significant are they, really? November 12th, 2013 Geoff Cumming

http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html

A p value of 0.05 has been the default ‘significance’ threshold for nearly 90 years … but is that standard too weak? (Image credit: Martin_Heigan)

162

Funny metaphor

• “Perhaps p values are like mosquitos.

• They have an evolutionary niche somewhere and no amount of scratching, swatting, or spraying will dislodge them”.

• Campbell J.P. Editorial: Some remarks from the outgoing editor. Journal of Applied Psychology, 1982; 67: 691-700

163

• The usefulness of P-values is quite limited, and we continue to suggest that these procedures be euthanized.

• Anderson D.R., Burnham K.P. Avoiding pitfalls when using information-theoretic methods. The Journal of Wildlife Management, 2002; 66(3): 912-918.

164

On seduction

• Yes, the P-value can seduce.

• It is sexy and we can be blinded.

• A significant P-value can cloud our thinking: we simply get too excited and forget to look at the actual effect size.

• Does P < 0.05 really matter when the effect size is small?

• The study which concluded that the “internet is changing the dynamics and outcomes of marriage itself” is an example.

• This study showed that those who meet their spouses online are less likely to divorce and more likely to have high marital satisfaction (with highly significant P-values, of course).

• However, the effect size was very small: happiness, for example, barely moved from 5.48 to 5.64.

• So, do not sign up for match.com thinking that you may be happier with your spouse.

165

http://www.pnas.org/content/110/25/10135.full


Meaning of the P-value: Publish or Perish

166

Pee-value (http://wmbriggs.com/blog/?p=9338)

167

Statistics is the only field in which men boast of their wee p-values

• Revised standards for statistical evidence

• Valen E. Johnson

• PNAS, 2013; 110(48): 19313–19317

• Supporting Information:

• Johnson 10.1073/pnas.1313476110

168

Evidence thresholds γ and size of corresponding significance tests α

169

Revised standards for statistical evidence

• A simple strategy for improving the replicability of scientific research includes the following steps:

• (i) Associate statistically significant test results with P values that are less than 0.005.

• (ii) Associate highly significant test results with P values that are less than 0.001 (cf. Kolmogorov) and even 0.0001.

• (iii) Report the Bayes factor in favor of the alternative hypothesis and the default alternative hypothesis that was tested.

170

Revised standards for statistical evidence

• (iv) BF10 > 30 or even > 100 should be considered as strong and convincing evidence in favor of alternative hypothesis H1.

• The proposed modifications of common standards of evidence are intended to reduce the rate of nonreproducibility of scientific results by a factor of 5 or greater.

• Naturally, larger sample sizes are then required.

171

Minimum sizes of two independent samples with non-overlapping values required to achieve given lower confidence limits for two measures of the effect size: AUCL and SESL

Lower confidence limits for the effect size | Confidence levels
AUCL | StAUCL | 0.95 | 0.99 | 0.999
0.80 | 1.2 | 10 | 17 | 27
0.90 | 1.8 | 21 | 35 | 56
0.95 | 2.3 | 40 | 69 | 111
0.99 | 3.3 | 194 | 334 | 545
0.999 | 4.4 | 1923 | 3320 | 5418

Extrapolated using Newcombe’s free spreadsheet VISUALISETHETA.xls http://medicine.cf.ac.uk/primary-care-public-health/resources/
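The second column of the table appears to follow from the first through the usual equal-variance binormal relation AUC = Φ(d / √2), where d is the standardized difference of means. This is an inference from the numbers, not something the slide states, and the function names below are hypothetical. The sample sizes themselves come from Newcombe's spreadsheet and are not recomputed here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def auc_to_standardized_effect(auc):
    """Standardized mean difference d implied by an AUC (binormal, equal SD)."""
    if not 0.0 < auc < 1.0:
        raise ValueError("AUC must lie strictly between 0 and 1")
    return sqrt(2.0) * STD_NORMAL.inv_cdf(auc)


def standardized_effect_to_auc(d):
    """AUC implied by a standardized mean difference d (binormal, equal SD)."""
    return STD_NORMAL.cdf(d / sqrt(2.0))


if __name__ == "__main__":
    for auc in (0.80, 0.90, 0.95, 0.99, 0.999):
        print(auc, round(auc_to_standardized_effect(auc), 1))
```

The conversion gives 1.2, 1.8, 2.3, 3.3 and 4.4 for the five AUC limits, matching the tabulated column.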

172

John Wilder Tukey (June 16, 1915 – July 26, 2000)

• Any research should be at least two-staged.

• First stage – exploratory (preliminary, pilot, hypotheses generating) study.

• Second stage – confirmatory study.

• The second stage is designed on the basis of the results obtained at the first stage.

173

Conclusions

• Poor reproducibility of experimental results has become a systemic problem in biomedicine.

• One of the main reasons for this is inadequate statistical analysis.

• Statistical analysis should be comprehensive, harmonizing statistical evidence and predictions as well as frequentist and Bayesian approaches.

• Null hypothesis significance testing (NHST) with reported P-values is not sufficient on its own.

174

Conclusions (continued)

• Statistical significance doesn’t mean clinical importance.

• Effect size with confidence and prediction intervals should be reported.

• Experiments and/or observations should be repeated many times and their agreement should be investigated.

• The best way is to repeat the experiments independently in different laboratories (in different countries).

175

Editorial policy

• Journal editors and reviewers should not accept a paper for publication if it reports the results of a single experiment with no independent replication.

• Experts in statistics should be included on editorial boards.

• Reviewers should be obliged to re-examine all the calculations.

• For this reason, free access to the initial (“raw”) data should be ensured.

• Transparency and openness are cornerstones of the scientific method.

176

Francis Galton, 1901

• “I have begun to think that no one ought to publish biometric results, without lodging a well-arranged and well-bound copy of his data in some place where it should be accessible, under reasonable restrictions, to those who desire to verify his work.”

• Galton F. Biometry. Biometrika, 1901; 1(1): 7-10.

• Galton’s suggestion of a data store was later revived by Professor Julian Huxley, with a proposal to store measurements in the British Museum of Natural History.

177

• One of the most common temptations, and the one that leads to the greatest disasters, is the temptation of the words: "Everybody does it."

• Leo Tolstoy

178

Books on Bayesian biostatistics

179

180

Lesaffre E., Lawson A. Bayesian Biostatistics. 2012. Wiley. 534 p.

Broemeling L.D. Bayesian Biostatistics and Diagnostic Medicine. 2007. CRC Press, 216 p.

181

Kruschke J. Doing Bayesian Data Analysis. 2010. Academic Press, 672 p.

182

Downey A.B. Think Bayes: Bayesian Statistics Made Simple. Version 1.0.1, 2012. Green Tea Press: Needham, Massachusetts, 195 p.

Albert J. Bayesian Computation with R. Series: Use R! 2nd ed. 2009, Springer, 299 p.

Free Software

• Educational:

• SUStats http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html

• WinStats http://math.exeter.edu/rparris/winstats.html

• SOCR http://www.socr.ucla.edu/

• Research: R http://cran.r-project.org/

• PAST http://folk.uio.no/ohammer/past/

• Instat+ http://www.reading.ac.uk/ssc/n/software/instat/337/Instat+_v3.37.msi

• Online Bayes Factor Calculator http://pcl.missouri.edu/bayesfactor

• LePAC and LePrep http://www.univ-rouen.fr/LMRS/Persopage/Lecoutre/PAC.htm

• G*Power http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/

• Reference Value Advisor http://www.biostat.envt.fr/spip/spip.php?article63

• Newcombe’s spreadsheets http://medicine.cf.ac.uk/primary-care-public-health/resources/

• Cumming’s spreadsheets ESCI http://www.latrobe.edu.au/psy/esci/

• Harold Kaplan statistical pages http://printmacroj.com/statistics.htm


183

Commercial Software • StatXact http://www.cytel.com/software-solutions/statxact

• XLStat http://www.xlstat.com

• MedCalc https://www.medcalc.org/

• GraphPad Prism http://www.graphpad.com/

• StatsDirect http://www.statsdirect.com/

• Expensive monsters:

• SAS http://www.sas.com/en_us/home.html

• IBM SPSS http://www-01.ibm.com/software/analytics/spss/

• STATISTICA http://www.statsoft.com/

• John C. Pezzullo’s comprehensive list of statistical software: http://statpages.org/

184

Thank you for your attention


185
