Ten Statistics Commandments That Almost Never Should Be Broken

Thomas R. Knapp, Jean K. Brown

Correspondence to Thomas R. Knapp

E-mail: [email protected]

Thomas R. Knapp

Professor Emeritus

University of Rochester and

The Ohio State University

78-6800 Alii Dr #10

Kailua-Kona, HI 96740

Jean K. Brown

Professor and Dean Emerita

University at Buffalo

State University of New York

Buffalo, NY

Abstract: Quantitative researchers must choose among a variety of statistical methods when analyzing their data. In this article, the authors identify ten common errors in how statistical techniques are used and reported in clinical research and recommend stronger alternatives. Useful references to the methodological research literature in which such matters are discussed are provided. © 2014 Wiley Periodicals, Inc.

Keywords: statistics; statistical significance; measurement; research design

Research in Nursing & Health, 2014, 37, 347–351

Accepted 4 June 2014

DOI: 10.1002/nur.21605

Published online 29 June 2014 in Wiley Online Library (wileyonlinelibrary.com).

The realities of clinical nursing research are often at odds with the assumptions and peer expectations for statistical analyses. Random sampling is often impossible, so many clinical researchers employ sequential convenience sampling over several years and rely on multiple clinical sites for participant accrual. Many sample sizes are smaller than those determined by a priori power analysis because of funding limitations, changes in clinical practice, and "publish or perish" pressure. In addition, ordinal rather than interval-level measures are used for many study variables. Last, peer reviewers and journal editors sometimes make demands that are not always consistent with ideal statistical desiderata.

So what is the clinical researcher to do? In the spirit of Knapp and Brown (1995), "Ten measurement commandments that often should be broken," we present ten statistics commandments that almost never should be broken. These are not the only statistics commandments that should be obeyed, but these often have serious consequences when not followed.

The Ten Commandments

Thou Shalt Not Carry Out Significance Tests for Baseline Comparability in Randomized Clinical Trials

It has become increasingly common in randomized clinical trials to test the baseline comparability of the participants who have been randomly assigned to the various arms. There are several problems with this:

a. It reflects a mistrust of probability for providing balance across study arms. The significance test or confidence interval for the principal analysis takes into account pre-experimental differences that might be attributable to chance.

b. It involves the selection of a significance level that is often not based upon the consequences of making a Type I error and therefore is arbitrary.

c. If several baseline variables are involved in the test, it is likely that there will be at least one variable for which the difference(s) is (are) statistically significant. If that were to happen, should the researcher ignore it? Use such variables as covariates in the principal analysis after the fact? Both of those are poor scientific practices. Covariates should be chosen based upon their relationships to the dependent variable, not because of their imbalance at baseline.

For more on this matter, see Senn (1994) and Assmann, Pocock, Enos, and Kasten (2000).

But why do some people break this commandment? Because of small samples, for which random assignment might not produce pre-experimental equivalence? Because editors and/or reviewers require it? A better strategy is to create blocks of participants in small sequential groups and then randomly assign participants within each block, in order to improve the comparability of the study arms (see Efird [2011]). For example, a randomized clinical trial with a desired sample of 100 could be divided into 10 blocks of 10 participants each. Within each block of 10, 5 participants would be randomly assigned to the experimental treatment group and 5 to the usual care group.
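As a minimal sketch of this kind of blocked assignment (an illustration only, not the authors' procedure; the block size, group labels, and random seed are our assumptions), treatment labels can be shuffled within each block of 10 so that every block contains exactly 5 participants per arm:

```python
import random

def blocked_assignment(n_participants=100, block_size=10, seed=2014):
    """Assign participants to two arms, balancing within each block.

    Each block of `block_size` consecutive enrollees receives exactly half
    experimental and half usual-care assignments, in random order.
    """
    assert block_size % 2 == 0, "block size must be even for a 1:1 ratio"
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_participants // block_size):
        block = (["experimental"] * (block_size // 2)
                 + ["usual care"] * (block_size // 2))
        rng.shuffle(block)              # random order within the block
        assignments.extend(block)
    return assignments

arms = blocked_assignment()
print(arms[:10])                        # first block: 5 of each label, shuffled
print(arms.count("experimental"))       # 50 overall
```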

Thou Shalt Not Pool Data Across Research Sites Without First Testing for "Poolability"

In a multi-site investigation, it is common that researchers combine the data across sites in order to get a larger sample size, without first determining the extent to which it is appropriate to do so. For example, in a study of the relationship between age and pulse rate, it is possible that the relationship between those two variables might be quite different at Site A than at Site B. The subjects at Site A might be older and have higher pulse rates than the subjects at Site B. If the data are pooled across sites, the relationship could be artificially inflated. (The same thing happens if you pool the data for females and for males in investigating the relationship between height and weight. Males are generally taller and heavier, which if viewed on a scatter diagram would result in stretching out and thinning out the pattern to give a better elliptical fit.)

In randomized clinical trials involving two or more sites, the researcher must first test the treatment-by-site interaction effect. Only if that effect is small is it justified to pool the data across sites and test the main effect of treatment (see Kraemer [2000]).
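One way to carry out such a check is sketched below (illustrative only, not the authors' prescribed procedure; the variable names, data file, and use of statsmodels are our assumptions): fit a two-way model with a treatment-by-site term and inspect that term before pooling.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assumed layout: one row per participant, with columns
#   outcome (numeric), treatment ("exp"/"control"), site ("A", "B", ...)
df = pd.read_csv("trial_data.csv")  # hypothetical file name

model = ols("outcome ~ C(treatment) * C(site)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Inspect the C(treatment):C(site) row first. Only if that interaction is
# negligible does it make sense to pool across sites and interpret the
# main effect of treatment (cf. Kraemer, 2000).
```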

Thou Shalt Not Report a Variety of p Values After Having Chosen A Priori a Particular Significance Level or Confidence Level in Order to Determine Sample Size

Cohen (1988, 1992) initially urged researchers to use α (a tolerable probability of making a Type I error, usually .05), desired power (1 − β, the probability of not making a Type II error, usually .80), and the alternatively hypothesized effect size (usually "medium" and clinically observable) in order to determine the appropriate sample size for a given study. Having carried out such a study, researchers need only compare their obtained p value with the pre-specified α in order to claim statistical significance (if p is less than α) or statistical non-significance (if it is not).

Later, Cohen (1994)¹ and others regretted putting so much emphasis on significance testing and argued that confidence intervals (usually 95%) should be used instead. The two approaches are very closely related. Statistical significance (or non-significance) is demonstrated using confidence intervals because if the null hypothesized parameter is inside the 95% confidence interval, the finding is not statistically significant at the .05 level, and if the null hypothesized parameter is outside the 95% confidence interval, the finding is statistically significant at the .05 level. See Cumming and Finch (2005) for guidelines regarding the proper use of confidence intervals. The confidence interval approach was slow to catch on in the nursing research literature, but its use has noticeably increased.
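The correspondence is easy to see numerically. The sketch below (illustrative only; the data are fabricated and the one-sample t test is our choice of example) runs a significance test and the matching 95% confidence interval, and the two conclusions always agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.4, scale=1.0, size=30)   # fabricated data
null_value = 0.0

t_stat, p_value = stats.ttest_1samp(sample, popmean=null_value)

mean = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"p = {p_value:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
# The two statements below always match:
print("p < .05:             ", p_value < 0.05)
print("null outside 95% CI: ", not (ci_low <= null_value <= ci_high))
```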

More recently, there has been a tendency to report 95% confidence intervals along with the actual magnitudes of the p values. For example, the 95% confidence interval for a population Pearson correlation coefficient might be said to be from .45 to .55, and p might be expressed as less than .01, or equal to .0029 or another number. There are several reasons why this is bad practice.

First of all, if a particular value of α (for hypothesis testing) or 1 − α (for interval estimation) has been specified in order to determine an appropriate sample size, there is no need to be concerned about the size of p, with or without the use of single, double, and triple asterisks (see Slakter, Wu, and Suzuki-Slakter [1991]).

Second, p is not a measure of the strength of an effect. A correlation coefficient or a difference between means is appropriate for that and should always be reported.

Third, if the credibility of the inference depends upon .05 (for 95% confidence) on the one hand and upon .01 (for statistical significance) on the other hand, which should the reader care about? Finally, if more than one inferential procedure has been carried out, with only 95% confidence intervals but with p values that are all over the place, the reader is subjected to information overload.

The actual magnitude of the p value is not important. All that matters is whether or not it is less than α. (Claiming that a finding "approached statistical significance," "just missed being statistically significant," or the like is simply indefensible.) The strength of the effect is indicated by an effect size statistic, such as a correlation coefficient or a difference between means, not by p.

Thou Shalt Not Use the Word "Significant" in the Statement of a Hypothesis

In the nursing research literature, it is common to find the phrasing of a hypothesis as "There is no significant relationship between X and Y" (null) or "There is a significant relationship between X and Y" (alternative). Both are wrong. Hypotheses are about populations and their parameters, even when they are tested in a given study by using statistical results for samples. Statistical significance or non-significance is a consequence of choice of significance level and sample size as well as the actual magnitude of the test statistic.

The primary interest is in the population from which the sample has been drawn. Example of poor wording: There is no significant relationship between height and weight. Example of appropriate wording: There is no relationship between height and weight. The relationship between height and weight in the sample might or might not be statistically significant. In the population, there is either a relationship between height and weight or there is not. See Polit and Beck (2008) for a good discussion of the wording of hypotheses.

¹There is an unfortunate error in this article. Cohen claimed that many people think the p value is the probability that the null hypothesis is false. Not so; they think the p value is the probability that the null hypothesis is true. It is neither. The p value is the probability of the obtained result or anything more extreme, if the null hypothesis is true.

Thou Shalt Not Refer to Power After a Study Has Been Completed

Power is an a priori concept, as is significance level. Before a study is carried out, researchers specify the power they desire, reflecting the desired probability of accurately rejecting the null hypothesis when it is false, and accrue a sample of the size that provides such power. After the study has been completed, whether or not the null hypothesis has been rejected, it is inappropriate to address the power that they actually had after the fact. If the null hypothesis is rejected, the researchers had sufficient power to do so. If it is not rejected, either there was not sufficient power or the null hypothesis is true.

To calculate an observed (post hoc, retrospective) power is worthless. It is perfectly correlated (inversely) with the observed p value; that is, the higher the observed power, the smaller the p value associated with it. For excellent discussions of problems with observed power, see Zumbo and Hubley (1998) and Hoenig and Heisey (2001).
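That lockstep relationship is straightforward to demonstrate. In the toy sketch below (not part of the original article; the sample size and effect sizes are arbitrary assumptions), observed power for a two-group t test is computed from several observed effect sizes: as the p value shrinks, the "observed power" rises, so it carries no information beyond p itself.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

n = 25                       # participants per group (arbitrary)
power_calc = TTestIndPower()

for d_observed in [0.2, 0.4, 0.6, 0.8]:           # observed Cohen's d values
    t = d_observed * np.sqrt(n / 2)               # two-sample t statistic, equal n
    p = 2 * stats.t.sf(abs(t), df=2 * n - 2)      # two-sided p value
    obs_power = power_calc.power(effect_size=d_observed, nobs1=n,
                                 alpha=0.05, ratio=1.0)
    print(f"d = {d_observed:.1f}  p = {p:.3f}  observed power = {obs_power:.2f}")
# Smaller p always pairs with higher "observed power": the post hoc power
# calculation is just a re-expression of the p value.
```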

Thou Shalt Not Use Descriptive Statistics Developed for Interval and Ratio Scales to Summarize Ordinal Scale Data

Consider the (arithmetic) mean, the standard deviation, and the Pearson product-moment correlation coefficient: These three ways of summarizing data, whether for a population or for a sample drawn from a population, are used very often in quantitative research. All of them require that the scale of measurement be at least interval (they are also fine for ratio scales), because they all involve the addition and/or the multiplication of various quantities. Such calculations are meaningless for nominal and ordinal scales. For example, you should not add a 1 for "strongly agree" to a 5 for "strongly disagree," divide by 2 (i.e., multiply by 1/2), and get a 3 for "undecided."

Some people see nothing wrong with describing ordinal data through the use of means, standard deviations, and Pearson rs. Others are vehemently opposed to so doing. There have been heated arguments about this since Stevens (1946) proposed his nominal, ordinal, interval, and ratio taxonomy [see, for example, Gaito (1980) and Townsend and Ashby (1984)], but cooler heads soon prevailed (Moses, Emerson, & Hosseini, 1984). An entire book (Agresti, 2010) has been written about the proper analysis of ordinal data. For one of the best discussions of the inappropriateness of using traditional descriptive statistics with ordinal scales, see Marcus-Roberts and Roberts (1987).

Because of the prevalence of ordinal measurement in clinical research and the favoring of parametric inference rather than non-parametric inference, it is difficult to forgo treating ordinal scales like interval scales, given that parametric tests usually have greater power than non-parametric tests. However, the latter can have greater power when the assumptions for the level of measurement (i.e., interval or ratio) for parametric tests are not satisfied [see, for example, Sawilowsky (2005)].
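For readers who want concrete alternatives, the sketch below (illustrative only; the Likert-type responses are invented) summarizes ordinal responses with medians and quartiles and relates two ordinal variables with Spearman's rho, which uses only rank order, rather than Pearson's r.

```python
import numpy as np
from scipy import stats

# Invented 5-point Likert responses (1 = strongly agree ... 5 = strongly disagree)
item_a = np.array([1, 2, 2, 3, 4, 5, 5, 4, 2, 1, 3, 4])
item_b = np.array([2, 2, 3, 3, 4, 5, 4, 4, 1, 2, 3, 5])

# Order-based summaries are meaningful for ordinal data
print("median of item A:", np.median(item_a))
print("quartiles of item A:", np.percentile(item_a, [25, 75]))

# Spearman's rho depends only on the rank order of the responses
rho, p = stats.spearmanr(item_a, item_b)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```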

Thou Shalt Not Report Percentages That Do Not Add to 100 Without a Note Indicating Why They Do Not

When reading a journal article that contains percentages for a particular variable, it is instructive to check to see if they add to 100, or very close to 100 (if they have been rounded). Why? First of all, if they do not add to 100, it may indicate a lack of care on the authors' part that could extend elsewhere in the article. Second, almost every study has missing data, and it is often the case that the percentages are taken out of the total sample size rather than the non-missing sample size. Finally, in an excellent article, Mosteller, Youtz, and Zahn (1967) show that the probability of percentages adding to exactly 100 is perfect for variables with two categories, approximately 3/4 for three categories, approximately 2/3 for four categories, and approximately √(6/(cπ)) for c ≥ 5, where c is the number of categories and π is the well-known ratio of the circumference of a circle to its diameter (approximately 3.14). The greater the number of categories, the less likelihood of an exact total of 100%.
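A quick simulation makes the point (a sketch added here for illustration; the uniform-on-the-simplex sampling model and trial count are our assumptions, not Mosteller et al.'s exact setup): as the number of categories grows, rounded percentages sum to exactly 100 less and less often, roughly at the √(6/(cπ)) rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20_000

for c in [2, 3, 4, 5, 8, 12]:
    # Random compositions of c category proportions (uniform on the simplex)
    samples = rng.dirichlet(np.ones(c), size=n_trials)
    rounded = np.round(100 * samples)                  # whole-number percentages
    p_exact_100 = np.mean(rounded.sum(axis=1) == 100)
    approx = np.sqrt(6 / (c * np.pi))                  # approximation intended for c >= 5
    print(f"c = {c:2d}  simulated P(sum = 100) = {p_exact_100:.3f}"
          f"  sqrt(6/(c*pi)) = {approx:.3f}")
```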

A related matter is the accurate use and interpretation of percentages in cross-tabulations. For a contingency ("cross-tabs") table, for example, both SAS and SPSS output can provide three percentages (row, column, and total) for each of the cell frequencies. Usually only one of those is relevant to the research question. Using the wrong one can have a serious effect on the interpretation of the finding. See Garner (2010) for an interesting example of a study of the relationship between pet ownership and home location, in which using percentages the wrong way resulted in failure to answer the research question.
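As a small illustration of the three percentage bases (the tiny data set and variable names are invented, and this is not Garner's actual example), a cross-tabulation can be normalized by row, by column, or by the grand total, and typically only one of these answers the research question.

```python
import pandas as pd

# Invented data: pet ownership by home location
df = pd.DataFrame({
    "location": ["urban"] * 6 + ["rural"] * 4,
    "owns_pet": ["yes", "yes", "no", "no", "no", "no",
                 "yes", "yes", "yes", "no"],
})

print(pd.crosstab(df["location"], df["owns_pet"]))                       # raw counts

# Row percentages: of those in each location, what share own pets?
print(pd.crosstab(df["location"], df["owns_pet"], normalize="index"))

# Column percentages: of owners (or non-owners), what share live in each location?
print(pd.crosstab(df["location"], df["owns_pet"], normalize="columns"))

# Total percentages: each cell as a share of the whole sample
print(pd.crosstab(df["location"], df["owns_pet"], normalize="all"))
```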

Thou Shalt Not Use Pearson Product-Moment Correlation Coefficients Without First Testing for Linearity

This commandment is arguably broken more often than any of the others. The "Pearson r," as it is affectionately called, is a measure of the direction and the magnitude of the linear relationship between two variables, but it is unusual to read an article in which the researcher(s) has (have) provided a scatterplot or otherwise indicated that they tested the linearity of the relationship before calculating Pearson r. The principal offenders are those who report Cronbach's α as a measure of the internal consistency reliability of a multi-item test without exploring the linearity of the relationships involved. Non-linear inter-item relationships can seriously influence (upward or downward) that well-known indicator of reliability. (See Sijtsma [2009] for a critique of the uses and misuses of Cronbach's α.)

In order to test for linearity, there are three choices: (1) "eyeball" the scatterplot to see whether or not the pattern looks linear; (2) use residual analysis (Verran & Ferketich, 1987); or (3) test for statistical significance the difference between the sample r² and the sample η². (This test has been available for over a century. See, for example, Blakeman [1905].) A brief sentence regarding linearity in the research report would assure the reader that the matter has been addressed.
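The third option can be sketched in a few lines (an illustration under our own assumptions: one common form of the departure-from-linearity F test, η² computed by treating each distinct X value as a group, and invented data; this is not reproduced from Blakeman). A large gap between η² and r² signals a relationship that Pearson r would misrepresent.

```python
import numpy as np
from scipy import stats

def linearity_check(x, y):
    """Compare r^2 with eta^2 (each distinct x treated as a group) and return
    an F test for departure from linearity: F = [(eta2 - r2)/(k - 2)] / [(1 - eta2)/(n - k)].
    Assumes several cases per distinct x value.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    r, _ = stats.pearsonr(x, y)
    r2 = r ** 2

    grand_mean = y.mean()
    groups = [y[x == value] for value in np.unique(x)]
    k = len(groups)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((y - grand_mean) ** 2).sum()
    eta2 = ss_between / ss_total

    f = ((eta2 - r2) / (k - 2)) / ((1 - eta2) / (n - k))
    p = stats.f.sf(f, k - 2, n - k)
    return r2, eta2, f, p

# Invented example: a clearly curved relationship
x = np.repeat([1, 2, 3, 4, 5], 10)
y = (x - 3) ** 2 + np.random.default_rng(3).normal(0, 0.5, size=x.size)
print(linearity_check(x, y))   # eta^2 far exceeds r^2, so Pearson r is misleading
```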

Thou Shalt Not Dichotomize or Otherwise Categorize Continuous Variables Without a Very Compelling Reason for Doing So

Many people think that in order to test main effects and interaction effects of independent variables, those variables must always be categorical. Not so. Categorizing a continuous variable always throws away interesting information, violates the assumptions of many parametric statistics, and is not advisable (see, for example, MacCallum, Zhang, Preacher, and Rucker [2002], Streiner [2002], and Owen and Froman [2005]). A multiple regression approach to the analysis of variance permits testing of main effects and interaction effects of both continuous and categorical independent variables.
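A minimal sketch of that regression approach (the variable names, data file, and use of statsmodels are assumptions for illustration) keeps age continuous while still testing its main effect and its interaction with a categorical group indicator:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")   # hypothetical file with age, group, outcome

# Age stays continuous; group is categorical; the product term tests interaction.
model = smf.ols("outcome ~ age * C(group)", data=df).fit()
print(model.summary())
# The age:C(group) coefficients test whether the age slope differs by group,
# with no need to carve age into 0-4, 5-9, ... or generation categories.
```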

A continuous variable that is frequently categorized isage. It could be argued that continuous age is always cate-gorized, even for single years of age of 1, 2, 3, etc. Butdichotomization is an extreme case of coarseness of cate-gorization, and is rarely necessary. Age can be determinedto the nearest year, month, week, day, or even minute,using statistical software such as SAS or SPSS. Just inputthe date of birth and the date that is of concern in the study,and age can easily be calculated. There is no need totransform actual ages into intervals such as 0–4, 5–9, 10–14, etc.; or into popular categories such as “Silent Genera-tion” (those born 1922–1945), “Baby Boomers” (those born1946–1964), “Generation X” (those born 1965–1980), and“Millennial Generation” (those born 1981–2000), unlessthose categories are of direct interest in the study.
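If statistical software is not at hand, exact age is simply date arithmetic. The short sketch below (an illustration with made-up dates, not a SAS or SPSS procedure; the 365.25-day year is an approximation we assume) computes age in fractional years.

```python
from datetime import date

def age_in_years(date_of_birth: date, reference_date: date) -> float:
    """Age in (fractional) years between two dates, using the average year length."""
    return (reference_date - date_of_birth).days / 365.25

dob = date(1957, 8, 23)              # made-up date of birth
enrollment = date(2014, 6, 4)        # made-up study date
print(f"{age_in_years(dob, enrollment):.2f} years")   # about 56.78, no need to bin
```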

Thou Shalt Not Test for Main Effects Without First Testing for Interaction Effects in a Factorial Design

All combinations of significance of main effects and interaction effects are possible (both significant, both not, and one significant and the other not). When both are statistically significant, the interaction takes precedence in interpreting the results (the effect of Factor A depends upon the level of Factor B). Interaction effects constrain the interpretation of the findings, so there is no need to test the main effects when the interaction is statistically significant. Such findings are often theoretically undesirable if the aim is to answer definitively whether or not a treatment is effective, but they can be useful in practice. For example, an experimental intervention might be better for males than for females, whereas the corresponding control intervention might be better for females than for males. See the second commandment (above) for the special case of treatment-by-site interaction, and see Ameringer, Serlin, and Ward (2009) regarding the phenomenon known as Simpson's Paradox, whereby main and interaction effects can be contradictory.

Conclusion

Unlike the 10 measurement commandments (Knapp & Brown, 1995) that should be broken much of the time, these 10 statistics commandments should be kept most of the time, in order to be methodologically correct. The penalty for breaking them will not be eternal damnation, but might be erroneous interpretation of data and reduced scholarly recognition.

References

Agresti, A. (2010). Analysis of ordinal categorical data (2nd ed.). New York: Wiley.

Ameringer, S., Serlin, R. C., & Ward, S. (2009). Simpson's Paradox and experimental research. Nursing Research, 58, 123–127. doi: 10.1097/NNR.0b013e318199b517

Assmann, S., Pocock, S. J., Enos, L. E., & Kasten, L. E. (2000). Subgroup analysis and other (mis)uses of baseline data in clinical trials. The Lancet, 355, 1064–1069. doi: 10.1016/S0140-6736(00)02039-0

Blakeman, J. (1905). On tests for linearity of regression in frequency distributions. Biometrika, 4, 332–350.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi: 10.1037/0003-066X.49.12.997

Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60, 170–180.

Efird, J. (2011). Blocked randomization with randomly selected block sizes. International Journal of Environmental Research and Public Health, 8, 15–20. doi: 10.3390/ijerph8010015

Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin, 87, 564–567. doi: 10.1037//0033-2909.87.3.564

Garner, R. (2010). The joy of stats (2nd ed.). Toronto: University of Toronto Press.

Hoenig, J. M., & Heisey, D. M. (2001). The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.

Knapp, T. R., & Brown, J. K. (1995). Ten measurement commandments that often should be broken. Research in Nursing & Health, 18, 465–469. doi: 10.1002/nur.4770180511

Kraemer, H. C. (2000). Pitfalls of multisite randomized clinical trials of efficacy and effectiveness. Schizophrenia Bulletin, 26, 533–541.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40. doi: 10.1037//1082-989X.7.1.19

Marcus-Roberts, H. M., & Roberts, F. S. (1987). Meaningless statistics. Journal of Educational Statistics, 12, 383–394. doi: 10.3102/10769986012004383

Moses, L. E., Emerson, J. D., & Hosseini, H. (1984). Analyzing data from ordered categories. New England Journal of Medicine, 311, 442–448. doi: 10.1056/NEJM198408163110705

Mosteller, F., Youtz, C., & Zahn, D. (1967). The distribution of sums of rounded percentages. Demography, 4, 850–858.

Owen, S. V., & Froman, R. D. (2005). Why carve up your continuous data? Research in Nursing & Health, 28, 496–503. doi: 10.1002/nur.20107

Polit, D. F., & Beck, C. T. (2008). Nursing research: Generating and assessing evidence for nursing practice. Philadelphia: Lippincott.

Sawilowsky, S. S. (2005). Misconceptions leading to choosing the t test over the Wilcoxon Mann–Whitney U test for shift in location parameter. Journal of Modern Applied Statistical Methods, 4, 598–600.

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726. doi: 10.1002/sim.4780131703

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120. doi: 10.1007/s11336-008-9101-0

Slakter, M. J., Wu, Y.-W. B., & Suzuki-Slakter, N. S. (1991). *, **, and ***: Statistical nonsense at the .00000 level. Nursing Research, 40, 248–249.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.

Streiner, D. L. (2002). Breaking up is hard to do: The heartbreak of dichotomizing continuous data. Canadian Journal of Psychiatry, 47, 262–266.

Townsend, J. C., & Ashby, F. G. (1984). Measurement scales and statistics: The misconception misconceived. Psychological Bulletin, 96, 394–401. doi: 10.1037/0033-2909.96.2.394

Verran, J. A., & Ferketich, S. L. (1987). Testing linear model assumptions: Residual analysis. Nursing Research, 36, 127–129.

Zumbo, B., & Hubley, A. M. (1998). A note on misconceptions concerning prospective and retrospective power. The Statistician, 47, 385–388. doi: 10.1111/1467-9884.00139
