Statistical tests in publications of The Wildlife Society

Steve Cherry

Author's address: Department of Mathematics, Montana Tech of The University of Montana, Butte, MT 59701, USA.

Key words: confidence intervals, hypothesis testing, P-value, power, significance testing

The 1995 issue of The Journal of Wildlife Management (the Journal) has >2,400 P-values. I believe that is too many. In this article I argue that authors who publish in the Journal and in the Wildlife Society Bulletin (the Bulletin) are overusing and misusing hypothesis tests. They are conducting too many unnecessary tests, and they are making common mistakes in carrying out and interpreting the results of the tests they conduct. A major cause of the overuse of testing in the Journal and the Bulletin seems to be the mistaken belief that testing is necessary in order for a study to be valid or scientific.

The opinions presented below are not new or unique. The eagerness with which scientists embraced testing has been questioned before, with 1 critic labeling the attraction a religion (Salsburg 1985). It may be surprising to many wildlife researchers to learn that some of the strongest critics of testing are statisticians. The statistician Frank Yates wrote, "It [statistical testing] has caused scientific research workers to pay undue attention to the results of the tests of significance they perform on their data . . . and too little to the estimates of the magnitude of the effects they are estimating" (Yates 1951:32). Cox (1977) began his article on significance testing with "Overemphasis on tests of significance at the expense especially of interval estimation has long been condemned." In the President's Address at the 1996 meeting of the Western North American Region of the Biometric Society, Professor John Nelder lamented the "malign influence of P-values" and asked, "Why do editors think that P-value dominated analysis constitutes a scientific procedure?" There have been recent attacks on testing in psychology and epidemiology (Tukey 1969; Rothman 1986; Simon 1986; Rosnow and Rosenthal 1989; Cohen 1990, 1995; Goodman 1992, 1993). Others have criticized the use of testing in ecology (Yoccoz 1991, Johnson 1995). I will offer little new outside of bringing these ideas to wildlife professionals who publish, read, and review articles in the Journal and the Bulletin. I should add that many authors and reviewers learned how to do statistics from statisticians, and so statisticians who teach this material (and I am one of those) must share blame for the current state of affairs.

I will discuss 4 major problem areas. The first is the widespread use of testing in situations where it is not warranted. The second is the apparent confusion over power and its interpretation. This confusion is leading authors to draw the wrong conclusion when they observe high P-values and low power. The third is how authors assess the validity of assumptions of the tests they use. The fourth is the widespread use of fixed-level testing. I restrict my comments to simple procedures like t-tests and simple linear regression, because most wildlife researchers are familiar with these.

I do not make a distinction between Fisherian significance testing and Neyman-Pearson hypothesis testing. The manner in which tests are carried out in the Journal and the Bulletin is a hybrid of these 2 approaches. The distinction has been addressed elsewhere and readers who are interested can refer to Cohen (1990), Goodman (1992), and Moore and McCabe (1993). It is worth pointing out that the 2 approaches are different. Fisher thought the alternative hypothesis, Type I and II errors, and power ridiculous concepts. P-values and their interpretation played no role in the Neyman-Pearson theory.

Unnecessary tests

Many of the tests reported in the Journal and the Bulletin are unnecessary. Three specific categories have been chosen to illustrate this point: testing in habitat-use-availability studies, testing in regression, and testing biologically insignificant or obvious results.

Habitat-use-availability studies

Neu et al. (1974) has been cited hundreds of times in the published literature. The times it has been cited in unpublished theses and reports surely number in the thousands. The approach recommended in Neu et al. (1974) starts with a chi-square goodness-of-fit test to determine if animals are selecting habitats in proportion to their availability. If the null hypothesis of random use is rejected, then a set of simultaneous (typically) 95% confidence intervals is constructed to determine which habitats are being selected and which are being avoided.

The chi-square test followed by construction of the intervals if the null hypothesis is rejected is an example of a hierarchical, multiple-comparisons procedure. The testing procedure is logically flawed in this case because it is possible to reject the null and not find evidence of selection or avoidance in the intervals, and it is possible to fail to reject the null hypothesis and find evidence of selection or avoidance in the intervals. Hochberg and Tamhane (1987) referred to such problems as a lack of coherence and consonance and considered any procedure lacking coherence to be fatally flawed. This in itself leads to the conclusion that wildlife researchers should avoid the testing procedure, but there is a second problem with the approach that is even more fundamental: animals do not choose habitats in proportion to their availability. Why conduct a test to determine if they do?

The question of habitat use-availability is more properly approached as an estimation problem. That is, researchers should not be asking the question "Are animals using habitats in proportion to their availability," but rather "How much time are animals spending in the available habitats." One approach to answering this question is to omit the test and simply construct the set of confidence intervals. If the relevant assumptions are met, the intervals are valid regardless of the outcome of the test (Byers et al. 1984, Cherry 1996). One can compare the available proportions with the endpoints of the intervals to see which habitats are being selected or avoided.

I believe that it is better to avoid the question of testing altogether. However, if one treats the confidence interval procedure as a test with the same null and alternative hypotheses as in Neu et al. (1974), the probability of a Type I error is being controlled (conservatively) at a rate equal to 1 minus the simultaneous confidence level. The standard large sample intervals used in Neu et al. (1974), and explained further in Byers et al. (1984), may not be the best to use. Cherry (1996) discussed some alternatives. Tukey (1991) offered further justification for the use of intervals in multiple-comparisons procedures. Many of the alternatives to the Neu et al. (1974) method suffer from the same problem, notably the increasingly popular method of Aebischer and Robertson (1993). There are approaches to the problem of habitat selection that rely on more sophisticated methods, but constructing a set of intervals is easy to do and there will be many times when it is a perfectly reasonable thing to do.
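To make the interval-only approach concrete, here is a minimal sketch in Python of simultaneous Bonferroni confidence intervals for habitat-use proportions, in the spirit of the large-sample intervals discussed above. The location counts and availability proportions are hypothetical, and the Bonferroni form is only one reasonable choice among the alternatives cited.

```python
# A minimal sketch: simultaneous Bonferroni confidence intervals for
# habitat-use proportions, compared against availability. All numbers
# below are hypothetical illustrations, not data from any cited study.
import numpy as np
from scipy import stats

use_counts = np.array([25, 60, 10, 5])           # animal locations per habitat
available = np.array([0.30, 0.40, 0.20, 0.10])   # availability proportions

n = use_counts.sum()
k = len(use_counts)
p_hat = use_counts / n

# Bonferroni adjustment: overall confidence ~95% across all k intervals
alpha = 0.05
z = stats.norm.ppf(1 - alpha / (2 * k))
se = np.sqrt(p_hat * (1 - p_hat) / n)            # large-sample standard errors
lower, upper = p_hat - z * se, p_hat + z * se

for i in range(k):
    if available[i] < lower[i]:
        verdict = "selected"
    elif available[i] > upper[i]:
        verdict = "avoided"
    else:
        verdict = "no evidence either way"
    print(f"habitat {i + 1}: use {p_hat[i]:.2f} [{lower[i]:.2f}, {upper[i]:.2f}] "
          f"vs. availability {available[i]:.2f} -> {verdict}")
```

The point of the sketch is that the intervals alone answer the question of interest; no preliminary chi-square test is needed.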

Testing in regression

One of the more common figures to appear in the Journal and the Bulletin shows the results of conducting a simple linear regression of a response variable on an explanatory variable. The figure shows the data and some or all of the following: a regression line and the estimated equation, a P-value, and an R-squared value. The P-value provides evidence that a significant relationship exists, and the R-squared value is a measure, seemingly, of the goodness of the fit. In particular, the R-squared value seems to be understood as a measure of the predictive capability of the model. The null hypothesis is that the slope parameter in the model is equal to 0. The alternative is that the slope is not equal to 0. In the majority of cases in the Journal and the Bulletin, these are uninteresting hypotheses. The null is false; a relationship does exist. The more interesting question is "What is the relationship?" In general, there is little mention of the standard error of the slope of the regression line, but that is far more important for answering the real question of interest than the P-value or the R-squared value. The lack of this appropriate measure of uncertainty is particularly puzzling in articles where the main use of the published regression results is prediction (e.g., Millspaugh and Brundige 1996). The R-squared value does not provide an adequate measure of the predictive capability of a model (Neter et al. 1996:82-83). It is possible to have a high R-squared value and still have prediction intervals so wide as to be practically useless. A regression equation to be used in prediction should always be accompanied by the associated prediction intervals, or enough information to allow readers to construct them.
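As a concrete illustration of the point about prediction, the following sketch (Python, simulated data) fits a simple linear regression and computes a 95% prediction interval at a new x-value from the usual formula, so readers can compare the interval's width with the R-squared value. Neither the data nor the model come from any of the articles discussed.

```python
# A minimal sketch: report prediction intervals alongside (or instead
# of) R-squared for a simple linear regression. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, 30)   # hypothetical response

n = len(x)
slope, intercept, r, p_value, se_slope = stats.linregress(x, y)
resid = y - (intercept + slope * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))      # residual standard error
sxx = np.sum((x - x.mean())**2)

x0 = 5.0                                     # point at which to predict
y0 = intercept + slope * x0
t_crit = stats.t.ppf(0.975, n - 2)
half_width = t_crit * s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)

print(f"R-squared = {r**2:.2f}, slope = {slope:.3f} (SE {se_slope:.3f}), P = {p_value:.2g}")
print(f"95% prediction interval at x = {x0}: {y0:.2f} +/- {half_width:.2f}")
```

The prediction interval, not the R-squared value, tells a reader how precisely the fitted equation can predict a new observation.

Testing obvious hypotheses

The 2 problem areas discussed above dealt with this topic in some fashion. I wish to go further and address the unnecessary testing (or at least the unnecessary reporting of the results of testing) in 2 cases: (1) testing when there is no difference or effect of biological significance, and (2) testing when the results are obvious.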

The main message is this: it is not necessary to test every result.

Introductory statistics texts (e.g., Moore and McCabe 1993) often point out that statistically significant results do not necessarily mean that one has found something of importance. Observed biologically insignificant effects do not need to be tested. I realize that the determination of what constitutes a biologically significant result is subjective. But researchers should at least look at the results of their studies and ask themselves if the observed effects are biologically meaningful before carrying out tests. One clear-cut example can be found in Franklin and Johnson (1994:255) where a 2-sample t-test was conducted when the 2 sample means had the same value.

Pre-test power calculations can provide some guidance in determining if an observed effect is biologically significant (insignificant). There is little reason to test an observed effect if it is less than the biologically meaningful effect specified in the power calculation.

Testing can be avoided when differences are obviously going to be statistically significant or insignificant. Leif (1994) compared survival rates, predation rates, and nesting success of wild versus pen-raised pheasants (Phasianus colchicus). Most of the observed differences were so large (e.g., pen-reared survival rate estimated at 7.8 ± 2.4% SE vs. wild survival rate estimated at 54.6 ± 6.6% SE) that testing was not needed to conclude the differences were real. The reported rates and standard errors were sufficient. Another example can be found in Hyvarinen and Nygren (1993), in which a comparison of copper and zinc concentrations in the livers of newborn and older moose (Alces alces) calves is made. Their Figure 2 showed quite clearly that there was a difference. The graphical evidence was overwhelming, and a test was not required.

I am not advocating that researchers pore over their data in a search for significance. This is not appropriate. But researchers should have well-defined questions to answer, and some idea of the biological differences and effects that are of interest to them, before beginning a study. They do not necessarily need a statistical test to determine if the differences or effects they are looking for exist or do not exist.

Some might contend that it does no harm to carry out tests and report the P-values anyway, but I disagree. Tests should be conducted only when and where necessary. Cluttering up the pages of the Journal and the Bulletin with hundreds of unnecessary P-values leads to a perception that they are a required part of valid scientific research and that investigators cannot be trusted to draw a valid conclusion on the basis of the biological or visual evidence alone. This is simply not true. Another consequence of the overemphasis on testing is that investigators may be forced to carry out difficult, sophisticated analyses in order to get a result published even when the effect is obvious. As Cox and Snell (1981:24) put it, "...pressures to apply...tests of significance to totally obvious results should be resisted."

Misunderstanding power

There has been an increase in reporting the power of statistical tests in the Journal and the Bulletin. Power is the probability of rejecting the null hypothesis when it is false. It is a function of the Type I error rate, the sample size, and the effect size. Typically power calculations are carried out prior to running a study. Acceptable levels of error for both Type I and Type II errors are specified, and the sample size necessary to achieve those levels is determined. Typical levels for Type I error rates are 0.05 or 0.10, and typical acceptable error rates for Type II errors are 0.20 or 0.10 (corresponding to power of 0.80 or 0.90, respectively). Also needed in this preliminary power calculation is a decision by the investigator(s) as to what constitutes a meaningful effect size, i.e., a preliminary power calculation requires researchers to consider in an explicit way just what constitutes a biologically significant result. That is one of the most beneficial aspects of computing power. If the results of the initial power calculations reveal that one needs an unacceptably large sample to detect a meaningful difference at the specified error rates, then the only options are to adjust those rates or to adjust the definition of a biologically significant effect. It may turn out that the results of the power calculation imply that the study is not worth doing.

Power calculations are also done in a post-test situation if the results of the test fail to yield significant results. This is a common situation in the Journal and the Bulletin. In either case it is inappropriate for authors to claim that there is no effect if the results of a test yield a nonsignificant P-value and low power. In other words, failing to reject the null hypothesis of no effect is not the same thing as accepting the null hypothesis.

For example, Cotter and Gratto (1995:95) stated, "Adult hens did not produce..." and reported the results of a test with a P-value of 0.92 and power of 0.17. It is important to understand what this means. They were saying that if a specified biologically meaningful difference existed prior to testing, then they had an estimated 17% chance of detecting it with their sample size.
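The preliminary calculation described above can be sketched in a few lines. The following Python fragment uses the common normal approximation to the sample size needed per group for a 2-sample t-test; the 0.5-SD effect size is hypothetical, chosen only for illustration.

```python
# A minimal sketch of a pre-study sample-size calculation for a
# 2-sample t-test, via the usual normal approximation. The effect
# size is a hypothetical value standing in for a biologically
# meaningful difference chosen by the investigator.
import numpy as np
from scipy import stats

alpha = 0.05   # acceptable Type I error rate
power = 0.80   # 1 - acceptable Type II error rate
d = 0.5        # meaningful effect, in standard-deviation units

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2

print(f"approximately {np.ceil(n_per_group):.0f} animals per group")
# With d = 0.5 this gives ~63 per group. If that is infeasible, the
# honest options are to relax the error rates or to rethink what
# counts as a meaningful effect -- or to conclude the study is not
# worth doing.
```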

However, the claim of no effect was not warranted because they had little chance of detecting the effect even if it had existed. Cotter and Gratto (1995) appeared to have conducted post-test power analyses. Wilson et al. (1995) reported that pre-test power calculations yielded power of 0.27. They then reported 97 P-values, most of which led to the conclusion that there was no effect. However, given that the 97 null hypotheses were all false and the specified biologically meaningful effects actually existed, they would have made fewer Type II errors determining significance by flipping a fair coin 97 times and denoting heads significant and tails not significant. This criticism is not meant to imply that Wilson et al. (1995) did not learn something useful from their study. But nonsignificant statistical results from tests with power equal to 0.27 did not shed any light on what they did learn.

Testing assumptions

Statistical tests come with an assortment of assumptions. Most researchers are aware of these. However, it is all too common to find the assumptions ignored. Even when authors do evaluate the adequacy of assumptions, they tend to focus on the wrong ones and ignore the most important one. By way of example, the subsequent discussion deals mostly with the commonly conducted pooled variance 2-sample t-test.

The relevant assumptions for the pooled variance 2-sample t-test are that the data must be in the form of 2 simple random samples from 2 normally distributed populations with the same variance. It is not uncommon to read that the assumption of normality was checked. This is frequently followed by reports of some kind of transformation or a switch to a nonparametric test. However, if one has a large enough sample to conduct a meaningful test of normality, one probably does not need to be worried about nonnormality. The distribution of the test statistic in the pooled t-test will follow a Student's t distribution if the sampling distribution of the sample means is normal, the variances of the 2 populations are equal, and if the 2 samples are simple random samples from the 2 target populations. The only way to ensure this with small sample sizes is to require that the data be normally distributed, and with small sample sizes investigators generally have to live with this assumption because there is no good way to test it. However, the Central Limit Theorem implies that with large enough sample sizes the sampling distributions of the sample means will be approximately normal regardless of the distribution of the underlying populations. What is large enough? There is no clear-cut answer. It is always dangerous to give prescribed guidelines, but if sample sizes are >40 (for each sample) one typically does not need to worry about nonnormality. The test is remarkably robust for sample sizes ≥15, particularly if the 2 sample sizes are nearly equal (Moore and McCabe 1993). Most normality tests have low power when sample sizes are small enough for nonnormality to be a problem in a t-test. If the sample size is large enough for the normality test to have sufficient power to detect nonnormality, then the sample size is probably large enough for the sampling distributions of the sample means to be approximately normal. In short, most of the normality tests reported in the Journal and the Bulletin are unnecessary. Investigators should check for nonnormality, but visual assessment using normal probability plots is almost always adequate. Further, probability plots are more likely to highlight outliers, which tend to be more of a problem than general skewness.
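As a sketch of the visual assessment just recommended, the following Python fragment draws a normal probability plot for a simulated, deliberately skewed sample. Curvature in the plot suggests skewness; isolated points flag outliers. The data are hypothetical.

```python
# A minimal sketch: a normal probability plot as a visual check for
# nonnormality, in place of a formal normality test. Data simulated.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=25)  # hypothetical skewed data

# probplot orders the sample against normal quantiles; a roughly
# straight line is consistent with normality.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal probability plot")
plt.show()
```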

When a suitable transformation cannot be found, researchers often use nonparametric methods. The Mann-Whitney U test is a frequent choice for a nonparametric 2-sample test. One of the assumptions of the test is that both samples come from populations with the same distribution, with one shifted to the right of the other. If one distribution is skewed and the other is symmetric, the test is not appropriate. The Mann-Whitney test is more sensitive to violations of its distributional assumption than the 2-sample t-test. I have never seen an example of an article in the Journal or the Bulletin in which an author or authors attempted to determine if the distributional assumption, or any other assumption (one of which is equal variances), of the Mann-Whitney test was met for their data. Nonparametric implies distribution free, not assumption free. Johnson (1995) discusses this and other problems associated with the use of nonparametric procedures in ecology.

The fear of nonnormality not only leads researchers to unnecessarily transform their data or fall prey to the "statistical siren" of nonparametrics (Johnson 1995). It overshadows the more serious problems caused by violation of the assumption of simple random sampling. This absolutely critical assumption receives scant attention, and examples of inappropriately applied tests due to violations of this assumption are common.

Gorenzel and Salmon (1995) conducted tests to determine if differences existed between urban roost and nonroost trees of crows (Corvus brachyrhynchos). They tested for normality (although they had quite reasonable sample sizes of 87 roost trees and 62 nonroost trees) and commented on how they handled unequal variances. But the description of how they sampled the nonroost trees makes it clear that it was not a simple random sample. They gridded their study area, chose grid squares at random, and then randomly selected a nonroost tree in the grid. The essence of simple random sampling is that every sample of a specified size has an equal chance of being selected, and this was not true for the nonroost trees in this study. Thus, the standard errors of the difference between the sample means could not be computed using the standard formula, and any t-tests conducted using such a standard error calculation were invalid.

I do not wish to criticize the above authors too harshly. Wildlife studies are inherently difficult, and getting adequate data is challenging. I know that wildlife researchers worry a great deal about getting a representative sample from the population of interest to them, and they have little if any control over a host of potentially biasing factors. Frankly, if Gorenzel and Salmon (1995) had explicitly noted that they did not have a simple random sample of control trees but were going to act as if they did, I would not have chosen their paper as an example. The point I am trying to make is that they invested a great deal of effort evaluating normality and equal variance assumptions and ignored the 1 assumption that is truly critical for the validity of their results.

Moore and McCabe (1993:472-487) argued that, if statistical inference is the goal, then researchers need to have confidence in a probability model, i.e., a mathematical model that serves as an idealization of the process that produced their data. The assumption of simple random sampling ensures the probability laws apply. Wildlife researchers should be aware that many of the statistical methodologies they use were developed for analysis of data from randomized controlled experiments, and applications of those methods to nonrandom data from observational studies must be done with a great deal of care. I agree with Moore and McCabe (1993) that while statistical analysis of such data may be necessary, low P-values alone provide little evidence against null hypotheses in these types of studies.

The myth of 0.05

Fixed-level testing is used almost without exception in the Journal and the Bulletin. The most common level of significance is α = 0.05. If the P-value lies below this conventional level, then the typical investigator will claim the presence of an effect. Moore and McCabe (1993) argued that choosing a level of significance in advance implies that research is a decision-making process. The goal of research is not to make a decision, but to provide an incremental increase in understanding. Decisions are made when scientists as a group reach a consensus, after evaluating evidence from many studies.

There is nothing sacred about α = 0.05 (or any other level). There is no clean demarcation between finding an effect and not finding an effect. A P-value provides a measure of the strength of the evidence against a specified null hypothesis. This is a continuous scale of measurement, with low P-values providing more evidence than large ones. Authors would serve their science better by giving the P-value and commenting on what they believe they have learned, leaving it to others to agree or disagree with them. Time and replication will eventually determine who is correct.

The only justification for choosing a fixed level α is the role it plays in power calculations. But specifying a level of α in order to determine if a test has reasonable power does not mean one has to rigidly adhere to that level when evaluating the strength of the evidence. I believe the worst thing about fixed-level testing is that scientists engaged in the incredibly messy business of science (and field ecology is especially messy) abdicate their responsibility to evaluate the significance of a result to a canned, cookbook procedure. Scientists should never, under any circumstances, let a statistical procedure do their thinking for them.
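As a small illustration of this reporting style, the sketch below (Python, simulated data) computes a pooled 2-sample t-test and reports the estimated difference, a 95% confidence interval, and the exact P-value, leaving the judgment of significance to the reader rather than to a fixed α.

```python
# A minimal sketch: report the estimate, its uncertainty, and the
# exact P-value instead of a reject/fail-to-reject verdict. All data
# here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(10.0, 2.0, 20)   # hypothetical measurements
group_b = rng.normal(11.2, 2.0, 20)

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # pooled-variance test
diff = group_b.mean() - group_a.mean()
# With equal group sizes this standard error matches the pooled formula.
se = np.sqrt(group_a.var(ddof=1) / 20 + group_b.var(ddof=1) / 20)
t_crit = stats.t.ppf(0.975, 38)       # df = n1 + n2 - 2

print(f"difference = {diff:.2f} +/- {t_crit * se:.2f} (95% CI), P = {p_value:.3f}")
```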
Conclusions

It is easy to criticize. It is harder to criticize constructively, because constructive criticism carries with it the responsibility to offer recommendations for improvement. I offer some recommendations here, recommendations that I hope will prove to be constructive.

My main recommendation is for wildlife researchers to stop taking statistical testing so seriously. I believe that most statisticians would agree with Cox and Snell (1981:39) that testing has "a valuable but limited role to play in the analysis of data." The presence of some 2,400 P-values in volume 59 of the Journal indicates that the use of testing by wildlife researchers is hardly limited.

I make the following recommendations:

1. Investigators should determine if they are more interested in estimation of effects or testing for the presence or absence of effects. In most studies estimation will be more important than testing. Estimation requires investigators to consider what quantitative characteristics best measure the effects of interest, followed by ...


Literature cited

Neu, C. W., C. R. Byers, and J. M. Peek. 1974. A technique for analysis of habitat utilization-availability data. Journal of Wildlife Management 38:541-545.

Rosnow, R. L., and R. Rosenthal. 1989. Statistical procedures and the justification of knowledge in psychological science. American Psychologist 44:1276-1284.

Rothman, K. J. 1986. Significance questing. Annals of Internal Medicine 105:445-447.

Salsburg, D. S. 1985. The religion of statistics as practiced in medical journals. American Statistician 39:220-223.

Simon, R. 1986. Confidence intervals for reporting results of clinical trials. Annals of Internal Medicine 105:429-435.

Tukey, J. W. 1969. Analyzing data: sanctification or detective work? American Psychologist 24:83-91.

Tukey, J. W. 1991. The philosophy of multiple comparisons. Statistical Science 6:100-116.

Wilson, C. W., R. E. Masters, and G. A. Bukenhofer. 1995. Breeding bird response to pine-grassland community restoration for red-cockaded woodpeckers. Journal of Wildlife Management 59:56-67.

Yates, F. 1951. The influence of statistical methods for research workers on the development of the science of statistics. Journal of the American Statistical Association 46:19-34.

Yoccoz, N. G. 1991. Use, overuse, and misuse of significance testing in evolutionary biology and ecology. Bulletin of the Ecological Society of America 72:106-111.

Steve Cherry is currently associate professor of mathematics in the Department of Mathematics at Montana Tech of The University of Montana. He received his B.S. in applied mathematics at North Carolina State University and his M.S. in ecology at the University of Tennessee. He received an M.S. and a Ph.D. in statistics from Montana State University. He worked for several years as a Conservation Officer with the Utah Division of Wildlife Resources. He was a visiting postdoctoral researcher at the National Center for Atmospheric Research in Boulder, Colorado, and an adjunct assistant professor of statistics at Montana State University for 2 years. His research interests are in the areas of environmental and ecological statistics. He also enjoys teaching the science of statistics.