
sample size, variability, and effect size) provides the α level with the lowest possible combination of Type I and II errors for a given study design (Figure 1A), assuming the a priori probabilities of the alternative and null hypotheses are equal. For cases where the costs of Type I and II errors are known to be unequal, α can instead be set to minimize the probability-weighted cost of Type I and Type II errors at the critical effect size (C_I·α + C_II·β, where C_I and C_II represent the relative costs of Type I and Type II errors), to provide the α level with the lowest possible combined cost of Type I and II errors for a given study design (Figure 1B). As estimates of the relative costs of Type I and Type II errors are subjective, minimizing average costs of Type I and II errors, instead of minimizing average probabilities of Type I and II errors, should only be considered when there is a clear and defensible a priori justification that 1 type of error has higher costs than the other. An optimal α can be calculated for any test for which statistical power can be calculated. R code has been developed to perform the iterative steps necessary to calculate an optimal α for t tests, linear regression, and ANOVA, and is available from the authors.
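The calculation is easy to sketch. The following Python snippet is a minimal illustration, not the authors' R code: it grid-searches α for the Figure 1 setup (an independent, 2-sample, 2-tailed t test with n_1 = n_2 = 20 and a critical effect size of one pooled standard deviation); the grid resolution and the 4:1 cost ratio are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' R code): find the alpha that minimizes
# the average of the Type I and Type II error probabilities, (alpha + beta)/2,
# for an independent, 2-sample, 2-tailed t test. The setup mirrors Figure 1:
# n1 = n2 = 20 and a critical effect size of one pooled SD (Cohen's d = 1).
import numpy as np
from scipy import stats

def type2_error(alpha, d=1.0, n1=20, n2=20):
    """Type II error probability (beta) at the critical effect size d."""
    df = n1 + n2 - 2
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))    # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
    power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
    return 1 - power

alphas = np.linspace(1e-4, 0.5, 2000)
betas = np.array([type2_error(a) for a in alphas])

# Equal a priori probabilities: minimize the average error probability.
i = np.argmin((alphas + betas) / 2)
print(f"optimal alpha = {alphas[i]:.3f} (beta = {betas[i]:.3f})")

# Unequal costs: minimize C_I*alpha + C_II*beta, here assuming Type I errors
# are 4 times as costly as Type II errors (the dotted line in Figure 1B).
C_I, C_II = 4.0, 1.0
j = np.argmin(C_I * alphas + C_II * betas)
print(f"cost-optimal alpha = {alphas[j]:.3f} (beta = {betas[j]:.3f})")
```

The grid search stands in for whatever iterative scheme the authors' code uses; any one-dimensional minimizer over α in (0, 0.5] would serve equally well.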

IMPLICATIONS AND FUTURE RESEARCH

The α–β optimization method provides transparency of Type I and Type II error rates and yields a clear representation of the amount of evidence provided by a study (Mudge et al. 2012). Alpha–beta optimization can also allow sample sizes to be estimated for a desired average probability or cost of error (sketched below). A cost-based approach has important implications for research questions where the costs of mistakes can be reasonably estimated, because it provides an objective approach to decision-making that minimizes the dollars spent on mistakes. Where the costs of errors are more subjective and difficult to estimate, one important consequence of setting an optimal α is that it forces stakeholders to confront disagreements about the magnitude of the critical effect size and the relative costs of Type I and II errors (with the option to minimize probabilities of error when consensus on relative costs cannot be reached). Implementation of optimal α would demand dramatic changes in the way we approach hypothesis testing, would accelerate progress in an already burgeoning area (identifying critical effect sizes), and would encourage increased attention to the almost unexamined question of the relative costs of Type I and Type II errors in pure and applied studies.
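A companion sketch shows the sample-size use mentioned above: grow n until the optimized average error probability falls below a target. The 0.05 target and the effect size of one pooled standard deviation are assumed values chosen for illustration, not figures from the text.

```python
# Sketch of sample-size estimation via alpha-beta optimization: find the
# smallest per-group n whose best achievable average error probability,
# min over alpha of (alpha + beta)/2, falls below an assumed 0.05 target.
import numpy as np
from scipy import stats

def min_avg_error(n, d=1.0):
    """Smallest achievable (alpha + beta)/2 for equal group sizes n."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)                  # noncentrality for n1 = n2 = n
    alphas = np.linspace(1e-4, 0.5, 1000)
    t_crit = stats.t.ppf(1 - alphas / 2, df)
    power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
    return np.min((alphas + (1 - power)) / 2)

n = 2
while min_avg_error(n) > 0.05:
    n += 1
print(f"smallest per-group sample size meeting the target: n = {n}")
```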

Acknowledgment: We thank Kelly Munkittrick, Rémy Rochette, Thijs Bosker, Peter Chapman, and attendees of the 2010 Aquatic Toxicology Workshop (Oct 3–6; Toronto, Ontario) for discussions and encouragement that aided in the development of this work.

REFERENCES

Dayton PK. 1998. Reversal of the burden of proof in fisheries management. Science 279:821–822.
Field SA, Tyre AJ, Jonzén N, Rhodes JR, Possingham HP. 2004. Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecol Lett 7:669–675.
Mapstone BD. 1995. Scalable decision rules for environmental impact studies: effect size, Type 1, and Type 2 errors. Ecol Appl 5:401–410.
Mudge JF, Baker LF, Edge CB, Houlahan JE. 2012. Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS One 7:e32734.
Munkittrick KR, Arens CJ, Lowell RB, Kaminski GP. 2009. A review of potential methods of determining critical effect size for designing environmental monitoring programs. Environ Toxicol Chem 28:1361–1371.
Newman MC. 2008. What exactly are you inferring? A closer look at hypothesis testing. Environ Toxicol Chem 27:1013–1019.

SHOULD WE FORGET NOECs?

Francisco Sánchez-Bayo*

University of Technology Sydney, Lidcombe, Australia

*[email protected]

DOI: 10.1002/ieam.1312

In recent months there has been a great deal of criticism of the no-observed effect concentration (NOEC) and related concepts (e.g., lowest-observed effect concentration, LOEC) in the two SETAC journals, Environmental Toxicology and Chemistry (ET&C) and Integrated Environmental Assessment and Management (IEAM), to the point that a ban on their use in the scientific literature has been proposed (Landis and Chapman 2011). Although not everyone agrees on imposing a ban (Fox 2012), it is my understanding that unless such drastic measures are implemented, most authors will continue to estimate and use those outdated concepts.

The debate on this topic is overdue, as it is now 2 decades since the statistical flaws of such concepts were pointed out (Skalski 1981).

Figure 1. Influence of α on (A) the average of the probabilities of Type I and II error, (α + β)/2, and (B) the probability-weighted costs of Type I and II errors, C_I·α + C_II·β, when Type I errors are 4 times as costly as Type II errors (dotted line) and when Type I errors are 1/4 as costly as Type II errors (dashed line). Data shown are for an independent, 2-sample, 2-tailed t test with n_1 = n_2 = 20 and a critical effect size equal to the pooled standard deviation. Drop lines indicate optimal α levels under each circumstance.


As a scientific community we still seem incapable of changing our way of doing things. And if we cannot change our entrenched bad habits, can we expect regulators to do otherwise?

Some critics have blamed a lack of understanding among most ecotoxicologists of the limitations of NOECs and LOECs and their statistical flaws as the main reason for not having changed direction (Warne and van Dam 2008). We have reached this situation because most toxicology textbooks teach how to estimate these measures of toxicity without any criticism. Correcting this problem may require: i) teaching undergraduate students the pitfalls of these procedures, while teaching them new alternative approaches and methods for assessing dose-response relationships (Olmstead and LeBlanc 2005; Fox 2010; Ashauer et al. 2011), and ii) retraining staff at universities, toxicity testing laboratories, chemical laboratories, and government departments and agencies involved in ecotoxicology and risk assessment. Some of us would be happy to provide such teaching and training.

However, training alone will not solve the problem of the continued use of NOECs and LOECs in scientific publications. As Jager (2012) indicates, there are already large numbers of NOEC values estimated for thousands of chemical compounds in the literature and in databases (e.g., ECOTOX). Should we ignore this information, built upon so many years of research? Maybe not. Despite all their statistical flaws, many NOEC values are similar to the no-effect concentration (NEC) values estimated by other techniques. For instance, Fox (2010) reports that 3 of 4 "no effect estimates" for waters from a mine in northern Australia obtained by traditional NOEC methods were comparable to calculated NEC values. The implication is clear: existing NOEC and LOEC data could be sufficiently reliable in many cases to be used in risk assessments and, therefore, should be allowed in regulatory assessments whenever more reliable data (e.g., NEC, EC10) are lacking or unavailable.

Perhaps we should read carefully what Landis and Chapman (2011) said about their proposed ban. They called on the editors of the SETAC journals to ban "statistical hypothesis tests for the reporting of exposure-response from their journals." The ban should be on publishing new NOEC or LOEC values that have been derived in accordance with the traditional methods. It shouldn't apply, as I understand it, when authors use such data (taken from databases, for instance) to justify some point in their research. For example, in a recent work aimed at evaluating suitable endpoints for assessing the impacts of 2 insecticides at the community level (Sánchez-Bayo and Goka 2012), we compared the protective levels estimated by our methods with those derived from species sensitivity distributions (SSDs) of the insecticides. We collected both LC50 and NOEC data from the ECOTOX database to generate the respective SSDs using BurrliOz (http://www.cmis.csiro.au/envir/burrlioz/) and compared the protective values derived from these distributions with those obtained by our methods. We found that NOECs were under-protective by a factor of 10 for one of the 2 insecticides, whereas for the other there was no significant difference in the protective levels estimated by either method.
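For readers unfamiliar with the mechanics, the sketch below illustrates the general SSD-to-protective-value calculation. It is illustrative only: it fits a log-normal SSD rather than the Burr-family distributions that BurrliOz fits, and the per-species NOECs are invented values, not data from Sánchez-Bayo and Goka (2012).

```python
# Illustrative sketch of deriving a protective concentration from a species
# sensitivity distribution (SSD). A log-normal SSD is used for simplicity;
# BurrliOz, cited above, fits Burr-family distributions instead. The NOECs
# below are hypothetical values, not data from the study.
import numpy as np
from scipy import stats

noecs_ug_per_L = np.array([0.3, 1.2, 2.5, 4.0, 8.1, 15.0, 22.0, 40.0])

# Fit a normal distribution to the log10-transformed toxicity values.
mu, sigma = stats.norm.fit(np.log10(noecs_ug_per_L))

# HC5: the concentration expected to protect 95% of species.
hc5 = 10 ** stats.norm.ppf(0.05, loc=mu, scale=sigma)
print(f"HC5 = {hc5:.2f} ug/L")
```

Comparing an HC5 derived from a NOEC-based SSD with one derived from an LC50-based SSD, as the study did, would then show whether the NOEC curve is under- or over-protective.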

Unfortunately, regulation has been, and still is, the driver that encourages most ecotoxicologists to use NOECs. While recently reviewing a set of ecotoxicity testing protocols for a regulatory agency, it struck me that the authors kept estimating NOEC values for all the surrogate species they used. It seems they did so because they wanted to compare the NOEC values estimated for the species they tested with NOEC values available for other species in other countries. Were they unaware of the various warnings concerning the use of this metric? Or rather, did they regard the use of NOECs as plausible for the regulatory purpose they were intended for, in spite of other statistical or 'academic' considerations? As Jager (2012) has indicated, the Organisation for Economic Co-operation and Development, the European Chemicals Agency, and the International Organization for Standardization all recommend phasing out NOECs in their regulatory advisories. So, what can be done to ensure that the very people involved in producing regulatory protocols abide by the new rules? I do not have an answer to this crucial question, but I am sure readers will come up with some practical solutions.

Coming back to my initial query, what can be done to extirpate the bad habits among us? As an advocate of using alternative measures to assess toxicity and toxic relationships (e.g., time-to-event toxicity protocols; Ashauer et al. 2011), I give my support to the ban proposed by Landis and Chapman (2011) on their own terms, i.e., as quoted above. In fact, I would like the ban to be extended to other journals that publish environmental toxicology research, such as Ecotoxicology, Ecotoxicology and Environmental Safety, Archives of Environmental Contamination and Toxicology, etc. Can I suggest that our Society approach the editors of such journals to call for this ban? Having said this, I would warn the editors to be aware of the difficulties in stopping the publication of NOEC and LOEC data that are already in the public domain. Science progresses step by step, using former concepts and data as stepping stones for building new concepts and better data. Discarding all that has been done in the past would do ecotoxicology no good.

REFERENCES

Ashauer R, Agatz A, Albert C, Ducrot V, Galic N, Hendriks J, Jager T, Kretschmann A, O'Connor I, Rubach MN, and others. 2011. Toxicokinetic-toxicodynamic modeling of quantal and graded sublethal endpoints: A brief discussion of concepts. Environ Toxicol Chem 30:2519–2524.
Fox DR. 2010. A Bayesian approach for determining the no effect concentration and hazardous concentration in ecotoxicology. Ecotoxicol Environ Saf 73:123–131.
Fox DR. 2012. Response to Landis and Chapman (2011). Integr Environ Assess Manag 8:4.
Jager T. 2012. Bad habits die hard: The NOEC's persistence reflects poorly on ecotoxicology. Environ Toxicol Chem 31:228–229.
Landis WG, Chapman PM. 2011. Well past time to stop using NOELs and LOELs. Integr Environ Assess Manag 7(4):vi–viii.
Olmstead AW, LeBlanc GA. 2005. Toxicity assessment of environmentally relevant pollutant mixtures using a heuristic model. Integr Environ Assess Manag 1:114–122.
Sánchez-Bayo F, Goka K. 2012. Evaluation of suitable endpoints for assessing the impacts of toxicants at the community level. Ecotoxicology 21:667–680.
Skalski JR. 1981. Statistical inconsistencies in the use of no-observed-effect-levels in toxicity testing. In: Branson DR, Dickson KL, editors. Aquatic Toxicology and Hazard Evaluation. Philadelphia (PA): American Society for Testing and Materials. p 337–387.
Warne MSJ, van Dam R. 2008. NOEC and LOEC data should no longer be generated or used. Australas J Ecotoxicol 14:1–5.
