evaluation of the causal framework used for setting …...evaluation of the causal framework used...

22
2013 http://informahealthcare.com/txc ISSN: 1040-8444 (print), 1547-6898 (electronic) Crit Rev Toxicol, 2013; 43(10): 829–849 ! 2013 Informa Healthcare USA, Inc. DOI: 10.3109/10408444.2013.837864 REVIEW ARTICLE Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt, Sonja N. Sax, Lisa A. Bailey, and Lorenz R. Rhomberg Gradient, Cambridge, MA, USA Abstract A scientifically sound assessment of the potential hazards associated with a substance requires a systematic, objective and transparent evaluation of the weight of evidence (WoE) for causality of health effects. We critically evaluated the current WoE framework for causal determination used in the United States Environmental Protection Agency’s (EPA’s) assessments of the scientific data on air pollutants for the National Ambient Air Quality Standards (NAAQS) review process, including its methods for literature searches; study selection, evaluation and integration; and causal judgments. The causal framework used in recent NAAQS evaluations has many valuable features, but it could be more explicit in some cases, and some features are missing that should be included in every WoE evaluation. Because of this, it has not always been applied consistently in evaluations of causality, leading to conclusions that are not always supported by the overall WoE, as we demonstrate using EPA’s ozone Integrated Science Assessment as a case study. We propose additions to the NAAQS causal framework based on best practices gleaned from a previously conducted survey of available WoE frameworks. A revision of the NAAQS causal framework so that it more closely aligns with these best practices and the full and consistent application of the framework will improve future assessments of the potential health effects of criteria air pollutants by making the assessments more thorough, transparent, and scientifically sound. Keywords Air quality, causal framework, criteria pollutants, risk assessment, systematic review, weight of evidence History Received 1 July 2013 Revised 20 August 2013 Accepted 20 August 2013 Published online 1 October 2013 Table of Contents Abstract ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 829 Introduction ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 829 The NAAQS causal framework ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 830 WoE best practices ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 834 Phase 1: Define the causal question and develop criteria for study selection ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 835 Phase 2: Develop and apply criteria for review of individual studies ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 839 Phase 3: Integrate and evaluate evidence ... ... ... ... ... ... ... ... ... 840 Phase 4: Draw conclusions based on inferences ... ... ... ... ... ... 841 Case study: Ozone ISA ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 842 Phase 1 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843 Phase 2 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843 Evaluation of study quality ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843 Consideration of study limitations ... ... ... ... ... ... ... ... ... ... ... ... 843 Measurement error ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843 Confounders ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 844 Model specification ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 844 Phase 3 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 845 Consideration of modified Bradford Hill criteria ... ... ... ... ... 845 Weighing alternative views ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 846 Phase 4 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 846 Conclusions ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 847 Acknowledgements ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 847 Declaration of interest ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 848 References ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 848 Introduction A scientifically sound assessment of the potential risks associated with a substance requires a systematic, objective and transparent evaluation of the weight of evidence (WoE) for causality of health effects. Many different methods for conducting WoE evaluations have been used by authoritative bodies and reported in the scientific literature (e.g. Adami et al., 2011; Borgert et al., 2011; ECETOC, 2009; IARC, 2006; Rhomberg et al., 2010, 2011a, 2013; Swaen & van Amelsvoort, 2009). The United States Environmental Protection Agency (EPA) uses WoE approaches to assess the risks of health effects from exposures to chemicals in the environment, and these assessments are used to support a wide range of regulatory activities. For example, as part of the process for setting health-based National Ambient Air Quality Standards (NAAQS), EPA periodically evaluates the WoE for potential health risks of exposure to each of six substances that are classified as criteria air pollutants [particulate matter (PM), sulfur dioxide, nitrogen dioxide, carbon monoxide, ozone and lead] because of their national-scale occurrence and public health significance. These evaluations are pre- sented as an Integrated Science Assessment (ISA), in which Address for correspondence: Julie E. Goodman, Gradient, 20 University Road, Cambridge, MA 02138, USA. Tel: 617-395-5000. E-mail: [email protected]

Upload: others

Post on 31-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

2013

http://informahealthcare.com/txcISSN: 1040-8444 (print), 1547-6898 (electronic)

Crit Rev Toxicol, 2013; 43(10): 829–849! 2013 Informa Healthcare USA, Inc. DOI: 10.3109/10408444.2013.837864

REVIEW ARTICLE

Evaluation of the causal framework used for setting National AmbientAir Quality Standards

Julie E. Goodman, Robyn L. Prueitt, Sonja N. Sax, Lisa A. Bailey, and Lorenz R. Rhomberg

Gradient, Cambridge, MA, USA

Abstract

A scientifically sound assessment of the potential hazards associated with a substance requiresa systematic, objective and transparent evaluation of the weight of evidence (WoE) for causalityof health effects. We critically evaluated the current WoE framework for causal determinationused in the United States Environmental Protection Agency’s (EPA’s) assessments of thescientific data on air pollutants for the National Ambient Air Quality Standards (NAAQS) reviewprocess, including its methods for literature searches; study selection, evaluation andintegration; and causal judgments. The causal framework used in recent NAAQS evaluationshas many valuable features, but it could be more explicit in some cases, and some features aremissing that should be included in every WoE evaluation. Because of this, it has not alwaysbeen applied consistently in evaluations of causality, leading to conclusions that are not alwayssupported by the overall WoE, as we demonstrate using EPA’s ozone Integrated ScienceAssessment as a case study. We propose additions to the NAAQS causal framework based onbest practices gleaned from a previously conducted survey of available WoE frameworks. Arevision of the NAAQS causal framework so that it more closely aligns with these best practicesand the full and consistent application of the framework will improve future assessments of thepotential health effects of criteria air pollutants by making the assessments more thorough,transparent, and scientifically sound.

Keywords

Air quality, causal framework, criteriapollutants, risk assessment, systematicreview, weight of evidence

History

Received 1 July 2013Revised 20 August 2013Accepted 20 August 2013Published online 1 October 2013

Table of Contents

Abstract ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 829Introduction ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 829The NAAQS causal framework ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 830WoE best practices ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 834

Phase 1: Define the causal question and develop criteria for studyselection ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 835

Phase 2: Develop and apply criteria for review of individualstudies ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 839

Phase 3: Integrate and evaluate evidence ... ... ... ... ... ... ... ... ... 840Phase 4: Draw conclusions based on inferences ... ... ... ... ... ... 841

Case study: Ozone ISA ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 842Phase 1 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843Phase 2 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843

Evaluation of study quality ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...843Consideration of study limitations ... ... ... ... ... ... ... ... ... ... ... ...843

Measurement error ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 843Confounders ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 844Model specification ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 844

Phase 3 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 845Consideration of modified Bradford Hill criteria ... ... ... ... ...845Weighing alternative views ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...846

Phase 4 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 846Conclusions ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 847

Acknowledgements ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 847Declaration of interest ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 848References ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 848

Introduction

A scientifically sound assessment of the potential risks

associated with a substance requires a systematic, objective

and transparent evaluation of the weight of evidence (WoE)

for causality of health effects. Many different methods for

conducting WoE evaluations have been used by authoritative

bodies and reported in the scientific literature (e.g. Adami

et al., 2011; Borgert et al., 2011; ECETOC, 2009; IARC,

2006; Rhomberg et al., 2010, 2011a, 2013; Swaen & van

Amelsvoort, 2009). The United States Environmental

Protection Agency (EPA) uses WoE approaches to assess

the risks of health effects from exposures to chemicals in the

environment, and these assessments are used to support a

wide range of regulatory activities. For example, as part of the

process for setting health-based National Ambient Air Quality

Standards (NAAQS), EPA periodically evaluates the WoE for

potential health risks of exposure to each of six substances

that are classified as criteria air pollutants [particulate matter

(PM), sulfur dioxide, nitrogen dioxide, carbon monoxide,

ozone and lead] because of their national-scale occurrence

and public health significance. These evaluations are pre-

sented as an Integrated Science Assessment (ISA), in which

Address for correspondence: Julie E. Goodman, Gradient, 20 UniversityRoad, Cambridge, MA 02138, USA. Tel: 617-395-5000. E-mail:[email protected]

Page 2: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

EPA assesses the body of relevant scientific literature to draw

conclusions regarding the causal relationships between

exposure to each criteria pollutant and human health and

welfare effects (US EPA, 2013a). Information and conclu-

sions from the ISA are used by EPA in the Risk and Exposure

Assessment (REA) to evaluate human health or environmen-

tal risks associated with recent air quality conditions or air

quality that is estimated to just meet current or alternative

standards (US EPA, 2013b). In its Policy Assessment, EPA

bridges the gap between the scientific assessments in the ISA

and REA and judgments required by the EPA Administrator

regarding the retention or revision of any of the four elements

of each NAAQS (i.e. indicator, averaging time, level, and

statistical form) (US EPA, 2013b). In the ISA, decisions

regarding causality near the level of the standard carry

substantial weight in the Administrator’s judgments

(Bachmann, 2007; McClellan, 2012).

The current WoE framework for causal determination in

the NAAQS process (referred to herein as the ‘‘NAAQS

causal framework’’) was first introduced in the ISAs for

oxides of nitrogen and sulfur (US EPA, 2008a,b) and updated

in the ISA for PM (US EPA, 2009); it has been refined since

in the ISAs for carbon monoxide (US EPA, 2010a), ozone (US

EPA, 2013a), and lead (US EPA, 2012).

In addition to assessments of air pollutants in the NAAQS

process, EPA evaluates potential human health effects from a

broad variety of chemicals in its Integrated Risk and

Information System (IRIS) Program. An example of this is

EPA’s 2010 draft IRIS assessment of formaldehyde (US EPA,

2010b), for which the evidence for potential adverse effects

was compiled and evaluated in a similar manner as was the

evidence in the criteria pollutant ISAs. In 2011, a panel

assembled by the National Research Council (NRC) of the

National Academy of Sciences (NAS) published its Review of

the Environmental Protection Agency’s Draft IRIS Assessment

of Formaldehyde (NRC, 2011). In this report, the NAS panel

identified several issues with the WoE methodology used by

EPA, including some that were identified by other NRC

committees that had reviewed a number of IRIS assessments

over the previous decade. The panel called on EPA to

undertake a program to develop a transparent and defensible

methodology for WoE evaluations to improve the agency’s

IRIS assessments.

The NAS panel proposed a ‘‘roadmap’’ for reform and

improvement of the risk assessment process, noting that there

are many proven models for evidence-based reviews that

might provide guidance to EPA, including systematic WoE

approaches for hazard identification. The panel recommended

that EPA’s WoE evaluation include clear narratives that

provide the rationale for conclusions; furthermore, the panel

noted that conclusions suggesting an association in the face of

mixed results (e.g. positive, weak, and null studies for a given

endpoint) should be accompanied by a thorough discussion of

the strengths and weaknesses of the underlying studies. The

NAS panel stated that EPA’s current process for NAAQS

reviews, including the development of an ISA, is a useful

example of revising an established process for conducting

evaluations in a relatively short period of time. It noted,

however, that the current NAAQS process is not an example

of a specific approach for evidence review that should be

adopted for revision of EPA’s IRIS process. In fact, several

issues that the NAS panel noted with the formaldehyde

assessment could also be considered applicable to revision

and improvement of the NAAQS WoE evaluation

methodology.

After describing the NAAQS causal framework below, we

discuss best practices for WoE evaluations gleaned from other

available WoE frameworks and note approaches that should

be stated explicitly in the NAAQS causal framework to

decrease the likelihood of biased evaluations. We then present

a case study in which we cite several examples from the ozone

ISA that demonstrate how EPA’s use of the current NAAQS

causal framework leads to an inconsistent evaluation that

often leads to conclusions of causality that are not fully

supported by the evidence.

The NAAQS causal framework

The current NAAQS causal framework used by EPA to

conduct evaluations of potential health effects associated with

criteria air pollutants is described in the Preamble section of

the recent lead and ozone ISAs (US EPA, 2012, 2013a). The

Preamble states that the NAAQS causal framework draws

from language in sources across the federal government and

scientific community, particularly the Institute of Medicine

(IOM) report Improving the Presumptive Disability Decision-

making Process for Veterans (IOM, 2008). Although the IOM

(2008) report provides guidance for making decisions

regarding health effects in veterans, and not from exposures

to criteria air pollutants, the Preamble notes that it is

applicable because it is a comprehensive report on the

evaluation of causality.

The NAAQS causal framework is used to determine the

WoE in support of causation and characterize the strength of

any resulting causal classification after a consideration of the

evidence. In addition, EPA uses the causality classification to

decide if a quantitative risk assessment will be conducted on a

particular health effect. Table 1 lists the general steps of the

NAAQS causal framework.

In the Preamble of the ozone ISA, EPA first discusses how

studies are identified for inclusion in the evaluation and then

describes the framework itself. For brevity, and because study

identification and selection are necessary steps prior to an

evaluation of causality, hereafter we include these initial steps

of study identification and selection methods described in the

Preamble, as well as study evaluation and integration, when

referring to the NAAQS causal framework. As part of the

identification step, for each periodic NAAQS review, EPA

maintains an ongoing literature search process to identify

relevant studies published since the last NAAQS review of a

given pollutant. The Preamble states that search strategies are

designed for each pollutant and iteratively modified to

optimize the identification of pertinent studies. Studies are

also identified for inclusion through reviewing the tables of

contents of journals in which relevant papers may be

published, identification of relevant studies by expert scien-

tists, reviewing citations from previous assessments, and

identification of papers by advisory committees and the

public. Studies are considered for inclusion if they have

undergone scientific peer review or if they include analyses

830 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 3: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

conducted by EPA using publicly available data. EPA states

that all relevant controlled human exposure, epidemiology

and toxicology studies are considered.

The Preamble in the ozone ISA states that the selection of

studies for inclusion is based on the general scientific quality

of the study, with consideration of the extent to which the

study is informative and relevant to the NAAQS process. The

assessment of study quality and relevance considers factors

such as whether the study subjects, populations or animal

models are adequately selected and sufficiently well-defined

to allow for meaningful comparisons between groups; the

statistical analyses are appropriate and properly performed;

the exposure data are of adequate quality and representative of

ambient conditions; the health effect measurements are

meaningful, valid and reliable; and the analytical methods

provide adequate sensitivity and precision to support conclu-

sions. Other considerations discussed by EPA include whether

potential confounders and effect modifiers are assessed in

epidemiology studies; control exposures to filtered air are

included in controlled human exposure studies; and there is

sufficient statistical power to assess findings in both

controlled human exposure and animal toxicology studies.

EPA states that of most relevance for study inclusion is

whether it provides useful qualitative or quantitative infor-

mation regarding exposure–response relationships at expos-

ures relevant to ambient conditions that can inform decisions

on whether to retain or change the NAAQS.

After the discussion of study selection, the Preamble

describes the NAAQS causal framework. This includes a

discussion of some of the general strengths and limitations

that should be considered when evaluating controlled human

exposure, epidemiology, and animal toxicology studies. The

framework uses a modified Bradford Hill approach (Hill,

1965) to aid in judgments regarding causal determinations,

including consideration of criteria (as they are commonly

referred to) such as consistency of observed associations,

coherence, biological plausibility, biological gradient,

strength of the observed association, experimental evidence,

temporality, specificity and analogy (Table 2). The original

Bradford Hill criteria were developed mainly for the inter-

pretation of epidemiology results and are not meant to be

specific rules to follow. EPA modified them in the NAAQS

causal framework for use with the broad array of study types

that are considered in the ISAs (e.g. epidemiology, toxicology,

mechanistic), to be more consistent with EPA’s Guidelines for

Carcinogen Risk Assessment (US EPA, 2005). Regarding how

Table 1. EPA NAAQS process for causal determination.

Step

Literature searching Specialized searches on specific topicsReview tables of contents of relevant journalsIdentification by expert scientistsReview citations in previous assessmentsIdentification by the public and advisory committeesIterative modification to optimize identification of pertinent publications

Selection of studies for inclusion Peer-reviewedEPA analyses of publicly-available dataBased on study qualityConsiders policy-relevance of studiesConsiders relevance to ambient exposures

Consideration of general limitations of each study type Controlled human exposureEpidemiologyToxicology

Use of modified Bradford Hill aspects to aid in judging causality ConsistencyCoherenceBiological plausibilityBiological gradientStrength of associationExperimental evidenceTemporalitySpecificityAnalogy

Evaluate evidence for major health outcome categories

Integrate evidence from across disciplines (controlled exposure, epi-demiology, toxicology) and across health endpoints

Incorporate peer and public comment and advice

Weigh alternative views on controversial issues

Characterize strength of evidence into causal conclusions Causal relationshipLikely to be a causal relationshipSuggestive of a causal relationshipInadequate to infer a causal relationshipNot likely to be a causal relationship

Assess adversity of effects

Source: US EPA (2013a).

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 831

Page 4: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

the modified criteria, or ‘‘aspects’’ as EPA refers to them, are

to be used in the framework, EPA states:

Although these aspects provide a framework for assessing

the evidence, they do not lend themselves to being

considered in terms of simple formulas or fixed rules of

evidence leading to conclusions about causality (Hill,

1965). For example, one cannot simply count the number

of studies reporting statistically significant results or

statistically nonsignificant results and reach credible

conclusions about the relative weight of the evidence and

the likelihood of causality. Rather, these aspects provide a

framework for systematic appraisal of the body of

evidence, informed by peer and public comment and

advice, which includes weighing alternative views on

controversial issues. (US EPA, 2013a)

The NAAQS causal framework suggests that evidence be

evaluated for major health outcome categories (e.g. respira-

tory effects) and conclusions be drawn based on the

integration of evidence from across health disciplines and

across the spectrum of related health endpoints (e.g. health

effects ranging from inflammation of the lungs to respiratory

disease mortality). In drawing judgments, there is a focus on

evidence of effects in the range of relevant human exposures

to a given pollutant, with a general consideration of studies

with doses or exposures in the range of no more than one or

two orders of magnitude above current or ambient conditions.

In discussing causal determinations, EPA characterizes the

evidence on which the judgment is based, including the

strength of evidence for individual endpoints within each

major health outcome category. Based on these characteriza-

tions, conclusions regarding the WoE for causation are

classified using a five-level hierarchy: ‘‘Causal relationship’’;

‘‘Likely to be a causal relationship’’; ‘‘Suggestive of a causal

relationship’’; ‘‘Inadequate to confer a causal relationship’’

and ‘‘Not likely to be a causal relationship’’ (Table 1). The

framework also discusses several factors for consideration in

delineating between adverse and non-adverse health effects

resulting from exposure to air pollution.

Table 2. EPA aspects to aid in judging causality.

Aspect Description

Consistency of the observed association An inference of causality is strengthened when a pattern of elevated risks is observedacross several independent studies. The reproducibility of findings constitutes one ofthe strongest arguments for causality. If there are discordant results amonginvestigations, possible reasons such as differences in exposure, confounding factors,and the power of the study are considered.

Coherence An inference of causality from one line of evidence (e.g. epidemiologic, controlledhuman exposure [clinical], or animal studies) may be strengthened by other lines ofevidence that support a cause-and-effect interpretation of the association. Thecoherence of evidence from various fields greatly adds to the strength of an inferenceof causality. In addition, there may be coherence in demonstrating effects acrossmultiple study designs or related health endpoints within one scientific line ofevidence.

Biological plausibility An inference of causality tends to be strengthened by consistency with data fromexperimental studies or other sources demonstrating plausible biological mechan-isms. A proposed mechanistic linking between an effect and exposure to the agent isan important source of support for causality, especially when data establishing theexistence and functioning of those mechanistic links are available.

Biological gradient (exposure–response relationship) A well-characterized exposure–response relationship (e.g. increasing effects associatedwith greater exposure) strongly suggests cause and effect, especially when suchrelationships are also observed for duration of exposure (e.g. increasing effectsobserved following longer exposure times).

Strength of the observed association The finding of large, precise risks increases confidence that the association is not likelydue to chance, bias or other factors. However, it is noted that a small magnitude in aneffect estimate may represent a substantial effect in a population.

Experimental evidence Strong evidence for causality can be provided through ‘‘natural experiments’’ when achange in exposure is found to result in a change in occurrence or frequency of healthor welfare effects.

Temporal relationship of the observed association Evidence of a temporal sequence between the introduction of an agent, and appearanceof the effect, constitutes another argument in favor of causality.

Specificity of the observed association Evidence linking a specific outcome to an exposure can provide a strong argument forcausation. However, it must be recognized that rarely, if ever, does exposure to apollutant invariably predict the occurrence of an outcome, and that a given outcomemay have multiple causes.

Analogy Structure activity relationships and information on the agent’s structural analogs canprovide insight into whether an association is causal. Similarly, information on modeof action for a chemical, as one of many structural analogs, can inform decisionsregarding likely causality.

Source: US EPA (2013a, Table I).

832 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 5: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

As discussed above, the NAAQS causal framework is

largely based on an IOM (2008) WoE framework. The current

IOM framework has four categories of causal determination,

with only the top two designating evidence that establishes a

causal relationship (top category) or a relationship for which

causality is ‘‘at least as likely as not’’ (second category)

(Table 3). In contrast, as shown in Table 4, the NAAQS

causal framework has five categories of causal determination,

with the top three designating evidence that establishes a

causal relationship (i.e. causal, likely causal, or suggestive

of a causal relationship). Further, in contrast to the IOM

approach, the NAAQS causal framework states that only

one positive study is sufficient to establish a suggestive

causal relationship when the results of other studies are

inconsistent; this can greatly increase the frequency of

positive causal conclusions.

Table 4. EPA’s weight of evidence for causal determination.

Causal determination Health effects

Causal relationship Evidence is sufficient to conclude that there is a causal relationship with relevant pollutant exposures (i.e.doses or exposures generally within one to two orders of magnitude of current levels). That is, thepollutant has been shown to result in health effects in studies in which chance, bias, and confoundingcould be ruled out with reasonable confidence. For example: (a) controlled human exposure studiesthat demonstrate consistent effects; or (b) observational studies that cannot be explained by plausiblealternatives or are supported by other lines of evidence (e.g. animal studies or mode of actioninformation). Evidence includes multiple high-quality studies.

Likely to be a causal relationship Evidence is sufficient to conclude that a causal relationship is likely to exist with relevant pollutantexposures, but important uncertainties remain. That is, the pollutant has been shown to result in healtheffects in studies in which chance and bias can be ruled out with reasonable confidence but potentialissues remain. For example: (a) observational studies show an association, but copollutant exposuresare difficult to address and/or other lines of evidence (controlled human exposure, animal, or mode ofaction information) are limited or inconsistent; or (b) animal toxicological evidence from multiplestudies from different laboratories that demonstrate effects, but limited or no human data are available.Evidence generally includes multiple high-quality studies.

Suggestive of a causal relationship Evidence is suggestive of a causal relationship with relevant pollutant exposures, but is limited. Forexample, (a) at least one high-quality epidemiologic study shows an association with a given healthoutcome but the results of other studies are inconsistent; or (b) a well-conducted toxicological study,such as those conducted in the National Toxicology Program (NTP), shows effects in animal species.

Inadequate to infer a causal relationship Evidence is inadequate to determine that a causal relationship exists with relevant pollutant exposures.The available studies are of insufficient quantity, quality, consistency, or statistical power to permit aconclusion regarding the presence or absence of an effect.

Not likely to be a causal relationship Evidence is suggestive of no causal relationship with relevant pollutant exposures. Several adequatestudies, covering the full range of levels of exposure that human beings are known to encounter andconsidering at-risk populations, are mutually consistent in not showing an effect at any level ofexposure.

Source: US EPA (2013a, Table II).

Table 3. IOM recommended categories for the level of evidence for causation.

Causal determination Evidence

Sufficient The evidence is sufficient to conclude that a causal relationship exists. For example: (a) replicated and consistent evidenceof an association from several high-quality epidemiologic studies that cannot be explained by plausible noncausalalternatives (e.g. chance, bias or confounding); or (b) evidence of causation from animal studies and mechanisticknowledge; or (c) compelling evidence from animal studies and strong mechanistic evidence from studies in exposedhumans, consistent with (i.e. not contradicted by) the epidemiologic evidence.

Equipoise and above The evidence is sufficient to conclude that a causal relationship is at least as likely as not, but not sufficient to conclude thata causal relationship exists. For example: (a) evidence of an association from the preponderance of several high-qualityepidemiologic studies that cannot be explained by plausible noncausal alternatives (e.g. chance, bias or confounding) aswell as animal evidence and biological knowledge consistent with a causal relationship; or (b) strong evidence fromanimal studies or mechanistic evidence that is not contradicted by human or other evidence.

Below equipoise The evidence is not sufficient to conclude that a causal relationship is at least as likely as not, or is not sufficient to makea scientifically informed judgment. For example: (a) consistent human evidence of an association that is limited bythe inability to rule out chance, bias, or confounding with confidence, and weak animal or mechanistic evidence; or(b) animal evidence suggestive of a causal relationship, but weak or inconsistent human and mechanistic evidence; or(c) mechanistic evidence suggestive of a causal relationship, but weak or inconsistent animal and human evidence; or(d) the evidence base is very thin.

Against The evidence suggests the lack of a causal relationship. For example: (a) consistent human evidence of no causalassociation from multiple studies covering the full range of exposures encountered by humans; or (b) animal ormechanistic evidence supportive of a lack of a causal relationship.

Source: IOM (2008).

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 833

Page 6: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

There are many valuable features in the current NAAQS

causal framework. These include literature searching that is

iterative, consideration of modified Bradford Hill criteria to

aid in judgments regarding causal determinations, weighing

alternative views on controversial issues, integration of

judgments of the evidence across disciplines and endpoints,

focusing on evidence of health effects in the range of human-

relevant exposures, consideration of the adversity of health

outcomes and a hierarchy for categorization of the strength of

the evidence.

Some aspects of the NAAQS causal framework could be

more specific, however, and the framework does not explicitly

include several features that, we would argue, are critical for

a thorough and internally consistent WoE evaluation. For

example, there is no explicit guidance for literature search

strategies or determining study exclusion criteria. There is

also no explicit guidance for evaluating the strengths and

weaknesses of studies and no information on how to use study

quality criteria to assign a quality weight to individual studies.

There are no clear statements regarding how the Bradford Hill

criteria should be applied (e.g. there is no indication of what

constitutes a strong association) or how the causality judg-

ments should consider all criteria jointly. There is also no

guidance on how to evaluate alternative hypotheses that are

supported by the data (e.g. a given substance is not a causal

factor for a health effect and apparent associations are

explicable by other factors) or what criteria should be used

to determine which hypothesis is most supported by the data.

As indicated in the NAS roadmap, EPA needs to state

the justification for its judgments explicitly – including the

reasoning behind judgments – to aid transparency.

Because of these and other limitations discussed below,

EPA’s NAAQS causal framework has not always been applied

consistently in evaluations of causality for individual sub-

stances – much less across substances – and adequate

transparency regarding the justification of specific causal

judgments is not achieved in the criteria air pollutant ISAs.

We provide specific examples of this with a case study of the

ozone ISA (US EPA, 2013a) below.

WoE best practices

We recently conducted a survey of existing WoE frameworks,

including the NAAQS causal framework, to evaluate WoE

best practices (Rhomberg et al., 2013). From these frame-

works, we identified four successive phases to the overall

WoE process that are consistent with the critical steps for the

development of a scientifically sound assessment of the WoE

for causation, as identified by the NAS panel that reviewed

EPA’s formaldehyde assessment. These critical steps should

be considered by EPA in developing an updated NAAQS

framework. Below, we describe these phases; Table 5 presents

their general features.

Phase 1. Define the causal question and develop criteria

for study selection. Frame the purpose of the evaluation and

the causal questions to be evaluated, and define the criteria for

selecting studies relevant to the evaluation to ensure trans-

parency. Prior to this phase, there is a separate role for

problem formulation, which could be considered as ‘‘Phase

0’’ and entails the evaluation of what is needed to address the

decisions to be made, along with an assessment of the

capacity of an analysis of available data to provide answers

addressing those needs.

Phase 2. Develop and apply criteria for review of

individual studies. Conduct and present a systematic and

consistent review of available studies relevant to the causal

Table 5. Weight-of-evidence best practices.

Phase Steps

Phase 1Define the causal question and develop criteria for study selection Define causal question or hypothesis

Define criteria for study inclusion and exclusionPlan literature searchDesign literature search strategiesScreen and select studies, with justification for exclusionRecord search strategies, results, and decisions

Phase 2Develop and apply criteria for review of individual studies Extract study characteristics

Extract dataAssess study qualityCategorize study qualityAssess individual study resultsCategorize study relevance and adequacy

Phase 3Integrate and evaluate evidence Evaluate data within and across realms of evidence

(e.g. toxicology, epidemiology, MoA)Assess MoAAssess adversity of effectsCompare alternative accounts of observationsFormulate WoE conclusionsIdentify data gaps and propose next steps

Phase 4Draw conclusions based on inferences Propose categories for causal relationships

Propose recommendations for risk assessmentPropose recommendations for research or policies

Source: Rhomberg et al. (2013).

834 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 7: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

question. Evaluate the rigor and quality of individual study

results using pre-defined criteria applied uniformly across

studies.

Phase 3. Integrate and evaluate evidence. Make sound

and defensible scientific judgments about the existence and

nature of causative processes for the health outcome under

consideration. This is one of the more challenging phases for

any WoE framework; no matter how one lays out procedures

and methods for synthesizing across studies, in the end, the

question is about how studies in one setting (e.g. animal or

in vitro assays) should affect our assessment of potential

causality or risks in another (e.g. the general human

population exposed environmentally).

Phase 4. Draw conclusions based on inferences. Apply the

results of the WoE evaluation from Phase 3 to make

conclusions that can be used to inform regulatory decision-

making. Although this phase is not risk management itself, it

can be influenced by risk management considerations. In a

regulatory setting, decisions about WoE categories (‘‘known’’

causative agent, ‘‘likely’’ causative agent, etc.) or findings

about the science [sufficient evidence for a mode of action

(MoA) or to replace a default assumption for developing a

toxicity value] are influenced in this stage by policy questions

and regulatory consequences for those decisions and, ultim-

ately, by policies and judgments about the sufficiency of

evidence to support those decisions.

The most important aspect of a WoE framework is that it

provides specific and transparent guidance for how to work

through the WoE process. At the same time, the framework

needs to be flexible so that it can be applied in a consistent

manner across different types of datasets for evaluations of

causality. A WoE evaluation is only useful and applicable to

constructive scientific debate if the logic behind it is made

clear, and, with that, it is often necessary to take the reader

through alternative interpretations of the data so that the

various interpretations can be compared logically. This

approach does not eliminate the need for scientific judgment,

and often may not lead to a definitive choice of one

interpretation over the other, but it will clearly lay out the

logic for how one weighs the evidence for and against each

interpretation. Only in this way is it possible to have

constructive scientific debate about potential causality that

is focused on an organized, logical ‘‘weighing’’ of the

evidence.

The NAAQS causal framework provides guidance on

many features of the WoE phases noted above. Some of them

are explicit and some are clearly included in the framework,

but specific guidance is not always provided (e.g. study

exclusion criteria). As a result, evaluations are not always

consistent either among studies for a particular criteria

pollutant or among pollutants, as discussed below in the

case study of ozone.

Table 5 outlines WoE best practices generally and Table 6

outlines features of the NAAQS causal framework, as well as

additional features that we recommend based on best

practices. These additions will enable the framework to

include these WoE best practices and help ensure that they are

applied consistently. Several of the recommended features

may have, in practice, been incorporated into WoE evalu-

ations in the criteria pollutant ISAs; however, if they are not

explicitly stated as guidance in the discussion of the NAAQS

causal framework in the ISA Preamble, or if they are

discussed as occurring in a Phase that does not match

where they should occur according to WoE best practices,

they are still noted in Table 6 as additional, best practice

features. Below, we discuss these features by phase.

Phase 1: Define the causal question and developcriteria for study selection

The causal question and breadth of scope should be explicitly

stated in Phase 1 of a WoE evaluation to provide a clear

purpose for the assessment as well as direction for the

remaining steps. Defining the causal question at the beginning

of the process helps ensure that the correct question is posed,

and it sets the context within which the bearing and utility of

studies can be evaluated. Study selection criteria and the

literature search design should be articulated clearly (Tables 5

and 6). The end product of Phase 1 is a clearly defined causal

question, a documented search strategy with clear inclusion

and exclusion criteria, a list of included and excluded studies

with reasons for inclusion or exclusion, and a record of any

deviations from the original plan.

In the NAAQS causal framework, the definition of the

causal question is inferred from the purpose of the ISA in the

NAAQS process, which is to provide a ‘‘concise review,

synthesis, and evaluation of the most policy-relevant science

to serve as a scientific foundation for the review of the

[NAAQS]’’ (US EPA, 2013a). EPA does not, but in our view

should, explicitly resolve the overarching causal question into

specific issues for problem formulation. For example, EPA

could specify aspects of the analysis regarding what is

relevant to existing human exposures, the basis for what

constitutes sensitive subpopulations, whether exposure to

other agents affects response, and interactions with other

stressors. In other words, by specifying whether the purpose is

a ‘‘hazard in context’’ decision or a more general identifica-

tion of a potential hazard at any exposure level, this would

help to guide decisions on whether or not it is found in the

actual population. Problem formulation should try to identify

the specific sub-judgments that need to be made to support

the overall goal, with the aim of enabling assessment of

whether the available data speak to and enable sound

decisions on them.

With regard to the literature search, the Preamble of the

ozone ISA provides an overview of EPA’s general literature

search strategy and the various ways in which studies are

identified for potential inclusion. Some of the study inclusion

criteria are also presented, and an evaluation of study quality

is included in the study selection process. However, the

Preamble does not provide guidance regarding study exclu-

sion criteria or justification for inclusion or exclusion of

studies. As shown in Table 6, the NAAQS causal framework

should include explicit guidance regarding literature search

strategies for identifying all available evidence, as well as

criteria for both study inclusion and exclusion (e.g. study

type, types of participants, exposures, outcomes). While

relevance to ambient exposures to criteria pollutants is

important to consider at this point, allowances may be made

for scientific information that does not meet inclusion criteria

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 835

Page 8: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

Tab

le6

.M

od

ifie

dN

AA

QS

cau

sal

fram

ewo

rk.

Ste

pN

ote

s

Ph

ase

1

Def

ine

cau

sal

qu

esti

on

or

hy

po

thes

isIn

ferr

edfr

om

NA

AQ

Sp

roce

ss(a

sis

pro

ble

mfo

rmu

lati

on

)D

eter

min

eb

rea

dth

of

sco

pe

Exp

lore

revi

ewq

ues

tio

ns

no

tid

enti

fied

ap

rio

riD

evel

oped

duri

ng

the

iter

ativ

epro

cess

of

the

lite

ratu

rese

arch

Def

ine

crit

eria

for

stu

dy

incl

usi

on

an

dex

clu

sio

nM

ayocc

ur

inse

ver

alst

ages

asnew

revie

wques

tions

are

iden

tifi

edbas

edon

lite

ratu

rese

arch

Det

erm

ine

stu

dy

typ

esD

efin

ety

pes

of

par

tici

pan

tsD

efin

eex

posu

res/

risk

fact

ors

/inte

rven

tions

Def

ine

ou

tco

mes

Co

nsi

der

do

se

Pla

nli

tera

ture

searc

h�

Iter

ativ

em

od

ific

atio

ns

too

pti

miz

eid

enti

fica

tio

no

fp

erti

nen

tp

ub

lica

tio

ns

�S

pec

iali

zed

sear

ches

on

spec

ific

topic

sIn

volv

eli

bra

ria

ns

Ch

oo

sed

ata

ba

ses

Iden

tify

arti

cles

by

exp

erts

,th

ep

ub

lic,

and

adv

iso

ryco

mm

itte

esR

evie

wre

fere

nce

list

s,ta

ble

of

con

ten

tso

fre

levan

tjo

urn

als,

cita

tio

ns

inp

rev

iou

sas

sess

men

tsS

earc

hgre

yli

tera

ture

Iden

tify

on

go

ing

/un

pu

bli

shed

stu

die

s

Des

ign

lite

ratu

rese

arc

hst

rate

gie

sC

ho

ose

key

wo

rds

Det

erm

ine

lan

gu

age,

da

te,

an

dd

ocu

men

tfo

rma

tre

qu

irem

ents

Scr

een

an

dse

lect

stu

die

s,w

ith

just

ific

ati

on

for

excl

usi

on

�A

sses

sag

reem

ent

of

stu

dy

sele

ctio

nam

on

gin

ves

tigat

ors

�In

clu

de

pee

r-re

vie

wed

stu

die

san

dE

PA

eval

uat

ion

so

fp

ub

lica

lly

avai

lab

led

ata

�C

on

sid

erre

levan

ceto

amb

ien

tex

po

sure

s�

Mak

eal

low

ance

sfo

rp

iece

so

fin

form

atio

nfr

om

area

so

uts

ide

of

incl

usi

on

/excl

usi

on

crit

eria

that

may

imp

act

eval

uat

ion

�D

on

ot

excl

ud

est

ud

ies

ba

sed

on

qu

ali

tyo

rif

stu

dy

rep

ort

sn

ull

or

neg

ati

vefi

nd

ing

sa

tth

isp

oin

t�

Do

no

tco

nsi

der

po

licy

-rel

eva

nce

of

stu

die

s

Rec

ord

sea

rch

stra

teg

ies,

resu

lts,

an

dd

ecis

ion

sS

ort

refe

ren

ces

inb

ibli

og

rap

hic

soft

war

e

Ph

ase

2

Ex

tra

ctst

ud

ych

ara

cter

isti

csT

hes

ein

clu

de

pa

rtic

ipa

nts

an

dse

ttin

g,

met

ho

ds,

inte

rven

tio

ns,

ou

tco

me

mea

sure

s,re

sult

s,a

nd

qu

ali

tya

spec

tsD

evel

op

qu

ali

tyco

ntr

ol

for

da

taen

try

Det

erm

ine

dis

agre

emen

tre

solu

tio

n

Ex

tra

ctd

ata

�S

ho

uld

be

ba

sed

on

rele

van

ceto

cau

sal

qu

esti

on

�M

ake

sure

all

rele

van

td

ata

are

extr

act

ed(i

.e.

itis

no

tsu

ffic

ien

tto

extr

act

asu

bse

to

fre

leva

nt

da

ta)

Det

erm

ine

da

taca

teg

ori

esO

rga

niz

eev

iden

cein

toco

nsi

sten

tse

tso

fca

teg

ori

esC

ate

go

rize

hea

lth

effe

ct(s

)b

ysp

ecif

icit

ya

nd

late

ncy

836 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 9: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

Ass

ess

stu

dy

qu

ali

tyD

iscu

ssgen

eral

lim

itat

ion

sto

con

sid

erfo

rea

chst

ud

yty

pe

(e.g

.co

ntr

oll

edh

um

anex

po

sure

,ep

idem

iolo

gy,

tox

ico

log

y)

Stu

dy

des

ign

Incl

ud

ea

sses

smen

to

fco

un

terf

act

ua

lity

of

stu

dy

des

ign

Str

eng

ths

an

dw

eakn

esse

sE

xp

osu

re/i

nte

rven

tio

nas

sess

men

tA

ssig

nh

igh

erw

eig

ht

tost

ud

ies

wit

hd

irec

tex

po

sure

mea

sure

men

t,ab

sen

ceo

fco

-po

llu

tan

tex

po

sure

s,o

ru

seo

fm

ult

i-p

oll

uta

nt

mo

del

sC

on

fou

nd

ers

Det

erm

ine

ifp

ote

nti

alco

nfo

un

der

sid

enti

fied

and

ifp

arti

ally

or

full

yad

dre

ssed

by

auth

ors

Sys

tem

ati

cer

ror

(bia

s)P

reci

sio

no

fm

easu

rem

ents

Rep

lica

bil

ity

of

ob

serv

ati

on

sR

elia

bil

ity

of

da

taS

tati

stic

alm

eth

od

sA

ssig

nh

igh

erw

eig

ht

tost

ud

ies

that

use

dst

atis

tica

lm

eth

od

sto

det

ect

and

con

tro

lfo

rco

nfo

un

din

gan

dm

ult

iple

com

par

iso

ns,

wh

enn

eces

sary

Ou

tlie

rsSel

ecti

veoutc

om

ere

port

ing

Iden

tify

fra

ud

ule

nt

stu

die

sE

lim

inat

efr

om

anal

ysi

s.

Ca

teg

ori

zest

ud

yq

ua

lity

�Q

ua

lita

tive

lyo

rb

yra

nki

ng

,gra

din

g,

or

sco

rin

g�

Do

no

tel

imin

ate

an

yst

ud

ies

ba

sed

on

wea

knes

s(u

nle

ssfa

tall

yfl

aw

eda

nd

no

tin

terp

reta

ble

)

Ass

ess

ind

ivid

ua

lst

ud

yre

sult

sS

tren

gth

of

ass

oci

ati

on

Det

erm

ine

ap

rio

rith

ed

efin

itio

no

fa

‘‘st

ron

g’’

ass

oci

ati

on

(e.g

.ri

skes

tim

ate�

3).

Use

the

sam

ecr

iter

iafo

rju

dg

ing

som

eth

ing

as

ari

skfa

cto

rvs

.a

pro

tect

ive

fact

or

(e.g

.a

risk

esti

ma

teo

f0

.99

2sh

ou

ldca

rry

the

sam

ew

eig

ht

as

ari

skes

tim

ate

of

1.0

08

)In

tern

al

con

sist

ency

Res

ult

su

sin

gd

iffe

ren

tm

od

els

sho

uld

be

rela

tive

lyco

nsi

sten

tB

iolo

gic

al

pla

usi

bil

ity

Bo

thd

ata

ind

ica

tin

gb

iolo

gic

al

pla

usi

bil

ity

an

da

lack

of

bio

log

ica

lp

lau

sib

ilit

ysh

ou

ldb

eco

nsi

der

edTem

po

rali

tyIt

isn

ot

on

lyth

at

the

expo

sure

mu

sto

ccu

rb

efo

rea

nef

fect

,b

ut

tha

tit

occ

urs

wit

hin

an

ap

pro

pri

ate

tim

efr

am

eD

ose

-res

po

nse

or

expo

sure

–re

spo

nse

Ifin

crea

sin

gef

fect

sa

reo

bse

rved

wit

hin

crea

sin

gex

po

sure

so

rd

ura

tio

no

fex

po

sure

s,th

isis

evid

ence

for

aca

usa

lre

lati

on

ship

;a

lack

of

this

ass

oci

ati

on

isev

iden

cea

ga

inst

aca

usa

lre

lati

on

ship

Va

ria

tio

n/h

isto

rica

lco

ntr

ol

rate

sD

eter

min

eh

ow

dep

end

ent

resu

lts

are

on

valu

esin

the

con

tro

lgro

up

,p

art

icu

larl

yif

they

dif

fer

fro

mh

isto

rica

lco

ntr

ols

Ou

tco

me

ass

essm

ent

Ra

nd

om

erro

r(c

ha

nce

)

Ca

teg

ori

zest

ud

yre

leva

nce

an

da

deq

ua

cy�

Det

erm

ine

wh

eth

eran

dto

wh

atex

ten

tre

sult

sar

ere

levan

tto

amb

ien

tex

po

sure

s�

Det

erm

ine

app

rop

riat

enes

san

du

sefu

lnes

so

fth

ed

ata

for

haz

ard

and

risk

asse

ssm

ent

Ph

ase

3

Eva

lua

ted

ata

wit

hin

an

da

cro

ssre

alm

so

fev

iden

ce(e

.g.

tox

ico

log

y,ep

idem

iolo

gy,

Mo

A)

�In

tegra

ted

ata

acr

oss

all

lin

eso

fev

iden

ceso

inte

rpre

tati

on

of

on

ew

ill

info

rmin

terp

reta

tio

no

fth

eo

ther

�A

sses

sa

lld

ata

,in

clu

din

gn

ega

tive

,n

ull

,a

nd

po

siti

vere

sult

s�

Ass

ign

less

wei

gh

tto

resu

lts

of

stu

die

so

flo

wer

qu

ali

ty�

Inco

rpo

rate

pee

ran

dp

ub

lic

com

men

tan

dad

vic

eS

tren

gth

of

asso

ciat

ion

Det

erm

ine

ap

rio

rith

ed

efin

itio

no

fa

‘‘st

ron

g’’

ass

oci

ati

on

(e.g

.ri

skes

tim

ate�

3).

Use

the

sam

ecr

iter

iafo

rju

dg

ing

som

eth

ing

as

ari

skfa

cto

rvs

.a

pro

tect

ive

fact

or

(e.g

.a

risk

esti

ma

teo

f0

.99

2sh

ou

ldca

rry

the

sam

ew

eig

ht

as

ari

skes

tim

ate

of

1.0

08

)C

on

sist

ency

of

asso

ciat

ion

sA

sso

cia

tio

ns

sho

uld

be

con

sist

ent

bo

thw

ith

ina

nd

acr

oss

stu

die

s,p

art

icu

larl

yth

ose

wit

hd

iffe

ren

tst

ud

yd

esig

ns

Co

her

ence

Res

ult

sa

cro

ssa

llli

nes

of

evid

ence

sho

uld

be

coh

eren

t;if

no

t,sp

ecie

scl

ose

stto

hu

ma

ns

sho

uld

be

con

sid

ered

toh

ave

mo

rere

leva

nce

toh

um

an

s

(co

nti

nu

ed)

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 837

Page 10: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

Tab

le6

.C

on

tin

ued

Ste

pN

ote

s

Bio

log

ical

pla

usi

bil

ity

Bo

thd

ata

ind

ica

tin

gb

iolo

gic

al

pla

usi

bil

ity

an

da

lack

of

bio

log

ica

lp

lau

sib

ilit

ysh

ou

ldb

eco

nsi

der

edB

iolo

gic

alg

rad

ien

t(d

ose

-res

po

nse

)If

incr

easi

ng

effe

cts

are

ob

serv

edw

ith

incr

easi

ng

exp

osu

res

or

du

rati

on

of

expo

sure

s,th

isis

evid

ence

for

aca

usa

lre

lati

on

ship

;a

lack

of

this

ass

oci

ati

on

isev

iden

cea

ga

inst

aca

usa

lre

lati

on

ship

Ex

per

imen

tal

evid

ence

Tem

po

rali

tyIt

isn

ot

on

lyth

at

the

expo

sure

mu

sto

ccu

rb

efo

rea

nef

fect

,b

ut

tha

tit

occ

urs

wit

hin

an

ap

pro

pri

ate

tim

efr

am

eS

pec

ific

ity

An

alo

gy

Co

nfo

un

din

gD

eter

min

eif

con

sist

ent

fin

din

gs

ma

yb

ea

resu

lto

fa

con

sist

ent

con

fou

nd

erB

ias

Ass

ess

da

tare

leva

nt

tom

od

eo

fa

ctio

n(M

oA

)P

ost

ula

teM

oA

theo

ryD

escr

ibe

key

even

tsin

Mo

AE

valu

ate

Mo

Ab

ase

do

nB

rad

ford

Hil

lcr

iter

iaC

on

sid

ero

ther

po

ssib

leM

oA

sC

on

sid

eru

nce

rta

inti

es,

inco

nsi

sten

cies

,a

nd

da

taga

ps

for

Mo

AD

eter

min

ew

het

her

Mo

Aes

tab

lish

edin

an

ima

lsA

sses

sw

het

her

key

even

tsin

anim

alM

oA

pla

usi

ble

inh

um

ans

Ass

ess

do

se-r

esp

on

sefo

rke

yev

ents

inM

oA

Ass

ess

ad

vers

ity

of

effe

cts

�A

dap

tive,

com

pen

sato

ry,

tran

sien

t,an

dre

ver

sib

leef

fect

sar

ele

ssli

kel

yto

be

adver

se�

Pre

curs

ors

toap

ical

effe

cts,

sever

eef

fect

s,an

dir

rever

sib

leef

fect

sar

em

ore

likel

yto

be

adver

se

Co

mp

are

alt

ern

ati

ve

acc

ou

nts

of

ob

serv

ati

on

s,in

clu

din

gco

ntr

ove

rsia

lis

sues

�A

sses

sw

het

her

ad

dit

ion

al

evid

ence

sup

po

rts

or

wea

ken

sth

eca

usa

la

ssu

mp

tio

n�

Pro

po

sen

ewh

ypo

thes

is/a

cco

un

to

fev

iden

ceif

itp

rovi

des

mo

reco

nvi

nci

ng

resu

lt�

Iter

ate

ass

essm

ent

ba

sed

on

new

evid

ence

ifre

sult

sa

rea

mb

igu

ou

so

rif

new

hyp

oth

esis

req

uir

essu

pp

ort

Fo

rmu

late

Wo

Eco

ncl

usi

on

sN

ote

ass

um

pti

on

s,es

pec

iall

yw

hen

they

are

adh

oc

Iden

tify

da

taga

ps

an

dp

rop

ose

nex

tst

eps

Ph

ase

4

Pro

po

seca

teg

ori

esfo

rca

usa

lre

lati

on

ship

s

Pro

po

sere

com

men

da

tio

ns

for

risk

ass

essm

ent

Pro

po

sere

com

men

da

tio

ns

for

rese

arc

ho

rp

oli

cies

Cu

rren

tst

eps

inth

eN

AA

QS

cau

sal

fram

ewo

rkar

ep

rese

nte

din

no

n-i

tali

cfo

nt

wit

had

dit

ion

alb

est

pra

ctic

esp

rese

nte

din

ital

icfo

nt.

So

me

of

the

add

itio

nal

bes

tp

ract

ices

may

hav

e,in

pra

ctic

e,b

een

inco

rpo

rate

din

toW

oE

eval

uat

ion

sin

the

crit

eria

po

llu

tan

tIS

As,

bu

tif

they

are

no

tex

pli

citl

yst

ated

asg

uid

ance

inth

ed

iscu

ssio

no

fth

eN

AA

QS

cau

sal

fram

ewo

rkin

the

ISA

Pre

amb

leo

rar

ed

iscu

ssed

aso

ccu

rrin

gin

aP

has

eth

atd

oes

no

tm

atch

wh

ere

they

sho

uld

occ

ur

acco

rdin

gto

Wo

Eb

est

pra

ctic

es,

they

are

pre

sen

ted

init

alic

fon

tas

add

itio

nal

bes

tp

ract

ices

.

838 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 11: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

but may nonetheless affect the assessment through their

bearing on interpretation of those studies that do meet

narrower criteria. Studies should be screened and selected,

with justification provided for inclusion or exclusion. All

strategies and decisions should be recorded, particularly those

that occur after the evaluation is underway. It is critical that

studies should not be excluded at this point in the analysis

based on quality, type (e.g. case studies), or if they report null

or negative findings. These factors should be considered in

subsequent phases (as we discuss further below). In addition,

in a scientific evaluation such as in the ISA, no study with

potential scientific use should be excluded based on policy-

relevance. Detailed instructions for conducting literature

searches can be found in the Cochrane Handbook for

Systematic Reviews of Interventions (Higgins & Green,

2011). Although this handbook is focused on medical

interventions, its guidance for literature searches can be

valuable in application to literature searches of any WoE

evaluation.

Phase 2: Develop and apply criteria for review ofindividual studies

In Phase 2, study characteristics and data are extracted, and

study quality is assessed and categorized. Individual study

results are assessed, and study relevance and adequacy are

determined (Tables 5 and 6).

It is crucial that all relevant data presented, (and not just a

subset of those data) are extracted from each study. This may

be an iterative process, as what is considered relevant for

inclusion may change as more studies are reviewed. In

general, however, the data extracted should be based on the

causal question. If positive data are extracted more often than

null data from individual studies, the data will appear to be

more consistent than they actually are. The amount of data

extracted per study is generally quite large; thus, data should

be organized into consistent categories so that they can be

compared within and across studies. For example, specific

health outcomes should be captured for each time period that

they are evaluated in each study, based on all relevant

statistical models, so that results can be compared for

consistency.

The NAAQS causal framework does not provide a clear

account of its study characteristic and data extraction process

(Table 6). Regarding study quality, as EPA notes, one should

consider general limitations for each study type (e.g.

controlled human exposure, epidemiology, toxicology)

(Table 1). For example, EPA states that it assigns higher

weight to studies with direct (i.e. personal) exposure meas-

urement, an absence of co-pollutant exposures, or use of

multi-pollutant models, and studies that use statistical

methods to detect and control for confounding and multiple

comparisons, when necessary. However, there also should be

a discussion of each of these factors that is independent of the

description of individual studies. EPA should discuss all of

the ways in which exposure can be measured, the strengths

and limitations of each method, the possibility for exposure

measurement error, and which methods carry the most

weight. For example, studies that use central-site monitors

to estimate personal exposure should be assigned less weight

than studies that use more refined methods, such as land-use

regression models, or personal monitoring data in the rare

cases it is available, to estimate exposure. There should also

be a discussion of statistical methods used among all studies

evaluated and which specific methods are more robust and

why (e.g. Have multiple comparisons been addressed? Are

assumptions in Cox proportional hazard models appropri-

ate?). Specific confounders should be addressed (e.g.

co-pollutants, socioeconomic status, age) in terms of how

they are handled in different studies and their likely impact on

results. If confounders are evaluated only at the beginning of a

study and not during follow-up, the potential for residual

confounding should be considered. Other factors that should

be considered in detail include bias, measurement precision,

replicability of observations, data reliability, outliers, select-

ive outcome reporting and fraudulent reporting. What is

crucial is that the way these quality measures are evaluated

is the same across studies. If a particular statistical model

is considered a limitation in one study, it must be determined

whether it should be considered a limitation in other studies

that use it for similar analyses. An evaluation of study quality

should be independent of the results and funding source of

that study. It should be based purely on the methods, in a

consistent manner across studies, and studies with more

robust methods should receive more weight in the overall

analysis.

Study quality can be categorized in a number of ways; it

can be done qualitatively or quantitatively using a ranking,

grading, or scoring method. EPA does not state explicitly how

it characterizes studies within the NAAQS causal framework.

Using a qualitative approach, all studies could be listed in a

table or set of tables, with strengths and limitations in each

column, so one can look across rows and columns to get an

idea of strengths and limitations across studies. When a group

of studies is then evaluated together in Phase 3, one can refer

to this table to determine what weight to give each study

based on its relative strengths and limitations.

There are several quantitative weighting approaches that

have been recommended for evaluating studies based on

quality. For example, Linkov et al. (2011) presented a

quantitative multi-criteria decision analysis framework for

evaluating individual lines of evidence in a WoE evaluation.

Similarly, Lavelle et al. (2012) proposed a step-wise approach

to systematically evaluate human and animal studies, which

includes rating and ranking studies based on data quality

criteria, and Klimisch et al. (1997) proposed a scheme for

evaluating human data into four ranking categories for

inclusion in a WoE evaluation (i.e. reliable without restric-

tion, reliable with restrictions, not reliable and not assign-

able). While all these approaches have merits, any

quantitative scheme can be arbitrary in terms of the weight

given to different facets (e.g. Should exposure measurement

error be given the same weight as an unaccounted for

confounder?). Thus, a qualitative approach should also be

considered.

Regardless of the approach used, studies should not be

eliminated from an analysis simply based on weakness or a

low score in a quantitative ranking; rather, they should be

given less weight in the final analysis. Low quality studies

(e.g. studies with several methodological limitations) need to

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 839

Page 12: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

be considered in Phase 3, particularly if these studies are the

basis of an explanation of the observations at hand that

contrasts with another explanation (as discussed further

below). Study quality information should be used to incorp-

orate results into the larger arguments, as in Phase 3, but study

quality should not drive this process. Studies with fatal flaws

(e.g. uninterpretable findings resulting from methodological

errors or falsified data) can be eliminated, however.

After study quality is addressed, the results of individual

studies should be assessed with consideration of strength of

association, internal consistency, biological plausibility, tem-

porality, dose-response, historical control rate, outcome

assessment and random error. Regarding strength of associ-

ation, EPA indicates that larger, precise risks increase

confidence that the association is not due to chance, bias or

other factors. EPA also notes, however, that ‘‘a small

magnitude in an effect estimate may represent a substantial

effect in a population’’ (Table 2). While this statement is true,

consistent with WoE best practices, EPA should apply the

same consideration to the converse: a small effect estimate

that is not due to chance, bias or another factor could

represent a substantial effect, but, because it is small, it may

actually indicate that the association is not causal. A great

deal of effect estimates in the air pollution epidemiology

literature are extremely small (e.g. 51.1); EPA should

acknowledge that associations of small magnitude do not

provide strong evidence of causality, as the hypothesis

attributing the association to a causal effect is not markedly

more plausible than is attributing the association to one or

several of the many factors (e.g. confounders, random chance)

that could partially or fully account for the statistical

association.

Internal consistency is not always, but always should be,

considered when assessing individual studies in the ISAs.

While it is not necessary that every statistical model produce

the same results (if that were the case, one would not need to

assess more than one model), the results should be relatively

consistent. The consideration of internal consistency also

should extend to the evaluation of results for multiple cities or

among regions, as well as for different time periods, as

discussed further below. If results are not consistent, the

reasons why should be explored. That is, the process of

evaluating the consistency and plausibility of support for a

causal interpretation should consider such issues as the

tentative reasons proposed for why apparent discordance

exists; why particular findings among the discordant results

should be considered informative about potential human risk

while other contrary findings are not; and why the contrary

findings should not be viewed as refutations of the causality

assertion. If a lack of causality is as likely an explanation as

another, the study should not be considered evidence for a

causal association.

EPA discusses the importance of biological plausibility, as

a known biologically plausible MoA increases the likelihood

that an association is causal. If there is evidence suggesting an

association is not biologically plausible (e.g. the body’s

defense mechanisms prevent an adverse effect up to a certain

point), however, this must be considered as well. Similarly, as

discussed by EPA (Table 2), a well-characterized exposure–

response relationship increases confidence for a cause-effect

association, but EPA should also consider that a lack of

exposure–response in a study with the statistical power to

evaluate it is evidence against causation.

Temporality is also considered by EPA, but the NAAQS

causal framework is missing a discussion (or evaluation) of

whether effects occur within an appropriate time frame.

Effects that occur at time points that cannot be linked to an

exposure in a biologically plausible manner cannot be

considered evidence of an effect. If a biologically plausible

time frame is not known a priori, one should look for

consistency among the time frames during which effects

occur across studies. Another important factor is the consid-

eration of controls. A control population might not be truly

representative of unexposed individuals, and this may influ-

ence the outcome. It is essential that the control population be

evaluated to determine if and how it influences results.

Another point regarding the assessment of individual study

results is the consideration of whether results are likely a

result of random error or chance. If these are just as likely

explanations as a causal association, one should not consider

an association indicative of causation.

The final step in Phase 2 is to determine whether and to

what extent studies are relevant to ambient exposures and the

relevance (i.e. appropriateness) and adequacy (i.e. usefulness)

of data for hazard and risk assessment. The NAAQS causal

framework indicates this should be done, but this is often not

explicitly discussed in ISAs.

Phase 3: Integrate and evaluate evidence

Phase 3 involves integrating and evaluating data across realms

of evidence (Tables 5 and 6). The NAAQS causal framework

includes modified Bradford Hill criteria (Table 2) to aid in

judgments regarding causality, which is consistent with Phase

3 best practices, but it should also update its guidance on how

the criteria should be applied to the available evidence and

how all criteria should be considered jointly (i.e. as a whole)

and consistently. As discussed above, descriptions for strength

of association, coherence, consistency, biological plausibility

and temporality are missing several key components that

should always be considered in a WoE evaluation (Table 6).

The EPA NAAQS causal framework looks separately at the

collective human studies and animal toxicology evidence,

first coming to a synthesized WoE judgment for each of these

realms and then integrating these separate judgments for

different lines of evidence into an overall qualitative statement

about causality and integration of human and animal toxicity

data with data on exposure levels (US EPA, 2013a). It is

preferable that the data evaluation be integrated across all

lines of evidence based on their mutual illumination of

potential biological processes that may underlie toxicity.

In this way, interpretation of each line of evidence informs the

interpretation of the others. This approach emphasizes the

evaluation of how the results of particular studies can be

applied more generally to inform the potential for similar

causal processes in other studies, including studies in other

realms of investigation. It is the potential for such common-

ality of causal processes that makes each particular result

constitute evidence about the causality question being

evaluated, and it is the reason that animal data can be

840 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 13: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

considered as evidence for potential effects in humans. EPA’s

method of integrating judgments at the end of the evaluation

does not allow data from one realm of evidence to influence

conclusions from another. For example, if an epidemiology

analysis can be interpreted two ways, and animal studies can

shed light on whether one is more plausible than the other,

this should be considered when making judgments about the

epidemiology analysis. An empirical finding of, say, a dose-

rate effect or a sex difference may be plausible without any

direct evidence among human data, but animal or mechanistic

studies may provide insight into whether such a phenomenon

has a plausible biological basis in humans. A related aspect is

that one should ensure that added assumptions – even

reasonable and biologically plausible ones – that are invoked

to account for variations within a realm of investigation (say,

among human studies) are not contradicted by assumptions

made when coming to integrated evaluation within another

realm (say, to explain variations among animal bioassay

outcomes in different rodent species). This cannot be done

if judgments are made separately within realms, and only

the conclusions (and not their basis) are integrated at the

end of the process.

An assessment of MoA also plays a crucial role in data

integration because it provides a means of understanding the

degree of parallelism to be expected in underlying causal

processes in human and animal responses, and it can give

insight into the potential reasons behind differences in these

processes at different exposure levels, exposure rates or study

populations. The NAAQS causal framework acknowledges

the importance of considering MoA, but it does not state

explicitly how data relevant to MoA are assessed. To the

degree possible, a proposed MoA and its key events, as well

as any alternative MoAs and their key events, should be

described and evaluated. Uncertainties, inconsistencies and

data gaps should be considered. It should be determined

whether a MoA has been established in animals and whether

key events in animals are plausible in humans, although less

than fully established MoAs can still be valuable tools in

assessing the plausibility of tentative understandings about the

bearing of different kinds of data on the overall causality

question. Finally, exposure–response relationships for par-

ticular MoAs should be evaluated, as some MoAs are

dependent upon exposure rate or cumulative exposure.

While it is not necessary to know the specific MoA to

establish causality, it certainly adds to the WoE of the

likelihood of a causal association. It can also help bound

exposure levels at which a MoA occurs. In addition, a known

MoA may provide evidence against causality if it can be

shown that certain key events in the MoA are not biologically

plausible in humans at relevant exposures.

The NAAQS causal framework includes an evaluation of

the adversity of effects, but this occurs after the WoE

decisions are made and causal relationships are categorized

(in Phase 4). It could also occur in Phase 3, however, when

making judgments about the WoE for causality. This would be

appropriate if the causal question is to establish whether an

agent can cause effects that are specifically deemed adverse,

rather than whether it can cause any effect at all. As Phase 4 is

an assessment of the sufficiency of evidence to make

decisions, and if a decision depends on an adversity question

(as is often the case, such as in the NAAQS process), then the

adequacy of the Phase 3 evaluation to support a Phase 4

conclusion on such matters requires a discussion of adversity,

and of the scientific basis for distinguishing adverse from

non-adverse effects. In general, if effects are determined to be

adaptive, compensatory, transient, or reversible, they are less

likely to be adverse; if they are precursors to apical effects,

severe or irreversible, they are more likely to be adverse

(Goodman et al., 2010). As there are exceptions, these should

not be considered as hard and fast rules; rather, they may

serve as guidelines for the evaluation of the adversity of

effects.

The NAAQS causal framework includes the weighing of

alternative views on controversial issues, but this should be

expanded to an evaluation of alternative explanations of the

same sets of results, regardless of whether the data are

controversial. The evidence for and against a hypothesis

should be discussed, and a clear illustration of how one

explanation compares to another should be provided.

Conclusions regarding causality should be presented not just

as the result of judgments but with their context of reasons for

choosing them over competing conclusions. Assumptions in

the evaluation should be noted, especially when they are

ad hoc (i.e. introduced to explain some phenomenon already

seen). One should also assess whether additional evidence

supports or weakens the causal assumption and provide new

hypotheses or accounts of the evidence if it provides a more

convincing result. The assessment should be updated based on

new evidence if results are ambiguous or a new hypothesis

requires support. At this point, one can formulate a WoE

conclusion, identify data gaps, and propose next steps. The

NAAQS causal framework could be strengthened by utilizing

this final step – particularly identifying data gaps – in Phase 3.

Phase 4: Draw conclusions based on inferences

The NAAQS causal framework proposes categories for causal

relationships (Table 4); based on these relationships, EPA

determines which health effects will be evaluated in quanti-

tative risk assessments. An important issue is that EPA’s

guidance for causality categorization is not congruent with the

judgments based on the Bradford Hill criteria. The framework

claims to rely heavily on the criterion of consistency across

studies in its categorization scheme, but, in practice, it does

not fully evaluate consistency or incorporate other criteria

such as coherence, biological plausibility, biological gradient

or strength of association (discussed below in the case study

of ozone).

EPA states that evidence is sufficient to conclude a causal

relationship if ‘‘chance, bias, and confounding [can] be ruled

out with reasonable confidence’’ (US EPA, 2013a), yet there

is no guidance on what constitutes ‘‘reasonable confidence’’.

Based on the current framework, EPA cannot make that

determination reliably because there is no guidance for

assessing chance, bias, or confounding in a consistent manner.

Although such an assessment is inevitably a scientific

judgment, the basis for coming to and justifying the judgment

should be included in the evaluation. Moreover, an evaluation

of how confident one is in ruling out chance, bias and

confounding should include an examination of the evidence

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 841

Page 14: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

for or against competing explanations of the results being

evaluated. This will help determine whether the results are

attributable to a causal effect of the agent or solely the results

of, or at least severely skewed by, the influence of chance,

bias or confounding. This evaluation should involve a

discussion of the possible scope for the chance, bias and

confounding, as well as the evidence for or against the actual

(rather than merely possible) influence of these factors on the

reported outcomes.

EPA suggests ‘‘controlled human exposure studies that

demonstrate consistent effects’’ constitute evidence for a

causal relationship (US EPA, 2013a), but it should indicate

that this is only true if the results are coherent with other lines

of evidence for particular exposure scenarios. EPA also

indicates that ‘‘observational studies that cannot be explained

by plausible alternatives’’ constitute evidence for a causal

relationship (US EPA, 2013a). Yet, EPA does not fully

explore alternative explanations for study results in practice,

as discussed below in the case study of the ozone ISA.

Currently, EPA sets forth a hypothesis (i.e. a criteria pollutant

causes a particular health effect) and determines whether the

data may support that hypothesis. As discussed above, EPA

does not, but should, fully explore whether and to what degree

the data support other hypotheses (e.g. a confounder, rather

than the criteria pollutant, causes a particular health effect). It

is only in this manner that alternative hypotheses can truly be

explored.

EPA states that evidence is sufficient to conclude a likely

causal relationship if ‘‘copollutant exposures are difficult to

address and/or other lines of evidence (controlled human

exposure, animal, or mode of action information) are limited

or inconsistent’’ or if ‘‘animal toxicological evidence from

multiple studies from different laboratories. . .demonstrate

effects, but limited or no human data are available’’ (US EPA,

2013a). EPA concludes evidence is suggestive of a causal

relationship if ‘‘at least one high-quality epidemiologic study

shows an association with a given health outcome but the

results of other studies are inconsistent’’ or if ‘‘a well-

conducted toxicological study, such as those conducted in the

National Toxicology Program (NTP), shows effects in animal

species’’ (US EPA, 2013a).

Any WoE evaluation, by definition, involves a consider-

ation of all lines of evidence in a consistent manner. It is not

about resolving all uncertainty but, rather, determining

whether the evidence as a whole supports causation more

than it supports a lack of effect. If co-pollutants cannot be

addressed or studies are inconsistent, the impact of the study

results is ambiguous and the WoE might equally well be

interpreted to be consistent with a lack of causality. In other

words, evaluating the base of data ‘‘as a whole’’ means that

one attends not simply to how much evidence can be adduced

to support (or to counter) the hypothesized causal effect, but

also how separate lines of evidence support (or contradict)

one another. How discrepancies are to be accounted for and

why apparent counterexamples to the generality of the

proposed causal process should not be regarded as refutations

of its applicability to the human risk question being evaluated

should also be assessed.

The NAAQS causal framework requires only one ‘‘high-

quality’’ study for evidence of causal relationship to be

deemed ‘‘suggestive’’. However, EPA does not provide

explicit guidance on what constitutes a high-quality study,

and, under this requirement, high-quality studies that are

inconsistent may not be considered as long as one high-

quality study exists that demonstrates an effect. Instead, all

studies should be reviewed using the same criteria, and only if

the WoE indicates that a causal association is more likely than

not, based on all the available data, should one conclude a

suggestive causal association. In a similar vein, because EPA

groups health outcomes into broad categories, one high-

quality study of a single endpoint (e.g. wheezing) can be used

as evidence for a suggestive causal relationship for a much

broader category that includes more severe outcomes (e.g.

respiratory effects).

An assessment of the adversity of health effects is included

in the NAAQS causal framework as a step to consider after

categorization of causal relationships, but, as noted above,

such an assessment can also be considered prior to making

judgments regarding causality (i.e. in Phase 3). Regardless, it

is key that EPA be more critical when defining an adverse

effect, and the assessment should differentiate between

homeostatic and biological changes that can adversely affect

health (Goodman et al., 2010). Another issue that is worth

pursuing in the NAAQS causal framework is the determin-

ation of the portion of an adverse health effect burden that

could be plausibly attributed to the particular criteria pollutant

under review. Even if the WoE indicates causality for a given

health effect, the estimated magnitude of the health burden

contributed by that pollutant may be very small (or even

negligible) in comparison to other causes of or contributors to

the same health effect. Failure to place the total burden of an

effect into its proper context is a step that is currently lacking

in regulatory decision-making.

Many of the issues noted above could be resolved by

updating EPA’s categories for causal determination (Table 4)

to be more consistent with the IOM framework (Table 3) on

which it was based originally, and we recommend such an

update to the NAAQS causal framework. This would not

only be in line with WoE best practices, but also would

improve the use of the causal determination categories for

risk communication. In the end, EPA should evaluate all of

the data in a consistent manner using well-specified criteria

and determine whether, as a whole, they constitute evidence

for causation or are more likely indicative of an alternative

explanation. It is by considering these alternatives that one

sees most directly how the evidence at hand relates to

possible conclusions.

Case study: Ozone ISA

The lack of explicit guidance in the NAAQS causal frame-

work has led to it being applied in a manner that does not

consistently meet its objectives to provide and communicate a

sound scientific basis for causality judgments. As a case

study, below, we provide several examples to demonstrate

how this lack of explicit guidance led to an overall unbalanced

evaluation of health effects in the ozone ISA, which yielded

causality conclusions that were not fully supported by the

evidence (US EPA, 2013a). The purpose of this case study is

not to indicate every instance in which the NAAQS causal

842 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 15: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

framework is applied inconsistently in the ISA, nor is the aim

to reargue evaluations of specific causality questions. Rather,

the purpose is to provide several concrete examples of the

systematic challenges that arise with the use of the NAAQS

causal framework in practice, as a way to indicate that the

general issues with the framework, that we describe above,

have real occurrences.

Phase 1

The NAAQS causal framework describes a general literature

search strategy and inclusion criteria for study selection, but

does not provide explicit guidance for this and does not

discuss exclusion criteria. EPA provides a database of the

studies considered for inclusion, noting which studies were

cited in the ozone ISA and which were not; however, it is

unclear why particular studies were excluded and if excluded

studies were relevant or informative.

For example, in the ozone ISA, EPA (2013a, p 2-2) states,

‘‘[l]iterature searches have been conducted routinely since

then to identify studies published since the last review,

focusing on studies published from 2005 (closing date for the

previous scientific assessment) through July 2011’’. EPA

included the study by Zanobetti & Schwartz (2011) in the

ozone ISA but omitted a study by Lipsett et al. (2011) that

was published online the same day (23 June 2011). EPA also

omitted a study by Spencer-Hwang et al. (2011), which was

published online on 21 July 2011. In addition, there were

several studies of both ozone and PM that were not included

in the ozone ISA but played a prominent role in EPA’s PM

evaluation (e.g. Jerrett et al., 2005; Miller et al., 2007). This

indicates that not all relevant studies were captured by the

literature search strategy used.

Study quality is a basis for study selection in the NAAQS

causal framework, but, as discussed above, studies of lower

quality could still provide information and should be included

and weighed in later phases of the evaluation. Some lower

quality studies that, despite their shortcomings, may have

contributed usefully to the WoE were likely excluded early on

from EPA’s analysis. The overall study selection process in

the ozone ISA builds on past analyses and public comments,

so it is possible that some relevant studies that may have been

excluded were considered later in the process. However,

without a systematic method for identifying and selecting

studies up front, the overall assessment may be biased.

Phase 2

Evaluation of study quality

The NAAQS causal framework does not describe a specific

method for evaluating the strengths and limitations of

individual studies and using these to determine the weight

of each study in the evaluation of causality. This led to an

inconsistent evaluation of individual studies throughout the

ozone ISA. As discussed below, there are many examples

in the ozone ISA where studies with positive associations

were emphasized over studies with null associations, as

opposed to studies of greater quality being emphasized over

those of lesser quality. This provided a false perception that

most of the reliable evidence supported a positive causal

association.

In the discussion of respiratory effects in adult day-hikers

in the ozone ISA, positive associations with lung function

decrements reported in one study (Korrick et al., 1998) were

emphasized over the null associations for the same endpoints

reported in another study (Girardot et al., 2006), and there

was no discussion of the strengths and limitations of either

study. In the evaluation of studies examining short-term ozone

exposure and cause-specific mortality, EPA stated that there is

evidence of associations with cardiovascular (CV)- and

respiratory-specific mortality, yet all risk estimates for

respiratory mortality and almost all for CV mortality were

null in analyses of these endpoints year-round; results were

mixed for both endpoints in analyses of the summer months,

when ambient ozone concentrations are highest.

In its summary of epidemiology data for short-term effects

of ozone on pulmonary inflammation and oxidative stress in

the ozone ISA, EPA stated that many recent studies reported

positive associations. Yet throughout its discussion, EPA

noted that the results were mixed and inconsistent. EPA did

not provide an explanation for why studies with positive

associations should carry more weight than those reporting

null associations. Finally, in the discussion of short-term

effects of ozone on respiratory symptoms, EPA stated there is

a ‘‘strong’’ body of evidence demonstrating associations

between ozone and increased respiratory symptoms (e.g.

wheezing) and asthma medication (e.g. bronchodilator) use in

asthmatic children, yet almost all risk estimates were null for

these outcomes. This also suggests that EPA’s evaluation of

the epidemiology data did not fully consider evidence from

studies with null results.

EPA’s evaluation of lung function decrements in asthmatic

children is another example of how the lack of methods for

evaluating the strengths and limitations of studies leads to an

overall inconsistent evaluation of individual studies across

pollutants. EPA relied on the study by Mortimer et al. (2002)

as a key study to indicate that increasing ozone concentrations

are associated with decreasing lung function in children with

asthma, as determined by self-reported peak expiratory flow

rate (PEFR) measurements (i.e. the maximum speed at which

an individual can breathe out air, a measure of the degree of

airway obstruction). By contrast, in the ISA for nitrogen

dioxide (US EPA, 2008a), EPA determined that the same

study was unreliable because of the use of self-reported PEFR

data, which has been demonstrated to have an unacceptably

high rate of error (Kamps et al., 2001).

If the NAAQS causal framework provided explicit guid-

ance on evaluating the strengths and limitations of each study,

it would allow for weight to be assigned to the studies in a

consistent manner based on quality. It could then be

determined whether the large number of studies with null

results carry equal or greater weight than the number with

positive results, strengthening the evidence against a causal

relationship, or whether the positive studies carry more

weight and EPA’s emphasis on these results is justified.

Consideration of study limitations

Measurement error. Exposure and outcome measurement

biases are potentially significant sources of uncertainty in

effect estimates derived from epidemiology studies of ozone

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 843

Page 16: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

exposure. Most epidemiology studies rely on data from

central ambient monitoring sites to provide community-

average ambient ozone exposure concentrations (e.g. Gent

et al., 2003; Katsouyanni et al., 2009; Mortimer et al., 2002;

Naeher et al., 1999; Neas et al., 1999; Stieb et al., 2009), and

the interpretation of statistical associations is predicated upon

the assumption that these ambient measurements reflect

personal exposures. Studies have shown that large discrepan-

cies exist in the ambient/personal ozone relationships, how-

ever, and these can vary across cities (e.g. Sarnat et al., 2001,

2005). Thus, the use of fixed ambient monitoring data as

surrogates for personal ozone exposures can result in

exposure measurement error, which can bias the results of

an epidemiology analysis in either direction (Jurek et al.,

2005, 2008; Wacholder et al., 1995). Rhomberg et al. (2011b)

reviewed exposure measurement errors that prevail in envir-

onmental epidemiology studies, demonstrating that the degree

of bias known to apply to these studies is sufficient to produce

a false low-dose linear result. In the ozone ISA, EPA

acknowledged that exposure measurement error is a potential

source of variability in the ozone epidemiology study results,

but it did not give weight to the fact that it is also a potentially

large source of bias that could account for the small statistical

associations reported in many of the studies on which EPA

relies for evidence of causality.

Outcome measurement error results from inaccurate

reporting of health effects, and this was also not always

considered a limitation in the ozone ISA. For example, self-

measured and self-reported lung function measurements are

an important limitation of many ozone respiratory morbidity

studies. These measurements can be unreliable, particularly if

a population is young and/or not highly motivated to comply

and when a long time period of self-reporting is required.

Despite this limitation, EPA relied on a number of studies

using self-reported measures of lung function as evidence of

causality in the ozone ISA. In one key study on which EPA

relied for lung function effects in asthmatic children,

O’Connor et al. (2008) evaluated lung function changes that

were self-measured by the children; EPA did not acknowledge

the high frequency of unreliable values in this study. Some

other studies on which EPA relied also included self-reporting

of symptoms and, as a result, were subject to recall bias. For

example, in a study of children with asthma, Gent et al. (2003)

used a non-standard questionnaire (a daily calendar) and

relied on subjective symptom reporting by participants’

mothers, which could have biased the effect estimates.

Overall, in the ozone ISA, EPA did not give appropriate

weight to the potential bias from outcome measurement error.

Confounders. Confounding is a major limitation of a number

of key ozone respiratory morbidity studies that is not

adequately accounted for in the ozone ISA. For example,

Mortimer et al. (2002) reported associations between ozone

and asthma exacerbation in children in single-pollutant

models, but no statistically significant associations were

observed in multi-pollutant models. This indicates that the

effect may have been influenced partially or fully by other

pollutants. Korrick et al. (1998) also adjusted for several

confounding factors in their study of ozone effects on lung

function in adult hikers, showing that confounding by PM2.5

and aerosol acidity could not be ruled out. Similarly, in a

study by Neas et al. (1999) of associations between decreased

lung function in children and ozone exposure, effects were

attenuated in a co-pollutant model with sulfate. EPA tends to

downplay the effect of co-pollutants, and generally considers

only study results from single-pollutant models as positive

evidence for causality.

Temperature and other environmental factors can also

confound the relationship between ozone and respiratory

morbidity, and it is not clear if this is considered by EPA. For

example, the longitudinal study of children with asthma by

Gent et al. (2003) only considered same-day maximum

temperature in the statistical models, while meteorological

variables such as relative humidity may have been potential

confounders of respiratory symptoms. Air conditioning use

and exposure to tobacco smoke are also important potential

confounders of the associations with respiratory effects that

were not accounted for in some of the key studies on which

EPA relied for its causality judgments (Mortimer et al., 2002;

Stieb et al., 2009).

Confounding by co-pollutants, meteorology, and other

factors is also a major source of uncertainty in time-series

analyses of ozone-mortality associations. This issue is

complicated by the fact that air pollutants can be highly

correlated, and it is often dismissed because of this. Some

studies have reported risk estimates that appear to be robust to

inclusion of co-pollutants (Bell et al., 2007), but findings

from other studies affirm that this issue remains relevant and

important. Specifically, many studies have investigated con-

founding of PM up to 10 micrometers in diameter (PM10),

PM2.5 and some PM components, and found evidence of

confounding effects (Franklin & Schwartz, 2008; Katsouyanni

et al., 2009; Smith et al., 2009), and a few studies have shown

confounding effects of temperature (Smith et al., 2009). For

example, Smith et al. (2009) reported that inclusion of PM10

with ozone in a two-pollutant model resulted in a 22–33%

decrease in the ozone-mortality association. Similarly,

Franklin & Schwartz (2008) reported a 31% decrease in the

ozone-mortality risk estimate when sulfate PM was included

in the model. Smith et al. (2009) also considered alternative

models to control for meteorology that resulted in reduced

mortality estimates, suggestive of additional confounding by

temperature. EPA should consider both the studies that show

confounding as well as those that do not in its assessments.

Although confounding is discussed as part of the evalu-

ation of respiratory morbidity and mortality in the ozone ISA,

it does not appear to play a key role in EPA’s ultimate

conclusions regarding the causal relationship between ozone

and these outcomes.

Model specification. The ozone ISA noted that, in selecting

epidemiology studies, consideration was given to methodo-

logical issues related to the interpretation of the evidence

(including model selection), but this was often not the case.

The results of many studies, including meta-analyses of ozone

mortality time-series studies, have indicated that model

selection has a key role in the determination of results.

The significance of model selection was brought to bear in

2002 when National Morbidity, Mortality, and Air Pollution

Study (NMMAPS) researchers discovered an issue with the

844 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 17: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

way the commonly used statistical model, the Generalized

Additive Model (GAM), was implemented by the statistical

software, S-plus. Specifically, these issues led to an investi-

gation of how differences in the level of control for temporal

trends and weather can affect the magnitude of effect

estimates. When compared to other statistical models or

different criteria for controlling these important confounding

factors, the GAM model in S-plus was generating much larger

health effects estimates (HEI, 2003). Similarly, studies have

also shown that ozone mortality estimates can vary depending

on the degrees of freedom selected for smoothing long-term

trend (HEI, 2003; Ito et al., 2005; Katsouyanni et al., 2009).

In the Air Pollution and Health: a European and North

American Approach (APHENA) study, which included

datasets from US, Canadian, and European multi-city studies,

ozone mortality estimates were sensitive to the smoothing

function type applied (Katsouyanni et al., 2009). Despite

extensive sensitivity analyses comparing a number of differ-

ent models, Katsouyanni et al. (2009) were unable to identify

a model deemed most appropriate for comparing health effect

estimates across the different study locations they evaluated.

They reported large differences with penalized versus natural

splines, as results were negative when penalized splines were

used and positive when natural splines were used. In the

ozone ISA, EPA only presented the positive associations that

were reported from the use of natural splines, because

‘‘alternative spline models have been previously shown to

result in similar effect estimates’’. Although EPA provided a

justification for why it did not present the APHENA results

from both smoothing functions, this justification does not

make sense when the large APHENA study (upon which EPA

relied heavily in the ozone ISA for its causal determinations)

indicates that there is sensitivity of risk estimates to the type

of smoothing function used in the model.

These examples highlight the prevalence of model selec-

tion bias in the epidemiology literature. Despite findings from

several researchers that have shown the substantial impact of

model selection bias (e.g. Clyde, 2000; Ito, 2003; Koop &

Tole, 2004; Koop et al., 2010; Moolgavkar et al., 2013), EPA

has not systematically considered this issue in its causality

determinations for ozone.

Phase 3

Consideration of modified Bradford Hill criteria

The NAAQS causal framework indicates that modified

Bradford Hill criteria should be used to aid in causal

judgments, but it does not provide explicit guidance regarding

how these criteria should be applied to the available evidence

and how all criteria should be considered jointly.

Furthermore, several criteria were not fully considered by

EPA in the ozone ISA, including biological plausibility,

biological gradient and strength of association. Several

examples are described below.

EPA’s evaluation of lag times (i.e. the period of time

between the measurement of an exposure and an effect) is an

example of not fully considering biological plausibility. Many

epidemiology studies utilized statistical models that examined

health effects occurring at several time points after measured

ozone exposures. Extensive human clinical and mechanistic

data on ozone indicate that the respiratory effects of ozone

occur soon after exposure. Findings for 0- to 1-day lags or for

cumulative ozone exposure over a few days are most likely to

be biologically plausible, while findings for longer lags are

less likely to be so. Therefore, effects (or a lack thereof)

reported at the earlier time points should have carried more

weight than those reported at later times, but, in the ozone

ISA, effects at any time point were generally considered

evidence for causality.

EPA also often overlooked the criterion of biological

gradient (i.e. exposure–response relationship) in drawing

conclusions in the ozone ISA. For example, EPA noted that

there is no exposure–response relationship for lung function

changes across short-term exposure studies of outdoor

workers (Brauer et al., 1996; Chan & Wu, 2005; Hoppe

et al., 1995; Romieu et al., 1998). This provides evidence

against a causal association. Yet, in the ozone ISA’s summary

and causal conclusion for short-term respiratory effects, this

was not discussed, indicating it carried little, if any, weight in

EPA’s final evaluation.

As a final example, EPA did not fully consider the strength

of associations reported in the ozone epidemiology studies

and did not factor this into the overall judgments of causality

in the ozone ISA. EPA appears to have considered any risk

estimate above the null as supportive of a causal relationship,

regardless of the magnitude of the association. Many

epidemiology studies report very weak findings, and this is

true for studies of ozone. As reviewed by Taubes (1995),

epidemiologists generally consider risk estimates greater than

3–4 to reflect strong associations and to be supportive of a

causal link, while smaller risk estimates (1.5–3) are con-

sidered to be weak and require other lines of evidence to

demonstrate causality. In addition, many small, statistically

significant associations are found to be not statistically

significant when confounders are accounted for. Boffetta

et al. (2008) noted that the importance of unmeasured and

residual confounding as a source of bias has been sometimes

downplayed in the literature, but a recent statistical simulation

study showed that, with plausible assumptions, such con-

founding can generate effect sizes in the range of 1.5–2.0.

EPA relied on many studies with risk estimates even lower

than this. For example, in EPA’s evaluation of associations

between ozone exposure and respiratory symptoms in

children with asthma, 35 of 37 risk estimates above the null

that were presented as evidence of effects were between 1.01

and 1.39, with two additional risk estimates of 1.59 and 2.13.

EPA did not fully consider that the positive associations on

which it relied for causal associations are often weak and

within the range of magnitude that can be attributed to

confounding. If these weak associations are to be considered

as evidence of causality, consistent with WoE best practices,

EPA should also give equal weight to associations of equal

magnitude in the opposite direction (i.e. 0.72 to 0.99) as

evidence for a protective effect of ozone. As a protective

effect of ozone on the respiratory system is not plausible,

associations of this magnitude (either positive or negative) are

likely not associated with ozone exposure and instead can be

attributed to confounding or other factors.

The above examples indicate some of the specific criteria

that were not fully considered by EPA. It is also apparent that

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 845

Page 18: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

not all of the criteria are considered jointly. For each health

effect, EPA should systematically consider each of the criteria

and determine where the WoE lies. While every criterion does

not have to be met per se, each needs to be considered.

For those that are not met, EPA should provide an explan-

ation of why that criterion does not affect the interpretation of

the results. With no clear guidance on how to apply the

criteria or how to consider them all jointly, EPA’s evaluations

of individual studies tend to be biased toward findings

of effect.

Weighing alternative views

The NAAQS causal framework indicates one should weigh

alternative views on controversial issues, but it does not

provide guidance on how to do this. The framework should be

more explicit in this regard, e.g. by noting that evidence for

and against competing explanations should be discussed and

clear illustrations of how one explanation compares to the

other should be provided. There are several instances in the

ozone ISA where EPA did not discuss or give due weight to

alternative views.

For example, EPA noted that consideration of the limita-

tions of epidemiology studies, such as potential confounding

and exposure measurement error, must be taken into account

to properly inform the interpretation of epidemiology evi-

dence. Yet, in its final evaluation, EPA did not consider that

another factor may have caused the health effects associated

with ozone in certain epidemiology studies. EPA did not

discuss the reasons why this view (i.e. ozone is not causal)

is less likely to be true than the view that ozone is the

causal factor.

In addition, EPA reviewed several recent controlled human

exposure studies of ozone and decrements in lung function

over a total of 6.6 hours of exposure (Adams, 2006; Schelegle

et al., 2009; Kim et al., 2011). The mean lung function

decrement for the subjects in each study after exposure to 60

parts per billion (ppb) ozone was only statistically significant

in the study by Kim et al. (2011). Because of concern that the

statistical test in the original publication was not sufficiently

powerful to detect an effect on lung function, EPA reanalyzed

the data for the final time point (but not at other interim

hourly time points) in the study by Adams (2006) using

several statistical methods that differed from those of the

study author. This reanalysis, reported by Brown (2007) and

Brown et al. (2008), indicated statistically significant lung

function decrements at 60 ppb ozone. Other investigators have

also reanalyzed the data from the study by Adams (2006)

using various statistical methods and reported no statistically

significant changes in lung function (Lefohn et al., 2010;

Nicolich, 2007). It has been argued that EPA’s approach

produced statistically significant results that can be attributed

to the majority of the data being selectively omitted from the

analysis (Goodman & Sax, 2012). EPA appeared to weigh the

findings of its own reanalyses over the original analyses

conducted by the study authors and reanalyses by other

investigators without fully considering the strengths and

limitations of each analytical method.

As a final example, EPA examined the concentration-

response (C-R) relationship between short-term ozone

exposure and mortality, but it did not weigh alternative

views regarding the shape of the C-R curve and the possible

existence of a threshold (i.e. the minimum ozone concentra-

tion necessary to produce an effect). It is well recognized that

epidemiology analyses such as those that investigate associ-

ations between mortality and air pollution exposures are

complicated by the noise in the observational data and

uncertainties in the epidemiology models, both of which may

mask a threshold. Despite the challenges in identifying ozone

thresholds in ozone-mortality evaluations, several studies

have applied various statistical techniques to evaluate the

nature of the ozone-mortality C-R function and found

evidence suggestive of a threshold for ozone-mortality effects

(Smith et al., 2009; Stylianou & Nicolich, 2009; Xia & Tong,

2006). Similarly, Jerrett et al. (2009) reported evidence of

improved model fit with a threshold model. EPA appeared to

weigh the view that there is no clear threshold over the

alternative view that a threshold exists, even though there is

mounting evidence to support the latter.

Overall, while EPA often discusses alternative views, it

does not appear to weigh them appropriately. To avoid this,

the NAAQS causal framework should provide explicit

guidance regarding how to discuss and compare the evidence

for and against competing explanations.

Phase 4

The NAAQS causal framework claims to rely heavily on the

consistency of results across studies in its categorization of

causal relationships between criteria pollutants and different

health effects, but, in practice, the consistency of results is not

evaluated fully. An example of this can be seen by comparing

the classification of causal relationships for ozone-related

mortality in the ozone ISA to those in the previous NAAQS

review of ozone, as summarized in the 2006 AQCD for ozone

(US EPA, 2006).

EPA’s causal determination for short-term ozone exposure

and total all-cause mortality was upgraded from ‘‘highly

suggestive’’ in the previous review to ‘‘likely causal’’ in the

ozone ISA. In the 2006 AQCD, a large body of epidemiology

studies, including single- and multi-city studies, was eval-

uated. Only a limited number of additional studies have been

published since then, including several that are follow-up

studies or reanalyses of previous studies. Importantly, EPA

stated in the ozone ISA that the new evidence is supportive of

a robust association between short-term ozone exposures and

mortality, but the studies showed very small pooled mortality

effects that are inconsistent across studies. Furthermore,

multi-city studies have demonstrated that there is substantial

heterogeneity in city-specific findings, which are dominated

by ozone associations that are not statistically significant and

close to null, or even negative (i.e. indicating reduced

mortality with increased ozone exposure). The pooled

‘‘national’’ ozone-mortality effect estimates are therefore

misleading, as they do not reflect this heterogeneity.

In addition, EPA did not fully consider many of the important

uncertainties that are still present in the new studies, including

confounding, model selection, appropriate lag times, non-

threshold assumptions in the C-R relationship, and exposure

measurement error.

846 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 19: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

With respect to long-term ozone exposure and mortality,

EPA relied on very limited new evidence since the last ozone

review in upgrading its classification from ‘‘insufficient’’ to

‘‘suggestive’’ evidence of a causal relationship in the ozone

ISA. Specifically, EPA appears to have relied solely on one

new study that provided some findings suggesting an

increased risk of respiratory-related mortality for long-term

ozone exposure (but no consistent associations for all-cause

mortality) (Jerrett et al., 2009), and did not fully consider

other key studies that suggest no effects of ozone on total and

cause-specific mortality or counterintuitive geographical

heterogeneity in mortality risks (e.g. Abbey et al., 1999;

Dockery et al., 1993; Jerrett et al., 2005; Lipfert et al., 2000;

Miller et al., 2007). In addition, new studies published since

the last review support a lack of association between long-

term ozone exposure and mortality from specific causes,

including respiratory causes (e.g. Lipsett et al., 2011). This

example illustrates that requiring only one positive (which

may be high-quality, but still may have issues) study to

provide ‘‘suggestive’’ evidence can yield conclusions toward

a causal determination even if the majority of the evidence

does not support this.

Because the NAAQS causal framework does not fully

evaluate the consistency of results across studies or fully

incorporate other Bradford Hill criteria such as coherence,

biological plausibility, biological gradient and strength of

association, the resulting causality categorization is not

congruent with judgments based on these criteria. To address

this, EPA could redefine these categories so that they more

accurately reflect the WoE. Further, in harmony with best

practices, EPA should evaluate all of the data in a consistent

manner, using well-specified criteria that can be applied

consistently across evaluations over time, to determine

whether they constitute evidence for causation.

Conclusions

The current NAAQS framework for causal determination has

several valuable features, but there are certain elements that

are missing or not stated explicitly. To improve the NAAQS

framework, EPA could add explicit guidance regarding study

selection criteria. These criteria should include specifying

that no studies be excluded based on their quality or results;

rather, weak studies should be given less weight in the final

analysis. Also, to reduce bias, analyses conducted by EPA

should not be included unless they are peer reviewed. In

addition, guidance for data extraction, clearly stating that all

data relevant to the causal question should be extracted from

each study and not just a subset of the data, should be added

to the NAAQS framework. In terms of evaluating study

quality and results, this should be done in the same manner

across all studies to ensure that the assessment is thorough

and consistent. Data suggesting a lack of biological plausi-

bility or a lack of an exposure–response relationship for a

specific effect should be considered as evidence against

causation.

The NAAQS causal framework should provide guidance

on how the Bradford Hill criteria should be applied to the

available evidence and how all criteria should be considered

jointly; the lack of such guidance in the current framework

can lead to an assessment that unduly favors a causal

determination. Indeed, in following the current framework,

EPA sets forth a hypothesis and determines whether the data

support that hypothesis. The framework should include

guidance regarding the weighing of alternative hypotheses,

explicitly directing EPA to fully explore whether and to what

degree the data support other hypotheses. The current

framework’s method of integrating judgments for different

realms of evidence at the end the evaluation also does not

allow data from one realm of evidence to inform the

interpretation of the others. The framework should state

explicitly that the data evaluation be integrated across all

realms of evidence to provide insight into potential biological

processes underlying the effects under consideration

for causality.

Lastly, there should be guidance for causality categoriza-

tion that is congruent with the judgments based on the

Bradford Hill criteria, and the use of categories for causal

determination should be based on a consistent evaluation of

all of the data using well-specified criteria for causation.

Although the current NAAQS causal framework includes

several of the elements discussed above, it does not

always contain explicit guidance for them, and this can

lead to conclusions of causality that are not fully supported

by the evidence.

We provided some specific examples of how the limita-

tions of the current NAAQS causal framework led to an

evaluation of ozone-related health effects that was not always

consistent and yielded causality conclusions that were not

fully supported by the evidence. This is also apparent in other

assessments that used the NAAQS causal framework. For

example, in a 9 December 2011, letter to EPA, the Clean Air

Scientific Advisory Committee review panel for the first draft

ISA for lead recommended a ‘‘more rigorous and transparent

‘weight of the evidence’ analysis’’ that should ‘‘devote more

attention to the limitations of the existing studies with respect

to consistency, reproducibility, bias, control for confounders,

and shortcomings in statistical methodology’’ (Frey & Samet,

2011). In addition, many of the same issues were also

identified in EPA’s draft IRIS assessment of formaldehyde,

prompting the NAS panel that reviewed that assessment to

propose a roadmap for improving the risk assessment process

and to recommend that EPA use existing models of systematic

WoE approaches for hazard identification.

If EPA undertakes a program to revise the NAAQS causal

framework to be more consistent with the NAS roadmap, it is

our hope that the suggestions in this paper will be considered

in the process. The NAAQS causal framework could be

revised so that it is consistent with WoE best practices, and

the causal determination categories should be updated to

reflect these revisions. The full and consistent application of

the revised framework is needed to ensure that future

assessments of the potential health effects of criteria air

pollutants will be thorough, transparent, and scientifically

sound.

Acknowledgements

The authors thank the anonymous reviewers whose comments

helped to improve this paper.

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 847

Page 20: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

Declaration of interest

The authors are employed by Gradient, a private environ-

mental consulting firm. The Gradient staff has strong

expertise in assessing human, experimental animal, and

mechanistic data in WoE analyses (as is evident in recent

evaluations conducted for bisphenol A, naphthalene, formal-

dehyde, chlorpyrifos, methanol, styrene, nickel, and toluene

diisocyanate) and has presented several of these analyses to

regulatory bodies. In addition, the Gradient staff, including

the authors of this paper, have carefully evaluated the science

underlying EPA’s review of various NAAQS and offered both

oral and written testimony to EPA. Gradient has also

addressed issues on systematic review and integration of

evidence for a number of clients. The work reported in this

paper was conducted by the authors during the normal course

of employment at Gradient with financial support provided by

the American Petroleum Institute (API). API is the major

trade association of corporations in the petroleum sector, from

discovery through production and refining. Drafts of this

paper were reviewed by members or affiliates of API. The

authors have the sole responsibility for the writing, contents,

and conclusions in this paper. The conclusions are not

necessarily those of API.

References

Abbey DE, Nishino N, McDonnell WF, et al. (1999). Long-terminhalable particles and other air pollutants related to mortality innonsmokers. Am J Respir Crit Care Med, 159, 373–82.

Adami HO Berry SC, Breckenridge C, et al. (2011). Toxicology andepidemiology: Improving the science with a framework for combiningtoxicological and epidemiological evidence to establish causalinference. Toxicol Sci, 122, 223–34.

Adams WC. (2006). Comparison of chamber 6.6-h exposures to 0.04–0.08 ppm ozone via square-wave and triangular profiles on pulmonaryresponses. Inhal Toxicol, 18, 127–36.

Bachmann J. (2007). Will the circle be unbroken: a history of the USNational Ambient Air Quality Standards. J Air Waste Manage Assoc,57, 652–97.

Bell ML, Kim JY, Dominici F. (2007). Potential confounding ofparticulate matter on the short-term association between ozone andmortality in multisite time-series studies. Environ Health Perspect,115, 1591–5.

Boffetta P, McLaughlin JK, La Vecchia C, et al. (2008). False-positiveresults in cancer epidemiology: a plea for epistemological modesty.J Natl Cancer Inst, 100, 988–95.

Borgert CJ, Mihaich EM, Ortego LS, et al. (2011). Hypothesis-drivenweight of evidence framework for evaluating data within the US EPA’sEndocrine Disruptor Screening Program. Regul Toxicol Pharmacol,61, 185–91.

Brauer M, Blair J, Vedal S. (1996). Effect of ambient ozone exposure onlung function in farm workers. Am J Respir Crit Care Med, 154,981–7.

Brown JS, Bateson TF, McDonnell WF. (2008). Effects of exposure to0.06 ppm ozone on FEV1 in humans: a secondary analysis of existingdata. Environ Health Perspect, 116, 1023–6.

Brown JS. [US EPA, National Center for Environmental Assessment(NCEA)]. (2007). Memo to Ozone NAAQS Review Docket (OAR-2005-0172) re: the effects of ozone on lung function at 0.06 ppm inhealthy adults. 8p., June 14.

Chan CC, Wu TH. (2005). Effects of ambient ozone exposure on mailcarriers’ peak expiratory flow rates. Environ Health Perspect, 113,735–8.

Clyde M. (2000). Model uncertainty and health effect studies forparticulate matter. Environmetrics, 11, 745–63.

Dockery DW, Pope CA, Xu X, et al. (1993). An association between airpollution and mortality in six U.S. cities. N Engl J Med, 329, 1753–9.

European Centre for Ecotoxicology and Toxicology of Chemicals(ECETOC). (2009). Framework of the Integration of Human and

Animal Data in Chemical Risk Assessment. ECETOC TechnicalReport No. 104. 130p., January.

Franklin M, Schwartz J. (2008). The impact of secondary particles on theassociation between ambient ozone and mortality. Environ HealthPerspect, 116, 453–8.

Frey HC, Samet JM. (2011). Letter to L. Jackson re: CASAC Review ofEPA’s ‘‘Integrated Science Assessment for Lead (First ExternalReview Draft – May 2011)’’. Clean Air Scientific AdvisoryCommittee (CASAC). 120p., December 9. [Online] Available from:http://yosemite.epa.gov/sab/sabproduct.nsf/0/D3E2E8488025344D852579610068A8A1/$File/EPA-CASAC-12-002-unsigned.pdf [lastaccessed 29 June 2012].

Gent JF, Triche EW, Holford TR, et al. (2003). Association of low-levelozone and fine particles with respiratory symptoms in children withasthma. JAMA, 290, 1859–67.

Girardot SP, Ryan PB, Smith SM, et al. (2006). Ozone and PM2.5exposure and acute pulmonary health effects: a study of hikers in theGreat Smoky Mountains National Park. Environ Health Perspect, 114,1044–52.

Goodman JE, Dodge DG, Bailey LA. (2010). A framework for assessingcausality and adverse effects in humans with a case study of sulfurdioxide. Regul Toxicol Pharmacol, 58, 308–22.

Goodman JE, Sax SN. (2012). Comments on the Integrated ScienceAssessment for Ozone and Related Photochemical Oxidants (ThirdExternal Review Draft). 125p. Submitted to EPA on 15 August 2012.Docket ID: EPA-HQ-ORD-2011-0050.

Health Effects Institute (HEI). 2003. Revised analyses of time-seriesstudies of air pollution and health. 306p., May.

Higgins JPT, Green S. (2011). Cochrane handbook for systematicreviews of interventions (Version 5.1.0). The Cochrane Collaboration.March. Available from: http://www.cochrane-handbook.org/ [lastaccessed 03 Feb 2012].

Hill AB. (1965). The environment and disease: association or causation?Proc R Soc Med, 58, 295–300.

Hoppe P, Praml G, Rabe G, et al. (1995). Environmental ozone fieldstudy on pulmonary and subjective responses of assumed risk groups.Environ Res, 71, 109–21.

Institute of Medicine (IOM). (2008). Improving the presumptivedisability decision-making process for veterans. Committee onEvaluation of the Presumptive Disability Decision-Making Processfor Veterans, Board on Military and Veterans Health, NationalAcademies Press. 781p.

International Agency for Research on Cancer (IARC). (2006). Preambleto the IARC monographs on the evaluation of carcinogenic risks tohumans. 27p. Available from: http://monographs.iarc.fr/ENG/Preamble/CurrentPreamble.pdf [last accessed 20 Sep 2012].

Ito K. (2003). Associations of particulate matter components with dailymortality and morbidity in Detroit, Michigan. In: Revised analyses oftime-series studies of air pollution and health. Boston, MA: HealthEffects Institute, 143–156.

Ito K, De Leon SF, Lippmann M. (2005). Associations between ozoneand daily mortality: analysis and meta-analysis. Epidemiology, 16,446–57.

Jerrett M, Burnett RT, Pope CA, et al. (2009). Long-term ozone exposureand mortality. N Engl J Med, 360, 1085–95.

Jerrett M, Burnett RT, Ma R, et al. (2005). Spatial analysis of airpollution and mortality in Los Angeles. Epidemiology, 16, 727–36.

Jurek AM, Greenland S, Maldonado G. (2008). How far from non-differential does exposure or disease misclassification have to be tobias measures of association away from the null? Int J Epidemiol, 37,382–5.

Jurek AM, Greenland S, Maldonado G, Church TR. (2005). Properinterpretation of non-differential misclassification effects: expect-ations vs. observations. Int J Epidemiol, 34, 680–7.

Kamps AW, Roorda RJ, Brand PL. (2001). Peak flow diaries inchildhood asthma are unreliable. Thorax, 56, 180–2.

Katsouyanni K, Samet JM, Anderson HR, et al. (2009). Air pollutionand health: a European and North American Approach (APHENA).HEI Research Report 142. 132p., October 29.

Kim CS, Alexis NE, Rappold AG, et al. (2011). Lung function andinflammatory responses in healthy young adults exposed to 0.06 ppmozone for 6.6 hours. Am J Respir Crit Care Med, 183, 1215–21.

Klimisch HJ, Andreae M, Tillmann U. (1997). A systematic approach forevaluating the quality of experimental toxicological and ecotoxico-logical data. Regul Toxicol Pharmacol, 25, 1–5.

848 J. E. Goodman et al. Crit Rev Toxicol, 2013; 43(10): 829–849

Page 21: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

Koop G, McKitrick R, Tole L. (2010). Air pollution, economic activityand respiratory illness: evidence from Canadian cities, 1974–1994.Environ Model Softw, 25, 873–85.

Koop G, Tole L. (2004). Measuring the health effects of air pollution: towhat extent can we really say that people are dying from bad air?J Environ Econ Manage, 47, 30–54.

Korrick SA, Neas LM, Dockery DW, et al. (1998). Effects of ozone andother pollutants on the pulmonary function of adult hikers. EnvironHealth Perspect, 106, 93–9.

Lavelle KS, Schnatter AR, Travis KZ, et al. (2012). Framework forintegrating human and animal data in chemical risk assessment. RegulToxicol Pharmacol, 62, 302–12.

Lefohn AS, Hazucha MJ, Shadwick D, Adams WC. (2010). Analternative form and level of the human health ozone standard. InhalToxicol, 22, 999–1011.

Lipfert FW, Perry HM, Miller JP, et al. (2000). The WashingtonUniversity-EPRI Veterans’ Cohort Mortality Study: preliminaryresults. Inhal Toxicol, 12, 41–73.

Lipsett MJ, Ostro BD, Reynolds P, et al. (2011). Long-term exposure toair pollution and cardiorespiratory disease in the California teachersstudy cohort. Am J Respir Crit Care Med, 184, 828–35.

McClellan RO. (2012). Role of science and judgment in setting nationalambient air quality standards: how low is low enough? Air QualAtmos Health, 5, 243–58.

Miller KA, Siscovick DS, Sheppard L, et al. (2007). Long-term exposureto air pollution and incidence of cardiovascular events in women.N Engl J Med, 356, 447–58.

Moolgavkar SH, McClellan RO, Dewanji A, et al. (2013). Time-seriesanalyses of air pollution and mortality in the United States: asubsampling approach. Environ Health Perspect, 121, 73–8.

Mortimer KM, Neas LM, Dockery DW, et al. (2002). The effect of airpollution on inner-city children with asthma. Eur Respir J, 19,699–705.

Naeher LP, Holford TR, Beckett WS, et al. (1999). Healthy women’sPEF variations with ambient summer concentrations of PM10, PM2.5,SO42-, Hþ, and O3. Am J Respir Crit Care Med, 160, 117–25.

National Research Council (NRC). (2011). Review of the EnvironmentalProtection Agency’s Draft IRIS Assessment of Formaldehyde.National Academies Press. 194p., April. Available from: http://www.nap.edu/catalog/13142.html [last accessed 08 April 2011].

Neas LM, Dockery DW, Koutrakis P, Speizer, FE. (1999). Fine particlesand peak flow in children: Acidity versus mass. Epidemiology, 10,550–3.

Nicolich M. (2007). Letter Report to US EPA, Air and Radiation Docket.Re: Comments on EPA’s Proposed ‘‘National Ambient Air QualityStandards for Ozone’’, 72 Fed Reg. 37, 818 (11 July 2007). DocketNo. EPA-HQ-OAR-2005-0172. October 9.

O’Connor GT, Neas L, Vaughn B, et al. (2008). Acute respiratory healtheffects of air pollution on children with asthma in US inner cities.J Allergy Clin Immunol, 121, 1133–9.

Rhomberg LR, Bailey LA, Goodman JE, et al. (2011a). Is exposureto formaldehyde in air causally associated with leukemia? – ahypothesis-based weight-of-evidence analysis. Crit Rev Toxicol, 41,555–621.

Rhomberg LR, Bailey LA, Goodman JE. (2010). Hypothesis-basedweight of evidence: a tool for evaluating and communicatinguncertainties and inconsistencies in the large body of evidence inproposing a carcinogenic mode of action – naphthalene as an example.Crit Rev Toxicol, 40, 671–96.

Rhomberg LR, Chandalia JK, Long CM, Goodman JE. (2011b).Measurement error in environmental epidemiology and the shape ofexposure–response curves. Crit Rev Toxicol, 41, 651–71.

Rhomberg LR, Goodman JE, Bailey LA, et al. (2013). A surveyof frameworks for best practices in weight-of-evidence analyses.Crit Rev Toxicol. DOI: 10.3109/10408444.2013.832727.

Romieu I, Meneses F, Ramirez M, et al. (1998). Antioxidant supple-mentation and respiratory functions among workers exposed to highlevels of ozone. Am J Respir Crit Care Med, 158, 226–32.

Sarnat JA, Brown KW, Schwartz J, et al. (2005). Ambientgas concentrations and personal particulate matter exposures: impli-cations for studying the health effects of particles. Epidemiology, 16,385–95.

Sarnat JA, Schwartz J, Catalano PJ, Suh HH. (2001). Gaseous pollutantsin particulate matter epidemiology: confounders or surrogates?Environ Health Perspect, 109, 1053–61.

Schelegle ES, Morales CA, Walby WF, et al. (2009). 6.6-Hour inhalationof ozone concentrations from 60 to 87 parts per billion in healthyhumans. Am J Respir Crit Care Med, 180, 265–72.

Smith RL, Xu B, Switzer P. (2009). Reassessing the relationship betweenozone and short-term mortality in U.S. urban communities. InhalToxicol, 21, 37–61.

Spencer-Hwang R, Knutsen SF, Soret S, et al. (2011). Ambient airpollutants and risk of fatal coronary heart disease among kidneytransplant recipients. Am J Kidney Dis, 58, 608–16.

Stieb DM, Szyszkowicz M, Rowe BH, Leech JA. (2009). Air pollutionand emergency department visits for cardiac and respiratory condi-tions: A multi-city time-series analysis. Environ Health, 8, 25.

Stylianou M, Nicolich MJ. (2009). Cumulative effects and thresholdlevels in air pollution mortality: Data analysis of nine large US citiesusing the NMMAPS dataset. Environ Pollut, 157, 2216–23.

Swaen G, van Amelsvoort L. (2009). A weight of evidence approach tocausal inference. J Clin Epidemiol, 62, 270–7.

Taubes G. (1995). Epidemiology faces its limits. Science, 269,164–9.

US EPA. (2005). Guidelines for carcinogen risk assessment. RiskAssessment Forum, EPA/630/P-03/001F. March.

US EPA. (2006). Air quality criteria for ozone and related photochemicaloxidants. Volume I. EPA 600/R-05/004aF. 821p., February.

US EPA. (2008a). Integrated science assessment for oxides of nitrogen.EPA/600/R-08/071. 260p., July.

US EPA. (2008b). Integrated science assessment for sulfur oxides –health criteria. EPA/600/R-08/047F. September.

US EPA. (2009). Integrated science assessment for particulate matter(final). EPA/600/R-08/139F. 2228p., December.

US EPA. (2010a). Integrated science assessment for carbon monoxide.EPA/600/R-09/019F. 593p., January.

US EPA. (2010b). Toxicological review of formaldehyde – inhalationassessment (CAS No. 50-00-0) in support of summary information onthe integrated risk information system (IRIS). Volumes I–IV (Draft).EPA/635/R-10/002A. 1043p., June.

US EPA. (2012). Integrated science assessment for lead (third externalreview draft). EPA/600/R-10/075C. 1762p., November.

US EPA. (2013a). Integrated science assessment for ozone and relatedphotochemical oxidants (final). EPA/600/R–10/076F. 1251p.,February.

US EPA. (2013b). Process of reviewing the national ambient air qualitystandards. 2p., June. Available from: http://www.epa.gov/ttn/naaqs/review.html [last accessed 19 Aug 2013].

Wacholder S, Hartge P, Lubin JH, Dosemeci M. (1995). Non-differentialmisclassification and bias towards the null: a clarification [letter to theeditor]. Occup Environ Med, 52, 557.

Xia Y, Tong H. (2006). Cumulative effects of air pollution on publichealth. Stat Med, 25, 3548–59.

Zanobetti A, Schwartz J. (2008). Is there adaptation in the ozonemortality relationship: a multi-city case-crossover analysis. EnvironHealth, 7, 22.

Zanobetti A, Schwartz J. (2011). Ozone and survival in four cohorts withpotentially predisposing diseases. Am J Respir Crit Care Med, 184,836–41.

DOI: 10.3109/10408444.2013.837864 NAAQS Causal Framework Evaluation 849

Page 22: Evaluation of the causal framework used for setting …...Evaluation of the causal framework used for setting National Ambient Air Quality Standards Julie E. Goodman, Robyn L. Prueitt,

Copyright of Critical Reviews in Toxicology is the property of Taylor & Francis Ltd and itscontent may not be copied or emailed to multiple sites or posted to a listserv without thecopyright holder's express written permission. However, users may print, download, or emailarticles for individual use.