

Biostatistics (2001), 2, 4, pp. 463–471
Printed in Great Britain

On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions

SANDER GREENLAND∗

Department of Epidemiology, UCLA School of Public Health and Department of Statistics, UCLA College of Letters and Science, 22333 Swenson Drive, Topanga CA 90290, USA

[email protected]

KEITH O’ROURKE

Department of Surgery and Clinical Epidemiology Unit, University of Ottawa, Ottawa, Canada

SUMMARY

Results from better quality studies should in some sense be more valid or more accurate than results from other studies, and as a consequence should tend to be distributed differently from results of other studies. To date, however, quality scores have been poor predictors of study results. We discuss possible reasons and remedies for this problem. It appears that ‘quality’ (whatever leads to more valid results) is of fairly high dimension and possibly non-additive and nonlinear, and that quality dimensions are highly application-specific and hard to measure from published information. Unfortunately, quality scores are often used to contrast, model, or modify meta-analysis results without regard to the aforementioned problems, as when used to directly modify weights or contributions of individual studies in an ad hoc manner. Even if quality could be captured in one dimension, use of quality scores in summarization weights would produce biased estimates of effect. Only if this bias were more than offset by variance reduction would such use be justified. From this perspective, quality weighting should be evaluated against formal bias-variance trade-off methods such as hierarchical (random-coefficient) meta-regression. Because it is unlikely that a low-dimensional appraisal will ever be adequate (especially over different applications), we argue that response-surface estimation based on quality items is preferable to quality weighting. Quality scores may be useful in the second stage of a hierarchical response-surface model, but only if the scores are reconstructed to maximize their correlation with bias.

Keywords: Empirical Bayes; Hierarchical regression; Meta-analysis; Mixed models; Multilevel modeling; Pooling; Quality scores; Random-coefficient regression; Random effects; Risk assessment; Risk regression; Technology assessment.

1. INTRODUCTION

A common objection to meta-analytic summaries is that they combine results from studies of disparate quality. ‘Study quality’ is not given formal definition in the literature, but ‘quality appraisals’ usually involve classifying the study according to a number of traits or items that are reported in or determinable from the published paper. These traits are presumed to predict the accuracy of study results, where ‘accuracy’ is a function of both systematic and random error (Rothman and Greenland, 1998). Often, each of these quality items is assigned a number of points based on the a priori judgment of clinical investigators, then summed into a ‘quality score’ that purports to capture the essential features of the multidimensional quality space (Chalmers et al., 1981). This score is then used as a weighting factor in averaging across studies (Fleiss and Gross, 1991; Berard and Bravo, 1998; Moher et al., 1998), or used as a covariate for stratifying or predicting study results.

∗To whom correspondence should be addressed

© Oxford University Press (2001)

Despite scientific objections to amalgamating quality items into a univariate score (Greenland, 1994a,b), and empirical findings of deficiencies in such scores (Emerson et al., 1990; Juni et al., 1999), quality scoring and weighting remains common in meta-analysis (Cho and Bero, 1994; Moher et al., 1995, 1998). Rigorous justifications for weighting by quality scores are lacking or in error (Detsky et al., 1992; Greenland, 1994a,b). Moher et al. (1998) claimed that weighting by quality scores resulted in the least statistical heterogeneity, but the measure of heterogeneity used (deviance) was arbitrarily multiplied by the quality weights; for instance, if the quality scores were all equal to 0.5, the pooled estimate would not change but this new measure of heterogeneity would drop by 50%! Given the lack of valid justification for quality weighting, it is worrisome that some statisticians as well as clinicians continue to use and even promote such scores, even though alternatives have long been available; for example, a weighting scheme based on precision and on magnitude of bias was proposed by Cox (1982).

We will show that, in general, quality-score weighting methods produce biased effect estimates. This is so even when their component quality items capture all traits contributing to bias in study-specific estimates. Furthermore, one cannot rationalize this bias as being a worthwhile trade-off against variance without assessment of its size and its performance against other methods. We argue that quality-score weighting methods need to be replaced by direct modeling of quality dimensions, in part because the best estimate may not be any weighted average of study results. As an alternative to current uses of quality scores we propose mixed-effects (hierarchical) regression analysis of quality items with a summary quality score in the second stage. For this to work well the quality score needs to be constructed to maximize its correlation with bias.

2. A FORMAL EVALUATION OF QUALITY SCORING

The following evaluation is based on a response-surface model for meta-analysis (Rubin, 1990). The model posits that the expected study-specific effect estimate is a function of the true effect in the study, plus bias terms that are functions of quality items. To illustrate, let i = 1, . . . , I index studies, and let X be a row vector of J quality items; for example, X could contain indicators of study design such as use of blinding (of patients, of treatment administration, of outcome evaluation) in meta-analyses of randomized trials, or type of sampling (cohort, case-control, cross-sectional) in meta-analyses of observational studies. Also define δi = expected study-specific effect estimate, θi = true effect of study treatment, xi = value of X for study i. The study-specific estimate δ̂i is usually the observed difference in an outcome measure among treated and untreated groups in study i, such as a difference in means, proportions, or log odds (in the latter case the expectations and biases discussed here are asymptotic); δ̂i may also be a fitted coefficient from a regression of outcome on treatment type. In contrast, θi represents the object of inquiry, the actual impact of treatment on the outcome measure (Rubin, 1990; Vanhonacker, 1996).

Let x0 be the value of X that is a priori taken to represent the ‘best quality’ with respect to the measured quality items. The response-surface approach models δi as a regression function of θi and xi. As a simple example, consider the additive model

δi = θi + bi + γi (1)

where bi is the unknown bias produced by deviation of xi from the ideal x0. A common (implicit) assumption in quality scoring is that this ideal value is specified accurately, in the sense that there would be no bias contributed by measured quality items in a study with the ideal value x0. While this assumption may be reasonable, there will still be the bias component γi of δi that is due to other, unmeasured quality items.

Model 1 has 3I parameters; because there are only I observations δ1, . . . , δI, one needs some severe constraints to identify effects. Many analyses employ models in which neither bi nor γi is present and in which the θi are constrained to be either a constant θ or random effects drawn from a normal distribution with unknown mean and variance. These models have been criticized for ignoring known factors that lead to variation in the effects θi across studies (heterogeneity of effects), such as differences in treatment protocols and subjects (e.g. differences in the age and sex composition of the study populations), as well as for ignoring known sources of bias and for increasing sensitivity of the analysis to publication bias (Cox, 1982, p. 48; L’Abbe et al., 1987, p. 231; Greenland, 1994a; Poole and Greenland, 1999). Those critics advocate instead a ‘meta-regression’ approach to effect variation, in which measured study-specific covariates are used to model variation in θi or δi (e.g. Vanhonacker, 1996; Greenland, 1998).

Although we agree with the aforementioned criticisms, there is some uncertainty fixed-effect summaries do not capture, and the addition of a fictional residual random effect to the model, however unreal, is often the only attempt made to capture that uncertainty. Hence, we begin by assuming that the θi can be treated as random draws from a distribution with mean α and variance σ², and that the γi are zero (i.e. X captures all bias information), as do many authors (e.g. Berard and Bravo, 1998). With these assumptions, model 1 simplifies to

δi = θi + bi = α + bi + εi (2)

where E(θi) = α, E(εi) = 0 and Var(εi) = σ². Here, α is the average effect of treatment in a hypothetical superpopulation of studies, of which the included studies are supposed to be a random sample.

Under model 2, bi may be viewed as the perfect but unknown quality score, in that if it were known one could unbiasedly estimate α from the bias-adjusted estimates δ̂i − bi, assuming that the random effects εi and the sampling-error residuals δ̂i − δi are uncorrelated. In the same manner, unbiased estimation of the bi would also suffice for unbiased estimation of α, assuming that errors in estimating bi are independent of other errors and of model parameters. Conversely, if the average study-specific bias was nonzero, information on the average bi would be necessary as well as sufficient for unbiased estimation of α under model 2.

Quality-scoring methods

Quality-scoring methods replace the vector of quality items X with a unidimensional scoring rule s(X), where s(x) is a fixed function specified a priori that typically varies from 0 for a useless study to s(x0) for a perfect one (e.g. 0–100%). Define si = s(xi) for i = 0, 1, . . . , I. The two main uses of the quality scores si are

Quality weighting: average the study-specific estimates δi after multiplying the usual inverse-variance weight by a function of si. For example, if vi = var(δ̂i | δi), the unconditional variance of δ̂i is vi + σ², the usual random-effects weight for δ̂i is wi = 1/(vi + σ²), and the usual ‘quality-adjusted’ weight is si wi (Berard and Bravo, 1998).

Quality-score stratification or regression: stratify or regress the estimates δi on the si, then estimate α as the predicted value of δ at s0. For example, assuming bias to be proportional to s0 − si, fit the submodel of model 2,

δi = α∗ + (s0 − si)β∗ + ε∗i, (3)

then take α∗, the predicted estimate for a perfect study, as the estimate of the average effect α.


A problem quality weighting shares with most weighting schemes is that it produces biased estimates, even if X contains all bias-related covariates (as in model 2) and our quality score is perfectly predictive of bias. To see this, let ui be any study-specific weight and let a subscript u denote averaging over studies using the ui as weights: for example,

δu ≡ Σi ui δi / Σi ui. (4)

Under model 2,

δu = θu + bu = α + εu + bu. (5)

In words, δu equals the u-average effect θu plus the u-average bias bu. Thus, any unbiased estimator of δu will be biased both for the conditional average effect θu of the studies observed and for the superpopulation average effect α (unless of course no study is biased or there is a fortuitous cancellation of biases). This problem arises whether we use the usual estimator, for which ui = wi ≡ 1/(vi + σ²), or the quality-weighted estimator, for which ui = wi si.

The best one can do with quality weighting is reduce bias by shifting weight to less biased studies. Ordinarily, however, this shift would increase the variance of the weighted estimator, possibly not far enough to achieve the optimal bias-variance trade-off. To appreciate the issue here, if bias reduction were our only goal and the bias bi declined monotonically with increasing quality score si, the minimum-bias estimator based on the scores would have weights ui = 1 for si = max(si), 0 otherwise. In other words, if bias is all that matters and we trust our quality ranking, we should restrict our average solely to studies with the maximal score (even if there is only one such study) and throw out the rest (O’Rourke, 2001). That this restriction is not made must reflect beliefs that there is useful information in studies of less-than-maximal quality, and that some degree of bias is tolerable if it comes with a large variance reduction. This is a pragmatic viewpoint, especially if one considers the inevitable inaccuracies of any real score for bias prediction. It also recognizes that errors due to bias are not necessarily more costly than errors due to chance, and are certainly not infinitely more costly (as is implicitly assumed by limiting one’s attention to unbiased estimators).
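To make the point of equation (5) concrete, the following is a minimal numerical sketch (all study values are hypothetical, not from any dataset): a weighted average of study estimates equals α plus the weighted-average bias, so quality weighting can shrink the bias term, by favoring less biased studies, but cannot remove it.

```python
import numpy as np

# Hypothetical studies: true alpha = 0, biases b_i, within-study variances
# v_i, and quality scores s_i in [0, 1]; expected estimates are
# delta_i = alpha + b_i (sampling error suppressed to isolate the bias algebra).
alpha, sigma2 = 0.0, 0.04
b = np.array([0.00, -0.20, -0.40])   # bias grows as quality drops
v = np.array([0.10, 0.05, 0.02])     # lower-quality studies happen to be larger
s = np.array([1.0, 0.8, 0.5])

delta = alpha + b

w = 1.0 / (v + sigma2)                           # usual random-effects weights
usual = np.sum(w * delta) / np.sum(w)            # u_i = w_i
quality = np.sum(s * w * delta) / np.sum(s * w)  # u_i = s_i * w_i

# Quality weighting moves the average toward the less biased studies,
# but both averages remain biased for alpha = 0.
print(usual, quality)
```

Both averages are negative here; the quality-weighted one is closer to α = 0 but still biased, exactly as equation (5) predicts.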

Under model 2, the quality-score regression estimator α∗ from model 3 will incorporate a bias α∗ − α. This summary bias will be zero if the s0 − si are proportional to the true biases bi; otherwise α∗ can easily be far from α, even if the scores are a monotone function of the bias bi and so properly rank the studies with respect to bias. As a simple numerical example, suppose the δi are log relative risks; there is no true effect (θi = α = εi = 0), so δi = bi; that one-quarter of studies have si = s0 = 100, bi = 0; one-half have si = 90, bi = ln(0.6); and that one-quarter have si = 60, bi = ln(0.5). If vi is constant across studies, then (from a weighted least squares regression of δi on s0 − si) we get α∗ = −0.225, which results in a geometric mean relative risk from the quality regression of exp(α∗) = 0.80, rather than the true relative risk of 1.00.
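The arithmetic of this example can be checked directly. The sketch below reproduces the weighted least squares fit, using the stated study proportions as weights (valid since the vi are equal):

```python
import numpy as np

# Quality scores, biases (log relative risks), and study proportions
# exactly as in the example; delta_i = b_i since the true effect is zero.
s = np.array([100.0, 90.0, 60.0])
b = np.array([0.0, np.log(0.6), np.log(0.5)])
w = np.array([0.25, 0.50, 0.25])

# Weighted least squares fit of model 3: regress delta_i on s0 - s_i, s0 = 100.
X = np.column_stack([np.ones(3), 100.0 - s])
W = np.diag(w)
alpha_star, beta_star = np.linalg.solve(X.T @ W @ X, X.T @ W @ b)

print(alpha_star)          # about -0.225
print(np.exp(alpha_star))  # about 0.80, versus the true relative risk of 1.00
```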

3. PROBLEMS WITH QUALITY SCORES AS BIAS PREDICTORS

When a quality scoring rule contains many items irrelevant to bias (e.g. an indicator of whether power calculations were reported), these items will disproportionately reduce some of the resulting scores. Inclusion of such items can distort rankings as well. The bias α∗ − α can be viewed as arising from misspecification of the true bias function b(X) by a one-parameter model [s0 − s(X)]β∗. Given the multidimensional nature of X, one should see that this surrogate involves an incredibly strict constraint on b(X), with misspecification all but guaranteed.

The scoring rule s(X) is usually a weighted sum Xq = Σj qj Xj of the J quality items in X, where q is a column vector of ‘quality weights’ specified a priori. Inspection of common scoring rules reveals that the weights qj are fairly arbitrary item scores assigned by a few clinical experts, without reference to data on the relative importance of the items in determining bias. ‘Score validations’ are typically circular, in that the criteria for validation hinge on score agreement with opinions about which studies are ‘high quality,’ not on actual measurements of bias.

Perhaps the worst problem with common quality scores is that they fail to account for the direction of bias induced by quality deficiencies (Greenland, 1994b). This failing can virtually nullify the value of a quality score in regression analyses. As a simple numerical example, suppose that trials are given 50 points out of 100 for having placebo controls and 50 points out of 100 for having validated outcome assessment, and that lack of outcome validation results in nondifferential misclassification. Suppose also that lack of placebo controls induces a bias in the risk difference of 0.1 (because then the placebo effect occurs only in the treated group), that lack of outcome validation induces a bias of −0.1 (because the resulting misclassification induces a bias towards the null), and that the biases are approximately additive. If n1 studies lack placebo controls but have validated outcomes and n2 lack validated outcomes but have placebo controls, the average biases among studies with quality scores of 0, 50, and 100 will be 0, 0.1(n1 − n2)/(n1 + n2), and 0. Thus, the score will not even properly rank studies with respect to bias, even though it will properly rank studies with respect to number of deficiencies.
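The cancellation can be verified with a few lines of arithmetic (n1 and n2 are arbitrary illustrative counts, not from the text):

```python
# Two 50-point quality items: lacking placebo controls biases the risk
# difference by +0.1; lacking outcome validation biases it by -0.1;
# the biases are taken as additive, following the example.
n1, n2 = 3, 1  # hypothetical counts of the two kinds of 50-point trials

bias_score_100 = 0.0                                  # no deficiencies
bias_score_50 = (n1 * 0.1 + n2 * (-0.1)) / (n1 + n2)  # mix of +0.1 and -0.1
bias_score_0 = 0.1 + (-0.1)                           # both deficiencies cancel

# Whenever n1 != n2 the middle score stratum carries the largest average
# bias, so the score misranks the studies with respect to bias.
print(bias_score_100, bias_score_50, bias_score_0)
```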

While the preceding example is grossly oversimplified, we think such bias cancellation phenomena are inevitable when quality scores are composed of many items combined without regard to bias direction (e.g. those in Cho and Bero (1994) or Moher et al. (1995)). To avoid this problem, a quality score would have to be reconstructed as a bias score with signs attached to the contributing items. The resulting summary score should then be centered so that zero indicates no bias expected a priori, and values above and below zero should indicate relative amounts of expected positive and negative bias in δi. Thus, if the δi are log relative risks but expected biases are elicited on the relative-risk scale, the latter expectations must be translated to the log scale when constructing the quality scores.

4. BEYOND QUALITY SCORES

Expert opinions may be helpful in determining items related to bias (i.e. what to include in X), although some checklist items are indefensible (e.g. study power conveys no information beyond that in the variance vi, which is already used in the study weight). Expert opinions may even provide a rough idea of the rankings of the items in importance. Nonetheless, a more robust approach, one less dependent on the vagaries and prejudices of such opinions, would employ a more flexible surrogate for bias than that provided by quality scores.

Quality-item regression

If the number of quality items is not too large relative to the number and variety of studies in the meta-analysis (which may often be the case upon dropping items irrelevant to bias), one can fit a quality-item regression model, such as

δi = α† + (x0 − xi)β† + ε†i (6)

where β† is a vector of unknown parameters. This approach is equivalent to treating the item weight vector q as an unknown parameter (with β∗ absorbed into q). Conversely, quality-score regression (model 3) is just a special case of model 6 with β† specified a priori up to an unknown scalar multiplier β∗, as may be seen from

(s0 − si)β∗ = (x0q − xi q)β∗ = (x0 − xi)qβ∗. (7)


In other words, quality-score regression employs model 6 with a constraint β† = qβ∗, with β∗ unknown. From a Bayesian perspective, this constraint is a ‘dogmatic’ prior distribution on β† that is concentrated entirely on the line qβ∗ in the J-dimensional parameter space for β†.

Model 6 also corresponds to a J-parameter linear model for the bias function b(X). Under model 2, the bias inherent in this assumption is α† − α. Because model 3 is a submodel of model 6, however, the bias α† − α will be no more than the bias α∗ − α inherent in the quality-score regression; on the other hand, the variance of α† may be much larger than that of α∗.
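A small constructed example (hypothetical items and biases, not from the text) illustrates the contrast: when the a priori weights q misstate the items' bias contributions, the score regression (model 3) retains a bias in its intercept that the item regression (model 6) does not.

```python
import numpy as np

# Two binary quality deviations x0 - x_i with true bias coefficients
# beta = (0.1, -0.1); the prespecified score weights q = (0.5, 0.5)
# ignore the opposite signs. True alpha = 0; no sampling error.
X_dev = np.array([[0, 0], [1, 0], [1, 0], [1, 0], [0, 1], [1, 1]], float)
delta = X_dev @ np.array([0.1, -0.1])
q = np.array([0.5, 0.5])

# Model 6: intercept plus one coefficient per quality item.
D6 = np.column_stack([np.ones(len(delta)), X_dev])
alpha6 = np.linalg.lstsq(D6, delta, rcond=None)[0][0]

# Model 3: intercept plus the single score term s0 - s_i = (x0 - x_i) q.
D3 = np.column_stack([np.ones(len(delta)), X_dev @ q])
alpha3 = np.linalg.lstsq(D3, delta, rcond=None)[0][0]

print(alpha6, alpha3)  # alpha6 recovers 0; alpha3 is pulled away from 0
```

Here the two item biases cancel in the score, so the score carries no information about the net bias and the model 3 intercept absorbs the average bias (about 0.033) instead of returning α = 0.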

A compromise model

The possible bias-variance trade-off between α† and α∗ suggests use of an empirical-Bayes compromise. One such approach relaxes the dogmatism of the quality-score constraint β† = qβ∗ by using this constraint as a second-stage mean for β†. For example, we could treat β† as a random-coefficient vector with mean qβ∗ and covariance matrix τ²D², where β∗ and τ² are second-stage parameters and D is a diagonal matrix of prespecified scale factors. D is often implicitly left as an identity matrix, but detailed subject-matter specifications are possible; see Witte et al. (1994) for an example. Reparametrizing by β† = qβ∗ + ν where E(ν) = 0, cov(ν) = τ²D², the compromise model is then

δi = α† + (x0 − xi)(qβ∗ + ν) + ε†i = α† + (s0 − si)β∗ + (x0 − xi)ν + ε†i. (8)

This model may be fit by penalized quasi-likelihood (Breslow and Clayton, 1993; Greenland, 1997), generalized least squares (Goldstein, 1995), or Markov chain Monte Carlo methods (which require explicit specification of distributional families for ν and ε†i). One can also pre-specify τ² based on subject matter (Greenland, 1993, 2000); models 3 and 6 can then be seen as limiting cases of model 8 as τ² → 0 and τ² → ∞.

Theory, simulations, and applications suggest that the compromise model (8) should produce more accurate estimators than the extremes of models 3 or 6 if J ≥ 4 and the score s(X) correlates well with the bias b(X); otherwise, if the score correlates poorly with bias, the unconstrained model (6) or one that shrinks coefficients to zero might do better (Morris, 1983; Greenland, 1993, 1997; Witte and Greenland, 1996; Breslow et al., 1998). Hence, second-stage use of quality scores requires reconstructing the scores to maximize their correlation with bias, as discussed above. That reconstruction, a critical point of subject-matter and logical input to the meta-analytic model, thus appears to us as essential for any intelligent use of quality scores.
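As a sketch of how model (8) interpolates between these extremes, the compromise can be fitted by ordinary least squares after augmenting the data with pseudo-observations that shrink ν toward 0 (all numbers hypothetical; D is taken as the identity, and τ² is pre-specified rather than estimated):

```python
import numpy as np

def fit_compromise(delta, X_dev, q, tau2):
    """Fit model (8) by augmented least squares: parameter columns are
    (alpha, beta_star, nu_1..nu_J), with J pseudo-rows encoding the
    second-stage constraint nu_j ~ N(0, tau2) (D = identity)."""
    n, J = X_dev.shape
    rows = np.column_stack([np.ones(n), X_dev @ q, X_dev])
    prior = np.column_stack([np.zeros((J, 2)), np.eye(J)]) / np.sqrt(tau2)
    A = np.vstack([rows, prior])
    y = np.concatenate([delta, np.zeros(J)])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Hypothetical data: two items with bias coefficients (0.1, -0.1),
# score weights q = (0.5, 0.5), true alpha = 0, no sampling error.
X_dev = np.array([[0, 0], [1, 0], [1, 0], [1, 0], [0, 1], [1, 1]], float)
delta = X_dev @ np.array([0.1, -0.1])
q = np.array([0.5, 0.5])

for tau2 in (1e-8, 1.0, 1e8):
    alpha_hat = fit_compromise(delta, X_dev, q, tau2)[0]
    print(tau2, alpha_hat)  # moves from the model-3 fit toward the model-6 fit
```

With τ² near 0 the pseudo-rows pin ν at 0 and the intercept matches the quality-score regression (model 3); with τ² very large the penalty vanishes and the intercept matches the quality-item regression (model 6), in line with the limiting cases noted above.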

5. DISCUSSION

Models 3, 6 and 8 are based on an assumption that the true effect θi varies in a simple random manner across studies; this assumption rarely has any support or even credibility in the light of background knowledge (Greenland, 1994a, 1998; Poole and Greenland, 1999). Nonetheless, this problem can be addressed by adding potential effect modifiers (factors that affect θi) to the model. Let Z be a row vector of modifiers, such as descriptors of the treatment protocol and the patient population, and let zi be the value of Z in study i. The true effects are now modeled as a function of Z, such as

θi = α† + ziλ† + ε†i, (9)

which expands model 8 to

δi = α† + ziλ† + (s0 − si)β∗ + (x0 − xi)ν + ε†i. (10)


This model may be extended further by specifying a second-stage model for λ†, so that λ† (like β†) may have both fixed and random components.

All the models described above are special cases of mixed-effects meta-regression models described and studied by earlier authors (Cochran, 1937; Yates and Cochran, 1938; Cox, 1982; Raudenbush and Bryk, 1985; Stram, 1996; Platt et al., 1999). From this perspective, common quality-score methods are not only biased; they also do not exploit the full flexibility of these models. This flexibility offers an alternative to the strict reliance on expert judgment inherent in common methods, and should be sufficient to justify a shift in current practice. There are several major obstacles to this shift, however. Among them are a general lack of familiarity with mixed models, especially in the ‘semi-Bayesian’ form that is arguably most useful in epidemiology and medical research (Greenland, 1993, 2000), and a lack of generalized-linear mixed-modeling modules within most packages used by health researchers. We regard as unrealistic the response we commonly get from academic statisticians, to the effect that everyone should just be using BUGS or some such Markov chain Monte Carlo software; this response seems oblivious to the subtle specification and convergence issues that arise in such use. Much more transparent and stable methods are available, such as mixed modeling with data augmentation priors (Bedrick et al., 1996), which can be implemented easily with ordinary regression software by adding ‘prior data’ to the data set (Greenland, 2001; Greenland and Christensen, 2001). If necessary, one may use extensions of likelihood methods to model variation in heterogeneity (Smyth and Verbyla, 1999).

Finally, we note that the overall validity of any meta-analysis is limited by the detail of information that can be obtained on included studies, and by the completeness of study ascertainment (Light and Pillemer, 1984; Begg and Berlin, 1988). Such problems will be best addressed by improved editorial requirements for reporting (Meinert, 1998) and by efforts to identify all studies and enter them into accessible online registries.

REFERENCES

BERARD, A. AND BRAVO, G. (1998). Combining studies using effect sizes and quality scores: application to bone loss in postmenopausal women. Journal of Clinical Epidemiology 51, 801–807.

BEDRICK, E. J., CHRISTENSEN, R. AND JOHNSON, W. (1996). A new perspective on generalized linear models. Journal of the American Statistical Association 91, 1450–1460.

BEGG, C. B. AND BERLIN, J. A. (1988). Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society, Series A 151, 419–463.

BRESLOW, N. E. AND CLAYTON, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.

BRESLOW, N., LEROUX, B. AND PLATT, R. (1998). Approximate hierarchical modelling of discrete data in epidemiology. Statistical Methods in Medical Research 7, 49–62.

CHALMERS, T. C., SMITH, H. JR., BLACKBURN, B., SILVERMAN, B., SCHROEDER, B., REITMAN, D. AND AMBROZ, A. (1981). A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2, 31–49.

CHO, M. K. AND BERO, L. A. (1994). Instruments for assessing the quality of drug studies published in the medical literature. Journal of the American Medical Association 272, 101–104.

COCHRAN, W. G. (1937). Problems arising in the analysis of a series of similar experiments. Journal of the Royal Statistical Society 4, 102–118.

COX, D. R. (1982). Combination of data. In Kotz, S. and Johnson, N. L. (eds), Encyclopedia of Statistical Science 2, New York: Wiley.


DETSKY, A. S., NAYLOR, C. D., O’ROURKE, K., MCGEER, J. A. AND L’ABBE, K. A. (1992). Incorporatingvariations in the quality of individual randomized trials into meta-analysis. Journal of Clinical Epidemiology 45,255–265.

EMERSON, J. D., BURDICK, E., HOAGLIN, D. C., MOSTELLER, F. AND CHALMERS, T. C. (1990). An empiricalstudy of the possible relation of treatment differences to quality scores in controlled randomized clinical trials.Controlled Clinical Trials 11, 339–352.

FLEISS, J. L. AND GROSS, A. J. (1991). Meta-analysis in epidemiology, with special reference to studies of theassociation between exposure to environmental tobacco smoke and lung cancer: a critique. Journal of ClinicalEpidemiology 44, 127–139.

GOLDSTEIN, H. (1995). Multilevel Statistical Models, 2nd edn. London: Edward Arnold.

GREENLAND, S. (1993). Methods for epidemiologic analyses of multiple exposures: a review and a comparativestudy of maximum-likelihood, preliminary testing, and empirical-bayes regression. Statistics in Medicine 12, 717–736.

GREENLAND, S. (1994a). A critical look at some popular meta-analytic methods. American Journal of Epidemiology140, 290–296.

GREENLAND, S. (1994b). Quality scores are useless and potentially misleading. American Journal of Epidemiology140, 300–301.

GREENLAND, S. (1997). Second-stage least squares versus penalized quasi-likelihood for fitting hierarchical modelsin epidemiologic analysis. Statistics in Medicine 16, 515–526.

GREENLAND, S. (1998). Meta-analysis. In Rothman, K. J. and Greenland, S. (eds), Modern Epidemiology,Chapter 32. Philadelphia: Lippincott, pp. 643–672.

GREENLAND, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics 56, 915–921.

GREENLAND, S. (2001). Putting background information about relative risks into conjugate priors. Biometrics 57, inpress.

GREENLAND, S. AND CHRISTENSEN, R. (2001). Data-augmentation priors for bayesian and semi-bayes analyses ofconditional logistic and proportional hazards regression. Statistics in Medicine 20, in press.

JÜNI, P., WITSCHI, A., BLOCH, R. AND EGGER, M. (1999). The hazards of scoring the quality of clinical trials for meta-analysis. Journal of the American Medical Association 282, 1054–1060.

L’ABBE, K. A., DETSKY, A. S. AND O’ROURKE, K. (1987). Meta-analysis in clinical research. Annals of Internal Medicine 107, 224–233.

LIGHT, R. J. AND PILLEMER, D. B. (1984). Summing Up: The Science of Reviewing Research. Cambridge, MA: Harvard University Press.

MEINERT, C. L. (1998). Beyond CONSORT: the need for improved reporting standards for clinical trials. Journal of the American Medical Association 279, 1487–1489.

MOHER, D., JADAD, A. R., NICHOL, G., PENMAN, M., TUGWELL, P. AND WALSH, S. (1995). Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Controlled Clinical Trials 16, 62–73.

MOHER, D., PHAM, B., JONES, A., COOK, D. J., JADAD, A. R., MOHER, M., TUGWELL, P. AND KLASSEN, T. P. (1998). Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352, 609–613.

MORRIS, C. N. (1983). Parametric empirical bayes: theory and applications (with discussion). Journal of the American Statistical Association 78, 47–65.

O’ROURKE, K. (2001). Meta-analysis: conceptual issues of addressing apparent failure of individual study replication or ‘inexplicable’ heterogeneity. In Ahmed, S. E. and Reid, N. (eds), Empirical Bayes and Likelihood Inference, New York: Springer, pp. 161–183.

PLATT, R. W., LEROUX, B. G. AND BRESLOW, N. (1999). Generalized linear mixed models for meta-analysis. Statistics in Medicine 18, 643–654.

POOLE, C. AND GREENLAND, S. (1999). Random-effects meta-analyses are not always conservative. American Journal of Epidemiology 150, 469–475.

RAUDENBUSH, S. W. AND BRYK, A. S. (1985). Empirical-bayes meta-analysis. Journal of Educational Statistics 10, 75–98.

ROTHMAN, K. J. AND GREENLAND, S. (1998). Accuracy considerations in study design. In Rothman, K. J. and Greenland, S. (eds), Modern Epidemiology, 2nd edn. Philadelphia: Lippincott, pp. 135–145.

RUBIN, D. B. (1990). A new perspective. In Wachter, K. W. and Straf, M. L. (eds), The Future of Meta-Analysis, New York: Russell Sage, pp. 155–166.

SCHULZ, K. F., CHALMERS, I., HAYES, R. J. AND ALTMAN, D. G. (1995). Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association 273, 408–412.

SENN, S. (1996). The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t. In Hansen and de Ridder (eds), Liber Amicorum Roel van Strik, Rotterdam: Erasmus University, pp. 93–100.

SMYTH, G. K. AND VERBYLA, A. P. (1999). Adjusted likelihood methods for modeling dispersion in generalized linear models. Environmetrics 10, 695–709.

STRAM, D. O. (1996). Meta-analysis of published data using a linear mixed-effects model. Biometrics 52, 536–554.

VANHONACKER, W. R. (1996). Meta-analysis and response surface extrapolation: a least squares approach. American Statistician 50, 294–299.

WITTE, J. S., GREENLAND, S., HAILE, R. W. AND BIRD, C. L. (1994). Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 5, 612–621.

WITTE, J. S. AND GREENLAND, S. (1996). Simulation study of hierarchical regression. Statistics in Medicine 15, 1161–1170.

YATES, F. AND COCHRAN, W. G. (1938). The analysis of groups of experiments. Journal of Agricultural Science 28, 556–580.

[Received 6 November, 2000; revised 4 April, 2001; accepted for publication 6 April, 2001]