

Chinese Journal of Evidence-Based Medicine, 2007, Vol. 7, No. 1, www.cjebm.org.cn


An Emerging Consensus on Grading Recommendations?

Gordon Guyatt§§* MSc, M.D.
* Department of Medicine, McMaster University, Hamilton, Ontario, Canada
§§ Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

Gordon Guyatt MSc, M.D. An Emerging Consensus on Grading Recommendations? Chin J Evid-based Med, 2007, 7(1): 1–8.

Clinical practice guidelines: The challenge of grading

Clinical practice guidelines have improved in quality over the past ten years by adhering to a few basic principles, such as conducting thorough systematic reviews of relevant evidence and grading both the recommendations and the quality of the underlying evidence. The large number of systems for rating the quality of evidence and recommendations that have emerged is, however, confusing[1].

An international group of guideline developers, systematic reviewers, and clinical epidemiologists has taken on the ambitious task of helping resolve the confusion among the different systems of rating evidence and recommendations. The group has wide representation from many organizations, including the Agency for Healthcare Research and Quality in the USA, the National Institute for Clinical Excellence for England and Wales, and the World Health Organization. Developing a new uniform rating system is challenging because all systems have limitations and because many organizations have invested a great deal of time and effort in developing their own rating systems and are understandably reluctant to adopt a new one.

The GRADE working group first published the results of its work in 2004 in the British Medical Journal[2]. Simpler, clinically oriented descriptions have been published subsequently[3,4]. GRADE has taken care to ensure its suggested system is simple to use and applicable to a wide variety of clinical recommendations that span the full spectrum of medical specialties and clinical care.

Quality of evidence

The GRADE system classifies recommendations into one of two levels, strong and weak, and quality of evidence into one of four levels: high, moderate, low and very low. Evidence based on randomized controlled trials (RCTs) begins with a top rating on GRADE’s 4-level quality of evidence classification (Table 1). GRADE takes into account, however, that not all RCTs are alike, and that five categories of limitations of RCTs may compromise the quality of their evidence (Table 2).

Table 1 Quality of evidence grades and their definitions

| Grade | Definition |
|---|---|
| High | Further research is very unlikely to change our confidence in the estimate of effect. |
| Moderate | Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate. |
| Low | Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate. |
| Very low | Any estimate of effect is very uncertain. |

Table 2 Factors in deciding on confidence in estimates of benefits, risks, burden, and costs

Factors that may decrease the quality of evidence based on randomized controlled trials (RCTs)
1) Poor quality of planning and implementation of the available RCTs suggesting high likelihood of bias
2) Inconsistency of results
3) Indirectness of evidence
4) Sparse evidence
5) Reporting bias (including publication bias)

Factors that may increase the quality of evidence based on observational studies
1) Large magnitude of effect
2) All plausible confounding would reduce a demonstrated effect
3) Dose-response gradient

Contact Address: Dr. Gordon Guyatt, McMaster University, Faculty of Health Sciences, Clinical Epidemiology & Biostatistics, Room 2C12, 1200 Main Street West Hamilton, ON, L8N 3Z5. Tel: (905) 525-9140, x22900; Fax: (905) 524-3841. Email: [email protected]

1) First, quality decreases if most of the evidence comes from RCTs with serious methodological flaws such as lack of allocation concealment or blinding, large loss to follow-up, or stopping early for benefit. How lack of blinding can influence the grading is exemplified by a recommendation to treat heparin-induced thrombocytopenia (HIT) complicated by thrombosis with danaparoid sodium. The randomized trial evidence for danaparoid use in HIT comes from an unblinded trial in which the outcome was the clinicians’ assessment of when the thromboembolism had resolved, a subjective judgement. As a result, an ACCP guideline panel rated the quality of the evidence as moderate rather than high[5].

2) A second reason for downgrading is inconsistency of results. When several RCTs yield widely differing estimates of treatment effect (heterogeneity or variability in results), investigators look for explanations for that heterogeneity. For instance, drugs may have larger relative effects in sicker, or in less sick, populations. When heterogeneity exists but investigators fail to identify a plausible explanation, the quality of evidence from even rigorous RCTs decreases. For example, RCTs of pentoxifylline in patients with intermittent claudication have shown conflicting results that so far defy explanation. Acknowledging the unexplained heterogeneity, a guideline panel rated the quality of the evidence for pentoxifylline as moderate, rather than high[6].

3) Indirectness may compromise the quality of evidence. Evidence is indirect if there are no head-to-head comparisons between therapeutic alternatives. For instance, drug benefit plans or formularies have to choose between funding a number of bisphosphonates, including alendronate and risedronate, for prevention of osteoporotic fractures. Unfortunately, the decision must be based on a comparison of trials evaluating alendronate against placebo and risedronate against placebo, rather than on direct comparisons of alendronate and risedronate. Evidence may also be indirect if the population, the intervention, or the outcome differs from the one of interest: the population (we are interested in valvular atrial fibrillation, but all RCTs are in non-valvular atrial fibrillation), the intervention (we’d like to know about relatively low-dose angiotensin-converting enzyme inhibition, but all trials used higher doses), or the outcome (we’d like to know about long-term effectiveness, but all trials have only short follow-up durations).

As an example of differences in populations, avian flu is a disease caused by influenza A (H5N1) virus and is associated with a high case fatality (approximately 33% to more than 50% of patients die). Potential exposure to the virus raises the question of chemoprophylaxis. Pharmacological interventions could include the use of antiviral neuraminidase inhibitors such as oseltamivir. Oseltamivir, however, has been studied only in patients with seasonal influenza caused by a different influenza A virus, a quite different patient population.

4) When total sample size is small and outcome events are few in number, our uncertainty about estimates of benefit and risk increases. For instance, a well-designed and rigorously conducted RCT addressed the use of nadroparin, a low molecular weight heparin, in patients with cerebral venous sinus thrombosis. Of 30 treated patients, three had a poor outcome, as did six of 29 patients in the control group. The investigators’ analysis suggests a 38% relative risk reduction (RRR) of a poor outcome, but the result was not statistically significant. GRADE continues to debate the appropriate thresholds for decreasing strength of inference: how wide a confidence interval is too wide, and how few events are too few?

5) A final reason for downgrading quality of evidence is a high likelihood of reporting bias. The quality of evidence may be reduced if investigators fail to report studies (typically those that show no effect) or outcomes (typically those that may be harmful or for which no effect was observed), or if other reasons lead to results not being reported. Unfortunately, guideline panels are still required to make guesses about the likelihood of reporting bias. A prototypical situation that should elicit suspicion of reporting bias is when published evidence includes a number of small trials, all of which are industry funded[7]. For example, 14 trials of flavonoids in patients with hemorrhoids have shown apparently large benefits, but enrolled a total of only 1,432 patients[8]. The heavy involvement of sponsors in most of these trials raises the question of whether unpublished trials suggesting no benefit exist.
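The sparse-data example in point 4 can be checked with a few lines of arithmetic. The sketch below computes the crude risk ratio and an approximate 95% confidence interval from the counts quoted there (3 of 30 versus 6 of 29), using the standard log-normal approximation for a risk ratio. The function name is illustrative and not part of GRADE. The crude relative risk reduction from these counts, roughly 52%, is not the 38% attributed to the investigators' analysis; the text does not say how that figure was derived, so the sketch makes no attempt to reproduce it. What the calculation does illustrate is the point of the paragraph: with so few events the interval is wide and includes no effect.

```python
import math


def risk_ratio_with_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Crude risk ratio with an approximate confidence interval.

    Uses the usual log-normal approximation: the standard error of
    log(RR) is sqrt(1/a - 1/n1 + 1/c - 1/n2). Assumes no zero cells.
    """
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    rr = risk_treated / risk_control
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Counts as quoted in the text: poor outcome in 3 of 30 treated, 6 of 29 control.
rr, lower, upper = risk_ratio_with_ci(3, 30, 6, 29)
crude_rrr = 1 - rr

print(f"risk ratio {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print(f"crude relative risk reduction {crude_rrr:.0%}")
print("interval includes no effect (RR = 1):", lower < 1 < upper)
```

An interval that runs from a large benefit to possible harm is the kind of imprecision that would lead a panel to consider downgrading for sparse evidence.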

While observational studies (e.g. cohort studies) start with a “low quality” rating, they may be graded upwards if the magnitude of the treatment effect is very large (e.g., hip replacement for severe hip osteoarthritis), if there is evidence of a dose response relationship, or if all apparent confounders would decrease the magnitude of the treatment effect (Table 2). For example, a systematic review revealed higher mortality in for-profit than in not-for-profit hospitals[9]. This result was found despite the fact that for-profit hospitals usually admit healthier patients with a higher socio-economic status and have more resources at their disposal. These potential confounders, if anything, favor for-profit hospitals. If such confounders were taken into account, the magnitude of effect favoring not-for-profit hospitals would be even larger.
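The rating logic described in this section can be summarised as a small sketch. It assumes, purely for illustration, that each factor from Table 2 moves the rating by exactly one level and that the rating stays within the four-level scale; the text does not specify step sizes, and in practice guideline panels exercise judgement. The function and factor names are hypothetical labels for the Table 2 entries, not an official GRADE vocabulary.

```python
# Four-level scale from Table 1, ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]

# Starting points described in the text.
START = {"rct": "high", "observational": "low"}

# Hypothetical labels for the Table 2 factors.
DOWNGRADE = frozenset({
    "risk_of_bias", "inconsistency", "indirectness",
    "sparse_evidence", "reporting_bias",
})
UPGRADE = frozenset({
    "large_effect", "confounding_would_reduce_effect", "dose_response",
})


def rate_quality(design, downgrades=(), upgrades=()):
    """Return a quality level, moving one step per factor (an assumption)."""
    unknown = (set(downgrades) - DOWNGRADE) | (set(upgrades) - UPGRADE)
    if unknown:
        raise ValueError(f"unknown factor(s): {sorted(unknown)}")
    index = LEVELS.index(START[design])
    index += len(set(upgrades)) - len(set(downgrades))
    index = max(0, min(index, len(LEVELS) - 1))  # stay on the scale
    return LEVELS[index]


# Unblinded danaparoid trial: RCT evidence with one limitation.
print(rate_quality("rct", downgrades=["risk_of_bias"]))
# Pentoxifylline: RCT evidence with unexplained heterogeneity.
print(rate_quality("rct", downgrades=["inconsistency"]))
# Observational evidence where confounding would reduce the effect.
print(rate_quality("observational", upgrades=["confounding_would_reduce_effect"]))
```

The first two calls reproduce the "moderate rather than high" ratings reported for danaparoid and pentoxifylline. The third only shows the direction of the adjustment, since the text does not state a final rating for the hospital comparison.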

Strength of Recommendations

As noted, the GRADE system offers 2 levels of recommendations: strong and weak. When an intervention’s benefits clearly outweigh its risks and burden, or clearly do not, strong recommendations are warranted. On the other hand, when the tradeoff between benefits and risks is less certain, either because of low quality evidence or because high quality evidence suggests benefits and risks are closely balanced, weak recommendations become appropriate.

This 2-level approach is easy to put into practice. For strong recommendations in which it is clear that benefits far outweigh risks, or risks far outweigh benefits, virtually all patients will make the same choice (e.g. aspirin in the setting of acute myocardial infarction). In such instances, physicians can confidently recommend treatment. For weak recommendations, different patients may choose different approaches to treatment. One example is the use of hormone replacement therapy for menopausal hot flashes. Under these circumstances, clinicians must know the evidence and communicate the evidence to their patients, or conduct a detailed inquiry to ensure their recommendations are consistent with patients’ values and preferences[10].

Determinants of strength of recommendation

Beyond the quality of the evidence, a number of other factors may bear on whether recommendations are strong or weak (Table 3).

The choice of adjusted dose warfarin versus aspirin for prevention of stroke in patients with atrial fibrillation illustrates a number of the factors that will influence the strength of a recommendation. A systematic review and meta-analysis found a relative risk reduction (RRR) of 46% in all strokes with warfarin versus aspirin[11]. This large effect supports a strong recommendation for warfarin. Furthermore, the relatively narrow 95% confidence interval (RRR 29 to 57%) suggests that warfarin provides a RRR of at least 29%, and further supports a strong recommendation.

At the same time, warfarin is associated with an inevitable burden of keeping dietary intake of vitamin K constant, monitoring the intensity of anticoagulation with blood tests, and living with the increased risk of both minor and major bleeding. Most patients, however, are much more stroke averse than they are bleeding averse[12]. As a result, almost all patients with high risk of stroke would choose warfarin, suggesting the appropriateness of a strong recommendation.

This last point emphasizes the importance of the patient’s baseline risk, sometimes called control event rate, of the adverse outcome that treatment is designed to avoid (Table 3). Consider a 65-year-old patient with atrial fibrillation and no other risk factors for stroke. This individual’s risk of stroke in the next year is approximately 2%. From the relative risk reduction and this baseline risk one can derive the absolute magnitude of the effect (Table 3). Dose-adjusted warfarin can, relative to aspirin, reduce the risk to approximately 1%, for an absolute risk reduction of 1% (2% – 1%). Some patients who are very stroke averse may consider the downsides of taking warfarin well worth it. Given the relatively narrow confidence interval around the absolute risk reduction, which follows from the confidence interval around the relative risk reduction, one could make a strong recommendation to use warfarin if all patients were equally stroke averse. Some patients are, however, likely to consider the benefit not worth the risks and inconvenience. When, across the range of patient values, fully informed patients are liable to make different choices, editors should offer weak (Grade 2) recommendations.
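The step from relative to absolute effect is simple enough to write out. The sketch below multiplies the 2% baseline risk by the relative risk reduction of 46% and by the bounds of its confidence interval (29% and 57%), all taken from the text. It assumes, as the text does, that the relative effect applies unchanged at this baseline risk. The number needed to treat is a standard derived quantity added here for illustration; it does not appear in the article, and the function names are hypothetical.

```python
def absolute_risk_reduction(baseline_risk, relative_risk_reduction):
    """Absolute risk reduction, assuming a constant relative effect."""
    return baseline_risk * relative_risk_reduction


def number_needed_to_treat(arr):
    """Patients treated for one year to prevent one stroke (illustrative)."""
    return 1 / arr


baseline_risk = 0.02  # one-year stroke risk for the 65-year-old in the text

# Point estimate and 95% confidence bounds of the RRR quoted in the text.
for label, rrr in [("point estimate", 0.46),
                   ("lower bound", 0.29),
                   ("upper bound", 0.57)]:
    arr = absolute_risk_reduction(baseline_risk, rrr)
    nnt = number_needed_to_treat(arr)
    print(f"{label}: ARR {arr:.2%}, NNT about {nnt:.0f}")
```

At the point estimate the absolute reduction is a little under one percentage point, which the text rounds to 1%. The same relative effect applied to a higher baseline risk would give a proportionally larger absolute benefit, which is why baseline risk bears on the strength of a recommendation.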

Table 3 Factors in deciding on a strong or weak recommendation

| Issue | Example |
|---|---|
| Methodological quality of the evidence supporting estimates of likely benefit, and likely risk, inconvenience, and costs | Many high quality randomized trials have demonstrated the benefit of inhaled steroids in asthma, while only case series have examined the utility of pleurodesis in pneumothorax |
| Importance of the outcome that treatment prevents | Preventing post-phlebitic syndrome with thrombolytic therapy in DVT, in contrast to preventing death from PE |
| Magnitude of treatment effect | Clopidogrel versus aspirin leads to a smaller stroke reduction in TIA (8.7% RRR*[13]) than anticoagulation versus placebo in AF (68% RRR) |
| Precision of estimate of treatment effect | ASA versus placebo in AF has a wider confidence interval than ASA for stroke prevention in patients with TIA |
| Risks associated with therapy | ASA plus clopidogrel in acute coronary syndromes has a higher risk of bleeding than ASA alone |
| Burdens of therapy | Taking adjusted-dose warfarin is associated with a higher burden than taking aspirin; warfarin requires monitoring the intensity of anticoagulation and a relatively constant dietary vitamin K intake |
| Risk of target event | Some surgical patients are at very low risk of post-operative DVT and PE, while other surgical patients have considerably higher rates of DVT and PE |
| Costs | Clopidogrel has a much higher cost than aspirin in patients with TIA |
| Varying values | Most young, healthy people will put a high value on prolonging their lives (and thus will incur suffering to do so); the elderly and infirm are likely to vary in the value they place on prolonging their lives (and may vary in the suffering they are ready to experience to do so) |

*Relative risk reduction

Conclusion

The GRADE system is rigorous in its methodology, yet practical to use. It is neither too complex nor misleadingly simple. The GRADE group’s vision was extremely ambitious: to have a system of evaluating quality of evidence and grading recommendations that would become the international standard. Given the rapid adoption of GRADE in the international community since the original publication in the BMJ in 2004, it seems that this vision may come true.

The Cochrane Collaboration is moving to adopt the GRADE approach to rating of methodological quality. The Endocrine Society was the first North American organization to adopt GRADE for its recommendations, while another important organization, the American College of Chest Physicians (ACCP), has adopted a slightly modified version of GRADE. Other North American organizations have followed, including the very prestigious American College of Physicians and the almost equally prestigious American Thoracic Society, the Surviving Sepsis Campaign and the Ontario Ministry of Health Medical Advisory Secretariat. In what might be the most important advance for GRADE dissemination in North America, the extraordinarily successful electronic medical text UpToDate is formally grading recommendations using the GRADE approach. European organizations that have endorsed GRADE include the European Society of Thoracic Surgery, the Agenzia sanitaria regionale in Bologna, Italy, and the German Agency for Quality in Medicine. International groups that have endorsed GRADE include Kidney Disease: Improving Global Outcomes, the Surviving Sepsis Campaign, and the Guidelines International Network. The British Medical Journal is encouraging authors who submit guidelines to it to use the GRADE approach.

If the momentum of uptake continues, GRADE will do more than achieve the worthy and important goal of standardizing systems of grading quality of evidence and recommendations for clinical practice. GRADE may facilitate the evolution toward a world in which expert recommendations for front-line clinicians uniformly adhere to principles of evidence management and guideline development that flow from the intellectual movement we call evidence-based medicine.

References

1 Schunemann HJ, et al. Letters, numbers, symbols and words: how to communicate grades of evidence and recommendations. CMAJ, 2003, 169(7): 677–680.

2 Atkins D, et al. Grading quality of evidence and strength of recommendations. BMJ, 2004, 328(7454): 1490.

3 Guyatt G, et al. Grading strength of recommendations and quality of evidence in clinical guidelines: report from an American College of Chest Physicians task force. Chest, 2006, 129(1): 174–181.

4 Schunemann HJ, et al. An official ATS statement: grading the quality of evidence and strength of recommendations in ATS guidelines and recommendations. Am J Respir Crit Care Med, 2006, 174(5): 605–614.

5 Warkentin TE, Greinacher A. Heparin-induced thrombocytopenia: recognition, treatment, and prevention: the Seventh ACCP Conference on Antithrombotic and Thrombolytic Therapy. Chest, 2004, 126(3 Suppl): 311S–337S.

6 Clagett GP, et al. Antithrombotic therapy in peripheral arterial occlusive disease: the Seventh ACCP Conference on Antithrombotic and Thrombolytic Therapy. Chest, 2004, 126(3 Suppl): 609S–626S.

7 Bhandari M, et al. Association between industry funding and statistically significant pro-industry findings in medical and surgical randomized trials. CMAJ, 2004, 170(4): 477–480.

8 Alonso-Coello P, et al. Meta-analysis of flavonoids for the treatment of haemorrhoids. Br J Surg, 2006, 93(8): 909–920.

9 Devereaux PJ, et al. A systematic review and meta-analysis of studies comparing mortality rates of private for-profit and private not-for-profit hospitals. CMAJ, 2002, 166(11): 1399–1406.

10 Charles C, Whelan T, Gafni A. What do we mean by partnership in making decisions about treatment? BMJ, 1999, 319(7212): 780–782.

11 van Walraven C, et al. Oral anticoagulants vs aspirin in nonvalvular atrial fibrillation: an individual patient meta-analysis. JAMA, 2002, 288(19): 2441–2448.

12 Devereaux PJ, Anderson DR, Gardner MJ, Putnam W, Flowerdew GJ, Brownell BF, Nagpal S, Cox JL. Differences between perspectives of physicians and patients on anticoagulation in patients with atrial fibrillation: observational study. BMJ, 2001, 323(7323): 1218–1222.

13 CAPRIE Steering Committee. A randomized, blinded trial of clopidogrel versus aspirin in patients at risk of ischemic events. Lancet, 1996, 348: 1329–1339.

Recognition of Grading Standards for Clinical Recommendations

Gordon Guyatt§§* MSc, M.D.
* Department of Medicine, McMaster University, Hamilton, Ontario, Canada
§§ Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

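The challenge of grading clinical practice guidelines

Over the past ten years, the quality of clinical practice guidelines has steadily improved thanks to the adoption of a few basic principles, such as grading recommendations and the quality of the underlying evidence on the basis of systematic reviews of the relevant evidence. The many systems for grading evidence and recommendations that have emerged as a result, however, contradict one another and cause confusion[1].

An international group of guideline developers, systematic reviewers, and clinical epidemiologists has taken on the very challenging task of helping to resolve the confusion among the different systems for grading evidence and recommendations. The group is broadly representative and includes many organizations, among them the Agency for Healthcare Research and Quality (AHRQ) in the USA, the National Institute for Clinical Excellence (NICE) for England and Wales, and the WHO. Establishing a new, unified grading system is nevertheless far from easy: although all existing systems have limitations, each cost its developers a great deal of time and effort, so they are unlikely to accept a new system readily.

In 2004 the GRADE working group first published the results of its work in the BMJ[2], and it subsequently published simpler, clinically oriented descriptions of the GRADE system[3,4]. GRADE strives to make its system easy to master and applicable to all kinds of clinical recommendations across every medical specialty and area of clinical care.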

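Quality of evidence

The GRADE system classifies recommendations into two levels, strong and weak, and quality of evidence into four levels: high, moderate, low and very low. Evidence based on RCTs is rated as high quality in GRADE's 4-level quality of evidence classification (see Table 1). Not all RCTs are of the same quality, however, and the following five kinds of problems lower the quality of their evidence (see Table 2).

Table 1 Quality of evidence and definitions

| Grade | Definition |
|---|---|
| High | Future research is very unlikely to change confidence in the current estimate of effect |
| Moderate | Future research may have an important impact on the current estimate of effect and may change confidence in the estimate |
| Low | Future research is very likely to have an important impact on the current estimate of effect and is likely to change confidence in the estimate |
| Very low | Any estimate of effect is very uncertain |

Table 2 Factors in deciding on confidence in estimates of benefits, risks, burden, and costs

Factors that may lower the quality of evidence from randomized controlled trials
1) Poor design and conduct of the available RCTs, suggesting a high likelihood of bias
2) Inconsistency of results (heterogeneity)
3) Indirect evidence
4) Insufficient evidence
5) Reporting bias (including publication bias)

Factors that raise the quality of evidence from observational studies
1) Very large effect size
2) All plausible confounders would weaken the effect in the intervention group
3) Dose-response relationship

1) If most of the evidence comes from RCTs with serious methodological problems, quality decreases. Such problems include lack of allocation concealment, lack of blinding, excessive loss to follow-up, and stopping a trial early. The recommendation to use danaparoid sodium for heparin-induced thrombocytopenia (HIT) complicated by thrombosis is a typical example of how lack of blinding affects the quality grade of RCT evidence. The randomized trial evidence for danaparoid sodium in HIT comes from an unblinded trial whose outcome was a subjective measure, the clinician's judgement of whether the thromboembolism had resolved; the ACCP guideline therefore rated the quality of the evidence as moderate rather than high[5].

2) The second reason for downgrading is inconsistency of results. When several RCTs of the same intervention yield clearly different estimates of effect (heterogeneity or variability of results), investigators should look for an explanation of the heterogeneity; for example, a drug may have relatively larger effects in sicker or in less sick populations. When investigators cannot give a plausible explanation for the heterogeneity, the strength of the recommendation is reduced even if the evidence comes from rigorous RCTs. For example, RCTs of pentoxifylline for intermittent claudication have produced conflicting results that remain unexplained. On this basis, the guideline panel rated the quality of the evidence as moderate rather than high[6].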

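3) Indirect comparison may also lower the quality of evidence. Evidence is indirect when different treatments have not been compared head to head. For example, when a drug benefit plan or formulary has to decide which bisphosphonate (such as alendronate or risedronate) is more suitable for preventing osteoporotic fractures, the only results available for the decision are indirect ones from comparisons of alendronate with placebo and of risedronate with placebo, rather than from trials directly comparing alendronate with risedronate. Study results are also only indirect evidence when the population differs (we are interested in patients with valvular atrial fibrillation, but all RCTs enrolled patients with non-valvular atrial fibrillation), when the intervention differs (we are interested in the effect of relatively low-dose angiotensin-converting enzyme inhibitors, but all studies used high doses), or when the outcome differs (we are interested in long-term effectiveness, but all studies had short follow-up).

One example of a difference in populations is avian flu caused by influenza A (H5N1) virus. Because of its high case fatality (approximately 33% to more than 50% of patients die), potential exposure to the virus has drawn attention to chemoprophylaxis. Pharmacological interventions include antiviral neuraminidase inhibitors such as oseltamivir, but the only studies of oseltamivir are in patients with seasonal influenza caused by a different influenza A virus, an entirely different patient population.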

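4) When the total sample size is small and outcome events are few, uncertainty in judging benefits and harms increases. For example, in a well-designed RCT of nadroparin (a low molecular weight heparin) for cerebral venous sinus thrombosis that strictly followed its protocol, 3 of 30 patients in the treatment group and 6 of 29 in the control group had a poor outcome. The investigators concluded that the relative risk reduction (RRR) for a poor outcome was 38%, but the result was not statistically significant. The GRADE group continues to debate where to set the thresholds for lowering the strength of inference: is the confidence interval too wide? How few outcome events are too few?

5) The final reason for downgrading the quality of evidence is a high likelihood of publication bias. The quality grade may also be lowered if investigators fail to report studies (typically those showing no effect) or outcomes (typically those showing harm or no observed effect), or if results go unpublished for other reasons. Even so, guideline panels still have to guess how likely publication bias is. A typical situation that should raise suspicion of publication bias is when the published evidence consists of a series of small trials all funded by industry[7]. For example, 14 trials of flavonoids for hemorrhoids appeared to show clear benefit but enrolled a total of only 1,432 patients[8]. The heavy industry sponsorship of these studies raises the question of whether unpublished trials showing no benefit exist.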

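Observational studies (such as cohort studies) are initially classified as low quality, but their evidence may be upgraded if the effect of the intervention is very large (for example, hip replacement for severe hip arthritis), if there is evidence of a dose-response relationship, or if all plausible confounders would weaken the effect in the intervention group (see Table 2). For example, a systematic review showed that patient mortality was higher in for-profit hospitals than in not-for-profit hospitals[9]. This result was obtained even though for-profit hospitals have more health resources and their patients are generally better off socio-economically and less severely ill. These potential confounders, if anything, favor for-profit hospitals. If such confounders are taken into account, the evidence that not-for-profit hospitals do better becomes stronger.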

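Strength of recommendations

As noted above, the GRADE system divides the strength of recommendations into two levels, strong and weak. A strong recommendation is warranted when an intervention's benefits clearly outweigh its harms, or its harms clearly outweigh its benefits. A weak recommendation applies when the balance of benefits and harms is uncertain, or when the evidence, whatever its quality, shows that benefits and harms are closely balanced.

Dividing recommendations into two levels makes them easy to apply in practice. Because a strong recommendation means that benefits clearly outweigh harms or harms clearly outweigh benefits, virtually all patients will make the same choice (for example, aspirin for acute myocardial infarction). In such cases physicians can recommend the treatment with confidence. When the recommendation is weak, different patients may choose different treatments; hormone replacement therapy for menopausal hot flashes is one example. In that case physicians must know the specific evidence and discuss it with their patients, or inquire in detail to make sure their recommendations match patients' values and preferences[10].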

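Determinants of strength of recommendation

Beyond the quality of the evidence, a number of other factors may influence whether a recommendation is strong or weak (see Table 3).

The choice between adjusted-dose warfarin and aspirin for preventing stroke in patients with atrial fibrillation illustrates several factors that influence the strength of a recommendation. A systematic review and meta-analysis found that warfarin, relative to aspirin, reduced the relative risk of all strokes by 46%[11]. Because warfarin is clearly more effective than aspirin, a strong recommendation is warranted. In addition, the 95% CI is relatively narrow (RRR 29% to 57%), which indicates that warfarin reduces relative risk by at least 29% and further supports a strong recommendation.

Warfarin, however, unavoidably requires patients to keep their vitamin K intake constant over the long term, to have their anticoagulation monitored with blood tests, and to accept an increased risk of minor or major bleeding. Most patients nevertheless fear stroke more than bleeding[12], so almost all patients at high risk of stroke choose warfarin, which also indicates that a strong recommendation for warfarin is appropriate.

A final point to emphasize is the patient's baseline risk, sometimes called the control event rate, of the adverse event that treatment is designed to avoid. Consider a 65-year-old patient with atrial fibrillation and no risk factors for stroke, whose individual risk of stroke in the coming year is about 2%. Taking the relative risk reduction and the baseline risk together, one can derive the absolute effect (see Table 3). Adjusted-dose warfarin, compared with aspirin, lowers absolute risk by about 1%. Some patients who are particularly worried about stroke may consider this reduction in absolute risk worth taking warfarin for. If all patients were equally averse to stroke, the narrow confidence interval that follows from the relative risk reduction would still support a strong recommendation for warfarin. Some patients, however, may consider the benefit of warfarin smaller than the risks and inconvenience it brings. When patients' values differ, fully informed patients are likely to make different choices, and the recommendation should then be weak.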

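Conclusion

The GRADE system is both methodologically rigorous and convenient for clinical use. It is neither too complex nor so simple as to be misleading. The GRADE group's goal was very clear: to establish an international standard system for evaluating evidence and grading recommendations. GRADE has been adopted rapidly by the international community since its first publication in the BMJ in 2004, and this hope may well come true.

The Cochrane Collaboration is gradually adopting the GRADE approach to rating methodological quality. The Endocrine Society was the first North American organization to adopt GRADE for grading its recommendations, and the American College of Chest Physicians has adopted a slightly modified version of GRADE. Other North American organizations, such as the prestigious American College of Physicians and American Thoracic Society, the Surviving Sepsis Campaign, and the Ontario Ministry of Health Medical Advisory Secretariat, have also adopted GRADE.

The most important event for the spread of GRADE in North America is that the highly regarded electronic medical text UpToDate has formally adopted GRADE as the basis for grading its recommendations. European organizations such as the European Society of Thoracic Surgery, the Agenzia sanitaria regionale in Bologna, Italy, and the German Agency for Quality in Medicine have also adopted GRADE, as have international groups such as Kidney Disease: Improving Global Outcomes, the Surviving Sepsis Campaign, and the Guidelines International Network. The BMJ encourages authors to use GRADE when submitting clinical guidelines.

If demand continues, GRADE will do more than establish a standard system for grading the quality of evidence and recommendations for clinical practice. It will also promote a worldwide change in which "expert" recommendations for front-line clinicians uniformly follow the principles of evidence management and guideline development that derive from evidence-based medicine.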

Table 3 Factors in deciding whether a recommendation is strong or weak

| Factor | Example |
|---|---|
| Whether the methodological quality of the evidence is sufficient to support estimates of benefit, risk, inconvenience, and cost | Many high quality randomized trials have shown that inhaled steroids are effective in asthma, while only case series have evaluated pleurodesis for pneumothorax |
| Importance of the outcome that treatment prevents | Different outcomes of thrombolytic therapy: post-phlebitic syndrome in patients with DVT versus death in patients with PE |
| Magnitude of treatment effect | Clopidogrel reduces stroke risk in patients with TIA only slightly more than aspirin (RRR* 8.7%), whereas anticoagulation greatly reduces stroke risk in AF compared with placebo (RRR 68%) |
| Precision of the estimate of treatment effect | For stroke prevention with aspirin versus placebo, the confidence interval is wider in patients with AF than in patients with TIA |
| Risks associated with therapy | In patients with coronary syndromes, aspirin plus clopidogrel carries a higher bleeding risk than aspirin alone |
| Burden of therapy | Adjusted-dose warfarin carries a higher treatment burden than aspirin: it requires long-term monitoring of coagulation and a constant vitamin K intake |
| Risk of the target event | Surgical patients' risk of post-operative DVT and PE ranges from low to high |
| Costs | Clopidogrel costs far more than aspirin in patients with TIA |
| Varying values | Most young, healthy people place a high value on prolonging their lives (even at the cost of suffering), while the elderly and frail are likely to differ in the value they place on prolonging life (and in the suffering they are prepared to accept to do so) |

*Relative risk reduction

Translated by Cai Yujia (蔡羽嘉); reviewed by Diao Xiang (刁骧) and Li Youping (李幼平)

Received: 2006-12-20. Revised: 2007-02-08. Editor: Cai Yujia (蔡羽嘉)