studies of diagnostic tests thomas b. newman, md, mph october 11, 2007

Studies of Diagnostic Tests

Thomas B. Newman, MD, MPH

October 11, 2007

Overview

Common biases of studies of diagnostic test accuracy
– Incorporation bias
– Verification bias
– Double gold standard bias
– Spectrum bias
Prevalence, spectrum and nonindependence
Meta-analysis of diagnostic tests
Checklist & Systematic approach
Examples:
– Visual assessment of jaundice
– Physical examination for presentation

Studies of Diagnostic Tests: Incorporation Bias

Example: BNP study (Chapter 4, Problem 3)
Gold standard: determination of Congestive Heart Failure (CHF) by two cardiologists
Blinded to BNP but not to Chest X-ray
Chest X-ray found to be highly predictive of CHF
Incorporation bias for assessment of Chest X-ray, not BNP

*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.

Verification Bias*

Inclusion criterion: gold standard was applied
Subjects with positive index tests are more likely to be referred for the gold standard.
Example: V/Q Scan as a test for pulmonary embolism (PE; blood clot in lungs).
– Gold standard is a pulmonary arteriogram (PA-gram)
– Retrospective study of patients receiving arteriograms to rule out PE
– Patients with negative V/Q scans less likely to be referred for PA-gram

*AKA Work-up, Referral Bias, or Ascertainment Bias

Verification Bias

              PA-gram +   PA-gram -
V/Q Scan +        a           b
V/Q Scan -        c           d

Sensitivity (a/(a+c)) is biased UP.

Specificity (d/(b+d)) is biased DOWN.
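The direction of these two biases can be checked with a small numeric sketch. Every number below is hypothetical (a made-up cohort and made-up referral fractions, not data from any V/Q scan study); the point is only that when scan-positive patients are verified more often than scan-negative patients, cells a and b are over-represented relative to c and d.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity) for a 2x2 table.

    a = test+/disease+, b = test+/disease-,
    c = test-/disease+, d = test-/disease-.
    """
    return a / (a + c), d / (b + d)


# Hypothetical cohort in which EVERY patient received the gold standard.
a, b, c, d = 80, 100, 20, 800

# Hypothetical referral pattern: 90% of test-positive patients but only
# 10% of test-negative patients go on to get the gold standard.
p_verify_pos = 0.9
p_verify_neg = 0.1

true_sens, true_spec = sens_spec(a, b, c, d)

# Expected cell counts among verified patients only (what the study sees).
obs_sens, obs_spec = sens_spec(
    a * p_verify_pos, b * p_verify_pos,
    c * p_verify_neg, d * p_verify_neg,
)

print(f"Full cohort:   sens={true_sens:.2f}, spec={true_spec:.2f}")
print(f"Verified only: sens={obs_sens:.2f}, spec={obs_spec:.2f}")
```

With these made-up inputs the verified-only sensitivity rises and the specificity falls, matching the directions stated above.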

Double Gold Standard Bias

One gold standard (e.g. biopsy) is applied in patients with positive index test, another gold standard (e.g., clinical follow-up) is applied in patients with a negative index test.

Studies of Diagnostic Tests: Double Gold Standard

Test: V/Q Scan
Disease: PE
Gold Standard: PA-gram in patients who had one, clinical follow-up in patients who didn’t
Study Population: All patients presenting to the ED who received a V/Q scan.
Assume some patients did not get PA-gram because of normal/low probability V/Q scans but would have had positive PA-grams. Instead they had negative clinical follow-up and were counted as true negatives. If they had had PA-grams, they would have been counted as false negatives.

*PIOPED. JAMA 1990;263(20):2753-9.

Studies of Diagnostic Tests: Double Gold Standard

              PA-Gram +   PA-Gram -
V/Q Scan +        a           b
V/Q Scan -        c           d

Sensitivity (a/(a+c)) biased UP
Specificity (d/(b+d)) biased UP
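A minimal sketch of why both measures move up: patients who would have been false negatives (cell c) under a single gold standard are instead counted as true negatives (cell d). The counts and the number of reclassified patients are hypothetical and chosen only for illustration.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity) for a 2x2 table.

    a = test+/disease+, b = test+/disease-,
    c = test-/disease+, d = test-/disease-.
    """
    return a / (a + c), d / (b + d)


# Hypothetical counts if every patient had received a PA-gram.
a, b, c, d = 80, 100, 20, 800

# Hypothetical number of patients with a negative scan who had PE but,
# having only uneventful clinical follow-up, were counted as disease-free.
reclassified = 10

single_sens, single_spec = sens_spec(a, b, c, d)
double_sens, double_spec = sens_spec(a, b, c - reclassified, d + reclassified)

print(f"Single gold standard: sens={single_sens:.3f}, spec={single_spec:.3f}")
print(f"Double gold standard: sens={double_sens:.3f}, spec={double_spec:.3f}")
```

Removing patients from c shrinks the denominator of sensitivity, and adding them to d raises the numerator of specificity by more, proportionally, than its denominator, so both estimates increase.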

Double Gold Standard Bias: Ultrasound diagnosis of intussusception

          Intussusception   No Intussusception
U/S +           37                   7
U/S -            3                 104
Total           40                 111

Sens = 37/40 = 93%
Spec = 104/111 = 94%



Spectrum of Disease, Nondisease and Test Results

Disease is often easier to diagnose if severe

“Nondisease” is easier to diagnose if patient is well than if the patient has other diseases

Test results will be more reproducible if ambiguous results excluded

Spectrum Bias

Sensitivity depends on the spectrum of disease in the population being tested.

Specificity depends on the spectrum of non-disease in the population being tested.

Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality

Spectrum Bias Example: Absence of Nasal Bone as a Test for Chromosomal Abnormality

Sensitivity = 229/333 = 69%
BUT the D+ group only included fetuses with Trisomy 21

Nasal Bone Absent     D+      D-     LR
Yes                  229     129    27.8
No                   104    5094    0.32
Total                333    5223

D+ group excluded 295 fetuses with other chromosomal abnormalities (esp. Trisomy 18)

Among these fetuses, sensitivity 32% (not 69%)

What decision is this test supposed to help with?
– If it is whether to do CVS or amnio, these 295 fetuses should be included!

Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality

Sensitivity = 324/628 = 52%
NOT the 69% obtained when the D+ group only included fetuses with Trisomy 21

Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in D+ group

Nasal Bone Absent     D+                   D-     LR
Yes                  229 + 95 = 324       129    20.9
No                   104 + 200 = 304     5094    0.50
Total                333 + 295 = 628     5223
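The likelihood ratios in the two nasal bone tables can be recomputed directly from the cell counts. This is a sketch using the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity; the function and variable names are illustrative only.

```python
def likelihood_ratios(tp, fp, fn, tn):
    """Return (LR+, LR-) from the four cells of a 2x2 table."""
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


# D+ group limited to Trisomy 21 (first table).
t21_only = likelihood_ratios(tp=229, fp=129, fn=104, tn=5094)

# D+ group including the 295 fetuses with other abnormalities (second table).
all_abnormal = likelihood_ratios(tp=324, fp=129, fn=304, tn=5094)

print(f"Trisomy 21 only:   LR+ = {t21_only[0]:.1f}, LR- = {t21_only[1]:.2f}")
print(f"All abnormalities: LR+ = {all_abnormal[0]:.1f}, LR- = {all_abnormal[1]:.2f}")
```

The D- column is unchanged between the two tables, so the drop in LR+ and the rise in LR- come entirely from the lower sensitivity in the broader D+ group. Recomputed from the counts, the second LR+ comes out at about 20.9.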

Quiz: What if we considered the nasal bone absence as a test for Trisomy 21?

Then instead of excluding subjects with other chromosomal abnormalities or including them as D+, we should count them as D-

What would happen to sensitivity? What would happen to specificity?

Prevalence, spectrum and nonindependence

Prevalence (prior probability) of disease may be related to disease severity

One mechanism is different spectra of disease or nondisease

Another is that whatever is causing the high prior is related to the same aspect of the disease as the test

Prevalence, spectrum and nonindependence

Examples
– Iron deficiency
– Diseases identified by screening
– UA for UTI

              Sensitivity   Specificity   LR+    LR-
High Prior        92%           42%       1.6    0.19
Low Prior         56%           78%       2.5    0.56
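To see what these differing likelihood ratios would mean in practice, a prior probability can be converted to a post-test probability through odds. The sketch below uses the LR+ values from the table, but the two prior probabilities (30% and 5%) are hypothetical placeholders, not figures from the study behind the table.

```python
def post_test_probability(prior, lr):
    """Apply a likelihood ratio to a prior probability via odds."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr
    return post_odds / (1 + post_odds)


# LR+ values from the table; the priors are HYPOTHETICAL illustrations.
high_prior_post = post_test_probability(prior=0.30, lr=1.6)
low_prior_post = post_test_probability(prior=0.05, lr=2.5)

print(f"High prior group, positive test: {high_prior_post:.2f}")
print(f"Low prior group, positive test:  {low_prior_post:.2f}")
```

Applying a single pooled LR to both groups would give different answers, which is the practical consequence of test accuracy not being independent of the prior.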

Meta-analyses of Diagnostic Tests

Systematic and reproducible approach to finding studies

Summary of results of each study
Investigation into heterogeneity
Summary estimate of results, if appropriate

MRI for the diagnosis of MS

Whiting et al. BMJ 2006;332:875-84

Studies of Diagnostic Test Accuracy: Checklist

Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?

Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?

Was the reference standard applied regardless of the diagnostic test result?

Was the test (or cluster of tests) validated in a second, independent group of patients?

From Sackett et al., Evidence-based Medicine, 2nd ed. (NY: Churchill Livingstone), 2000. p 68

Systematic Approach

Authors and funding source
Research question
– Relevance?
– What decision is the test supposed to help you make?
Study design
– Timing of measurements of predictor and outcome
– Cross-sectional vs “case-control” sampling

Systematic Approach, cont’d

Study subjects
– Diseased subjects representative?
– Nondiseased subjects representative?
– If not, in what direction will results be affected?
Predictor variable
– How was the test done?
– Is it difficult?
– Will it be done as well in your setting?

Systematic Approach, cont’d

Outcome variable
– Is the “Gold Standard” really gold?
– Were those measuring it blinded to results of the index test?
Results & Analysis
– Were all subjects analyzed?
– If predictive value was reported, is prevalence similar to your population?
– Would clinical implications change depending on location of true result within confidence intervals?
Conclusions
– Do they go beyond data?
– Do they apply to patients in your setting?

Should every newborn have a bilirubin test before discharge?

About 60% of newborns develop some jaundice

Usually it is harmless
Current practice: Check bilirubin level if jaundice appears significant
Proposal: check it on all newborns

Kernicterus Public Information Campaign Draft Posters

Advancement of Dermal Icterus in the Jaundiced Newborn

Kramer LI, AJDC 1969;118:454

Accuracy of Clinical Judgment in Neonatal Jaundice*

RQ: How well can clinicians estimate bilirubin levels in jaundiced newborns?
Study Design: cross-sectional study
Subjects: 122 healthy term newborns (mean age 2 days) whose total serum bilirubin (TSB) was measured in the course of standard newborn care

*Moyer et al., Archives Peds Adol Med 2000; 154:391

Accuracy of Clinical Judgment in Neonatal Jaundice*

Measurements:
– Jaundice assessed by attendings, nurse practitioners and pediatric residents (absent/slight/obvious) at each body part, and Total Serum Bilirubin (TSB) estimated
– TSB levels measured in clinical laboratory
Analysis
– Agreement for jaundice at each body part by weighted kappa
– Sensitivity and specificity for TSB ≥ 12 mg/dL

*Moyer et al., Archives Peds Adol Med 2000; 154:391

Results: 1.

Moyer et al., APAM 2000; 154:391

Results: 2

Moyer et al., APAM 2000; 154:391

Sensitivity of jaundice below the nipple line for TSB ≥ 12 mg/dL = 97%

Specificity = 19%

Editor’s Note: The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication.

--Catherine D. DeAngelis, MD

Issues: 1

No information on the numbers of different types of examiners or their years of experience
– Generalizability uncertain
No CI around sensitivity and specificity
– Sensitivity based upon 67/69
– 95% CI: 90% to 99.6%
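The interval quoted for 67/69 can be approximately reproduced with an exact (Clopper-Pearson) binomial confidence interval. That this was the method used for the slide is an assumption. The sketch below uses only the standard library and finds each limit by bisection on the binomial tail probability; the helper names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_increasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Find p in [lo, hi] where the increasing function f(p) equals target."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    # Lower limit: the p at which P(X >= k) equals alpha/2.
    lower = 0.0 if k == 0 else bisect_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha/2,
    # i.e. P(X > k) equals 1 - alpha/2.
    upper = 1.0 if k == n else bisect_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


lower, upper = exact_ci(67, 69)
print(f"Sensitivity 67/69 = {67 / 69:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

An exact interval is a reasonable choice here because the proportion is close to 100% and the sample is small, a situation in which the usual normal approximation can extend past 100%.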

Issues: 2

Verification bias (Type 1)
– Infants NOT jaundiced below the nipples not likely to have a TSB measured
– Sensitivity too high, specificity too low
– What effect on NPV?

                             TSB ≥ 12   TSB < 12
Jaundice below nipples           a          b
No jaundice below nipples        c          d

Issues for universal screening

How often would the bilirubin test alter management?
How often would this affect outcomes?

– None of the bilirubin levels in the study was dangerously high

What other effects might universal bilirubin screening have?

CDC Posters

E-mail from a parent - 1

To: <[email protected]>
Subject: my hyperbili son
Date: Thu, 11 Aug 2005

Dear Dr Newman, I would like your input as to the prognosis with my son. He had a neonatal jaundice that was horribly mismanaged and I am now a hysterical mom.... My son was born [Wednesday] 4/13/2005 at 10am...On Sat night we had him tested, at 8pm TBR was 24, Coombs test positive. He was admitted under double lights and his TBR was 16 on Sun morn...

E-mail from a parent - 2

He was breast fed throughout and had a strong suck. He is now 4 months old and milestones seem within developmental norms. Hearing seems ok. I am sleepless, hysterical and depressed. How concerned for the future do I have to be? Please could you get back to me asap. Thanking you, Tracey P

Diagnostic Accuracy of Clinical Examination for Detection of Non-cephalic Presentation in Late Pregnancy*

RQ: (above)
– Important to know presentation before onset of labor to know whether to try external version
Study design: Cross-sectional study
Subjects:
– 1633 women with singleton pregnancies at 35-37 weeks at antenatal clinics at a Women’s and Babies Hospital in Australia
– 96% of those eligible for the study consented

*BMJ 2006;333:578-80

Diagnostic Accuracy of Clinical Examination for Detection of Non-cephalic Presentation in Late Pregnancy*

Predictor variable
– Clinical examination by one of more than 60 clinicians
  • residents or registrars 55%
  • midwives 28%
  • obstetricians 17%
– Results classified as cephalic or noncephalic
Outcome variable: presentation by ultrasound, blinded to clinical examination

*BMJ 2006;333:578-80

Diagnostic Accuracy of Clinical Examination for Detection of Non-cephalic Presentation in Late Pregnancy*

Results

No significant differences in accuracy by experience level

Conclusions: clinical examination is not sensitive enough

*BMJ 2006;333:578-80

            D+      D-    Total
Test +      91      74     165     PV+ 55%
Test -      39    1429    1468     PV- 97%
Total      130    1503    1633

           Sens      Spec
           70%       95%
95% CI    62-78%    94-96%
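The figures in this table can be recomputed from the four cell counts. The sketch below uses a normal-approximation (Wald) confidence interval, which is an assumption about how the reported intervals were obtained; it does reproduce them to the nearest percent. The function name is illustrative.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using the normal approximation."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Cell counts from the table: D+ means non-cephalic presentation on ultrasound.
tp, fp, fn, tn = 91, 74, 39, 1429

sens = proportion_with_wald_ci(tp, tp + fn)   # 91 / 130
spec = proportion_with_wald_ci(tn, fp + tn)   # 1429 / 1503
ppv = proportion_with_wald_ci(tp, tp + fp)    # 91 / 165
npv = proportion_with_wald_ci(tn, fn + tn)    # 1429 / 1468

for name, (est, lo, hi) in [("Sens", sens), ("Spec", spec),
                            ("PV+", ppv), ("PV-", npv)]:
    print(f"{name}: {est:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

The predictive values depend on the 8% prevalence (130/1633) in this clinic population, so they carry over to another setting only if prevalence there is similar.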
