Studies of Diagnostic Tests
Thomas B. Newman, MD, MPH
October 11, 2007
Overview
Common biases of studies of diagnostic test accuracy
– Incorporation bias
– Verification bias
– Double gold standard bias
– Spectrum bias
Prevalence, spectrum and nonindependence
Meta-analysis of diagnostic tests
Checklist & systematic approach
Examples:
– Visual assessment of jaundice
– Physical examination for presentation
Studies of Diagnostic Tests: Incorporation Bias
Example: BNP study (Chapter 4, Problem 3)
Gold standard: determination of congestive heart failure (CHF) by two cardiologists
Blinded to BNP but not to chest X-ray
Chest X-ray found to be highly predictive of CHF
Incorporation bias for assessment of chest X-ray, not BNP
*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.
Verification Bias*
Inclusion criterion: gold standard was applied
Subjects with positive index tests are more likely to be referred for the gold standard.
Example: V/Q scan as a test for pulmonary embolism (PE; blood clot in lungs).
– Gold standard is a pulmonary arteriogram (PA-gram)
– Retrospective study of patients receiving arteriograms to rule out PE
– Patients with negative V/Q scans less likely to be referred for PA-gram
*AKA Work-up, Referral Bias, or Ascertainment Bias
Verification Bias

              PA-gram +   PA-gram -
V/Q Scan +        a           b
V/Q Scan -        c           d

Sensitivity (a/(a+c)) is biased UP.
Specificity (d/(b+d)) is biased DOWN.
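The direction of both biases can be checked with a small numerical sketch. All counts and verification fractions below are hypothetical, chosen only for illustration; they are not from the V/Q scan literature.

```python
def sens_spec(a, b, c, d):
    """Sensitivity and specificity from 2x2 cells (a=TP, b=FP, c=FN, d=TN)."""
    return a / (a + c), d / (b + d)

# Hypothetical cohort if EVERY patient had received the PA-gram.
full = dict(a=80, b=40, c=20, d=160)

# Hypothetical referral pattern: every V/Q-positive patient gets a PA-gram,
# but only 25% of V/Q-negative patients do.
verify_pos, verify_neg = 1.0, 0.25
observed = dict(
    a=full["a"] * verify_pos,
    b=full["b"] * verify_pos,
    c=full["c"] * verify_neg,
    d=full["d"] * verify_neg,
)

true_sens, true_spec = sens_spec(**full)
obs_sens, obs_spec = sens_spec(**observed)

print(f"sensitivity: true {true_sens:.2f}, observed {obs_sens:.2f}")
print(f"specificity: true {true_spec:.2f}, observed {obs_spec:.2f}")
```

Because only the V/Q-negative row (cells c and d) is thinned out, the false negatives shrink relative to the true positives and the true negatives shrink relative to the false positives.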
Double Gold Standard Bias
One gold standard (e.g. biopsy) is applied in patients with positive index test, another gold standard (e.g., clinical follow-up) is applied in patients with a negative index test.
Studies of Diagnostic Tests: Double Gold Standard*
Test: V/Q scan
Disease: PE
Gold standard: PA-gram in patients who had one, clinical follow-up in patients who didn’t
Study population: all patients presenting to the ED who received a V/Q scan.
Assume some patients did not get a PA-gram because of normal/low-probability V/Q scans but would have had positive PA-grams. Instead they had negative clinical follow-up and were counted as true negatives. If they had had PA-grams, they would have been counted as false negatives.
*PIOPED. JAMA 1990;263(20):2753-9.
Studies of Diagnostic Tests: Double Gold Standard

              PA-gram +   PA-gram -
V/Q Scan +        a           b
V/Q Scan -        c           d

Sensitivity (a/(a+c)) biased UP
Specificity (d/(b+d)) biased UP
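A minimal sketch of why both measures move up: patients who would have been false negatives on PA-gram are instead counted as true negatives after uneventful follow-up, so they leave cell c and enter cell d. The counts are hypothetical and are not taken from PIOPED.

```python
def sens_spec(a, b, c, d):
    """Sensitivity and specificity from 2x2 cells (a=TP, b=FP, c=FN, d=TN)."""
    return a / (a + c), d / (b + d)

# Hypothetical counts if every patient had received a PA-gram.
a, b, c, d = 80, 40, 20, 160

# Hypothetical: 10 of the 20 V/Q-negative patients with PE never had a PA-gram,
# had negative clinical follow-up, and were counted as true negatives.
moved = 10

true_sens, true_spec = sens_spec(a, b, c, d)
biased_sens, biased_spec = sens_spec(a, b, c - moved, d + moved)

print(f"sensitivity: {true_sens:.3f} -> {biased_sens:.3f}")
print(f"specificity: {true_spec:.3f} -> {biased_spec:.3f}")
```

Unlike simple verification bias, no patients are dropped here; they are reclassified, which is why specificity rises rather than falls.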
Double Gold Standard Bias: Ultrasound diagnosis of intussusception
         Intussusception   No Intussusception
U/S +          37                   7
U/S -           3                 104
Total          40                 111

Sens = 37/40 = 93%
Spec = 104/111 = 94%
Spectrum of Disease, Nondisease and Test Results
Disease is often easier to diagnose if severe
“Nondisease” is easier to diagnose if patient is well than if the patient has other diseases
Test results will be more reproducible if ambiguous results are excluded
Spectrum Bias
Sensitivity depends on the spectrum of disease in the population being tested.
Specificity depends on the spectrum of non-disease in the population being tested.
Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality
Spectrum Bias Example: Absence of Nasal Bone as a Test for Chromosomal Abnormality
Sensitivity = 229/333 = 69%
BUT the D+ group only included fetuses with Trisomy 21

Nasal Bone Absent    D+     D-     LR
Yes                 229    129    27.8
No                  104   5094    0.32
Total               333   5223
D+ group excluded 295 fetuses with other chromosomal abnormalities (esp. Trisomy 18)
Among these fetuses, sensitivity 32% (not 69%)
What decision is this test supposed to help with?
– If it is whether to do CVS or amnio, these 295 fetuses should be included!
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
Sensitivity = 324/628 = 52%
NOT the 69% obtained when the D+ group only included fetuses with Trisomy 21
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in D+ group

Nasal Bone Absent    D+                   D-     LR
Yes                  229 + 95 = 324      129    20.9
No                   104 + 200 = 304    5094    0.50
Total                333 + 295 = 628    5223
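The sensitivity and likelihood ratios for both versions of the D+ group can be recomputed directly from the cell counts on these slides. This is only an arithmetic sketch; the helper name `summarize` is an illustrative choice.

```python
def summarize(tp, fp, fn, tn):
    """Sensitivity, specificity and likelihood ratios from 2x2 counts."""
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    return {
        "sens": sens,
        "spec": spec,
        "lr_pos": sens / (1 - spec),   # LR for nasal bone absent
        "lr_neg": (1 - sens) / spec,   # LR for nasal bone present
    }

# D+ = Trisomy 21 only (counts from the slide).
t21_only = summarize(tp=229, fp=129, fn=104, tn=5094)

# D+ = Trisomy 21 plus the 295 other chromosomal abnormalities
# (95 with absent nasal bone, 200 with nasal bone present).
all_abnormal = summarize(tp=229 + 95, fp=129, fn=104 + 200, tn=5094)

for label, r in [("Trisomy 21 only", t21_only), ("All abnormalities", all_abnormal)]:
    print(f"{label}: sens={r['sens']:.0%}, LR+={r['lr_pos']:.1f}, LR-={r['lr_neg']:.2f}")
```

Specificity is the same in both versions because the D- column does not change; only the spectrum of disease in the D+ column does.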
Quiz: What if we considered the nasal bone absence as a test for Trisomy 21?
Then instead of excluding subjects with other chromosomal abnormalities or including them as D+, we should count them as D-
What would happen to sensitivity? What would happen to specificity?
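One way to work the quiz numerically is to move the 295 fetuses with other chromosomal abnormalities into the D- column, using the counts reported on the earlier slides (95 with absent nasal bone, 200 with nasal bone present). This is a sketch of the arithmetic only.

```python
# Trisomy 21 as the target condition (counts from the slides).
tp, fn = 229, 104            # Trisomy 21: nasal bone absent / present
fp, tn = 129, 5094           # chromosomally normal: absent / present

# Other chromosomal abnormalities, now counted as D-.
other_absent, other_present = 95, 200

sens = tp / (tp + fn)                          # D+ column is unchanged
spec_before = tn / (fp + tn)                   # other abnormalities excluded
spec_after = (tn + other_present) / (fp + other_absent + tn + other_present)

print(f"sensitivity: {sens:.1%} (unchanged)")
print(f"specificity: {spec_before:.1%} -> {spec_after:.1%}")
```

Sensitivity stays the same because the Trisomy 21 group is untouched, while specificity drops because the added D- fetuses have an absent nasal bone far more often than chromosomally normal fetuses do.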
Prevalence, spectrum and nonindependence
Prevalence (prior probability) of disease may be related to disease severity
One mechanism is different spectra of disease or nondisease
Another is that whatever is causing the high prior is related to the same aspect of the disease as the test
Prevalence, spectrum and nonindependence
Examples
– Iron deficiency
– Diseases identified by screening
– UA for UTI

              Sensitivity   Specificity   LR+    LR-
High Prior        92%           42%       1.6    0.19
Low Prior         56%           78%       2.5    0.56
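The likelihood ratios in this table follow from the usual definitions, LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. A short sketch that recomputes them from the sensitivities and specificities shown:

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) for a dichotomous test."""
    return sens / (1 - spec), (1 - sens) / spec

high_prior = likelihood_ratios(sens=0.92, spec=0.42)
low_prior = likelihood_ratios(sens=0.56, spec=0.78)

print(f"High prior: LR+ = {high_prior[0]:.1f}, LR- = {high_prior[1]:.2f}")
print(f"Low prior:  LR+ = {low_prior[0]:.1f}, LR- = {low_prior[1]:.2f}")
```

The same test yields different likelihood ratios in the two groups, which is the nonindependence between prior probability and test performance that the slide illustrates.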
Meta-analyses of Diagnostic Tests
Systematic and reproducible approach to finding studies
Summary of results of each study
Investigation into heterogeneity
Summary estimate of results, if appropriate
MRI for the diagnosis of MS
Whiting et al. BMJ 2006;332:875-84
Studies of Diagnostic Test Accuracy: Checklist
Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?
Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
Was the reference standard applied regardless of the diagnostic test result?
Was the test (or cluster of tests) validated in a second, independent group of patients?
From Sackett et al., Evidence-based Medicine, 2nd ed. (NY: Churchill Livingstone), 2000. p 68
Systematic Approach
Authors and funding source
Research question
– Relevance?
– What decision is the test supposed to help you make?
Study design
– Timing of measurements of predictor and outcome
– Cross-sectional vs “case-control” sampling
Systematic Approach, cont’d
Study subjects
– Diseased subjects representative?
– Nondiseased subjects representative?
– If not, in what direction will results be affected?
Predictor variable
– How was the test done?
– Is it difficult?
– Will it be done as well in your setting?
Systematic Approach, cont’d
Outcome variable
– Is the “Gold Standard” really gold?
– Were those measuring it blinded to results of the index test?
Results & Analysis
– Were all subjects analyzed?
– If predictive value was reported, is prevalence similar to your population?
– Would clinical implications change depending on location of true result within confidence intervals?
Conclusions
– Do they go beyond data?
– Do they apply to patients in your setting?
Should every newborn have a bilirubin test before discharge?
About 60% of newborns develop some jaundice
Usually it is harmless
Current practice: check bilirubin level if jaundice appears significant
Proposal: check it on all newborns
Kernicterus Public Information Campaign Draft Posters
Advancement of Dermal Icterus in the Jaundiced Newborn
Kramer LI, AJDC 1969;118:454
Accuracy of Clinical Judgment in Neonatal Jaundice*
RQ: How well can clinicians estimate bilirubin levels in jaundiced newborns?
Study design: cross-sectional study
Subjects: 122 healthy term newborns (mean age 2 days) whose total serum bilirubin (TSB) was measured in the course of standard newborn care
*Moyer et al., Archives Peds Adol Med 2000; 154:391
Accuracy of Clinical Judgment in Neonatal Jaundice*
Measurements:
– Jaundice assessed by attendings, nurse practitioners and pediatric residents (absent/slight/obvious) at each body part, and total serum bilirubin (TSB) estimated
– TSB levels measured in clinical laboratory
Analysis
– Agreement for jaundice at each body part by weighted kappa
– Sensitivity and specificity for TSB ≥ 12 mg/dL
*Moyer et al., Archives Peds Adol Med 2000; 154:391
Results: 1.
Moyer et al., APAM 2000; 154:391
Results: 2
Moyer et al., APAM 2000; 154:391
Sensitivity of jaundice below the nipple line for TSB ≥ 12 mg/dL = 97%
Specificity = 19%
Editor’s Note: The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication.
--Catherine D. DeAngelis, MD
Issues: 1
No information on the numbers of different types of examiners or their years of experience
– Generalizability uncertain
No CI around sensitivity and specificity
– Sensitivity based upon 67/69
– 95% CI: 90% to 99.6%
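An interval like this can be approximated with an exact (Clopper-Pearson) binomial confidence interval. The slide does not say which method was used, so the choice of the exact method here is an assumption; the sketch uses only the standard library and solves for the limits by bisection.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_increasing(f, lo=0.0, hi=1.0, iters=100):
    """Bisection root of a function f that is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes out of n trials."""
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper

lower, upper = exact_ci(67, 69)
print(f"sensitivity = {67/69:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```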
Issues: 2
Verification bias (Type 1)
– Infants NOT jaundiced below the nipples not likely to have a TSB measured
– Sensitivity too high, specificity too low
– What effect on NPV?

                             TSB ≥ 12   TSB < 12
Jaundice below nipples           a          b
No jaundice below nipples        c          d
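The NPV question can be explored with a small sketch. It assumes that whether a TSB is measured depends only on the examination result, so cells c and d are thinned by the same fraction; the counts and the 25% verification fraction are hypothetical and are not from the Moyer study.

```python
def npv(c, d):
    """Negative predictive value: d / (c + d)."""
    return d / (c + d)

def sensitivity(a, c):
    return a / (a + c)

# Hypothetical counts if every infant had a TSB measured.
a, b, c, d = 80, 40, 20, 160

# Hypothetical: only 25% of infants without jaundice below the nipples
# have a TSB measured, regardless of their true bilirubin level.
frac_verified = 0.25
c_obs, d_obs = c * frac_verified, d * frac_verified

print(f"sensitivity: {sensitivity(a, c):.2f} -> {sensitivity(a, c_obs):.2f}")
print(f"NPV:         {npv(c, d):.3f} -> {npv(c_obs, d_obs):.3f}")
```

Under that assumption the ratio of c to d is preserved, so NPV is unchanged even though sensitivity is inflated. If unverified infants differ systematically from verified ones in their true bilirubin levels, NPV can be affected as well.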
Issues for universal screening
How often would the bilirubin test alter management?
How often would this affect outcomes?
– None of the bilirubin levels in the study was dangerously high
What other effects might universal bilirubin screening have?
CDC Posters
E-mail from a parent - 1
To: <[email protected]>
Subject: my hyperbili son
Date: Thu, 11 Aug 2005
Dear Dr Newman, I would like your input as to the prognosis with my son. He had a neonatal jaundice that was horribly mismanaged and I am now a hysterical mom.... My son was born [Wednesday] 4/13/2005 at 10am...On Sat night we had him tested, at 8pm TBR was 24, Coombs test positive. He was admitted under double lights and his TBR was 16 on Sun morn...
E-mail from a parent - 2
He was breast fed throughout and had a strong suck. He is now 4 months old and milestones seem within developmental norms. Hearing seems ok. I am sleepless, hysterical and depressed. How concerned for the future do I have to be? Please could you get back to me asap. Thanking you, Tracey P
Diagnostic Accuracy of Clinical Examination for Detection of Non-cephalic Presentation in Late Pregnancy*
RQ: (above)
– Important to know presentation before onset of labor to know whether to try external version
Study design: cross-sectional study
Subjects:
– 1633 women with singleton pregnancies at 35-37 weeks at antenatal clinics at a Women’s and Babies Hospital in Australia
– 96% of those eligible for the study consented
*BMJ 2006;333:578-80
Diagnostic Accuracy of Clinical Examination for Detection of Non-cephalic Presentation in Late Pregnancy*
Predictor variable
– Clinical examination by one of more than 60 clinicians
  • residents or registrars 55%
  • midwives 28%
  • obstetricians 17%
– Results classified as cephalic or noncephalic
Outcome variable: presentation by ultrasound, blinded to clinical examination
*BMJ 2006;333:578-80
Diagnostic Accuracy of Clinical Examination for Detection of Non-cephalic Presentation in Late Pregnancy*
Results

           D+     D-    Total
Test +     91     74     165    PV+ 55%
Test -     39   1429    1468    PV- 97%
Total     130   1503    1633

           Sens      Spec
           70%       95%
95% CI    62-78%    94-96%

No significant differences in accuracy by experience level
Conclusions: clinical examination is not sensitive enough
*BMJ 2006;333:578-80
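The summary measures on this slide can be reproduced from the four cell counts. A short sketch, with variable names chosen here for illustration:

```python
# 2x2 counts from the slide: clinical examination vs ultrasound.
tp, fp = 91, 74      # Test +: D+ (non-cephalic on ultrasound), D-
fn, tn = 39, 1429    # Test -: D+, D-

sens = tp / (tp + fn)     # 91 / 130
spec = tn / (fp + tn)     # 1429 / 1503
ppv = tp / (tp + fp)      # 91 / 165
npv = tn / (fn + tn)      # 1429 / 1468
prevalence = (tp + fn) / (tp + fp + fn + tn)

print(f"Sens {sens:.0%}, Spec {spec:.0%}, PV+ {ppv:.0%}, PV- {npv:.0%}")
print(f"Prevalence of non-cephalic presentation: {prevalence:.1%}")
```

Because this was a cross-sectional sample, the predictive values reflect the prevalence in the study population (130/1633) and would differ in a setting with a different prevalence.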