Comparing Diagnostic Accuracies of Two Tests in
Studies with Verification Bias
Marina Kondratovich, Ph.D.
Division of Biostatistics,
Center for Devices and Radiological Health,
U.S. Food and Drug Administration.
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.
September, 2005
Outline
Introduction: examples, diagnostic accuracy, verification bias
I. Ratio of true positive rates and ratio of false positive rates
II. Multiple imputation
III. Types of missingness in subsets
Summary
Comparison of two qualitative tests, T1 and T2, or combinations of them

           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D
                          (total N)

Examples:
• Cervical cancer: T1 - Pap test (categorical values); T2 - HPV test (qualitative test); reference method - colposcopy/biopsy.
• Prostate cancer: T1 - DRE (qualitative test); T2 - PSA (quantitative test with cutoff of 4 ng/mL); reference method - biopsy.
• Abnormal cells on a Pap slide: T1 - manual reading of a Pap slide; T2 - computer-aided reading of a Pap slide; reference method - reading of the slide by an Adjudication Committee.
Diagnostic Accuracy of a Medical Test

[Figure: ROC space, Se (y-axis) vs 1 - Sp (x-axis), with Test1 at the point (x1, y1); θ1 is the angle of the line from the origin to (x1, y1), θ2 the angle of the line from (x1, y1) to (1, 1).]

Pair (Sensitivity = TPR, Specificity = TNR) as the point (x1, y1), where
  x1 = FPR = 1 - Sp1
  y1 = TPR = Se1

Pair of likelihood ratios:
  PLR1 = Se1/(1 - Sp1) = y1/x1 = tangent of θ1 (slope of line); related to PPV
  NLR1 = (1 - Se1)/Sp1 = (1 - y1)/(1 - x1) = tangent of θ2 (slope of line); related to NPV
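The ROC-space point and the likelihood-ratio pair defined above can be computed directly. A minimal sketch in Python, using Se1 = 0.80 and Sp1 = 0.90 as hypothetical values:

```python
def roc_point(se, sp):
    """ROC-space coordinates: (x, y) = (FPR, TPR) = (1 - Sp, Se)."""
    return (1.0 - sp, se)

def likelihood_ratios(se, sp):
    """(PLR, NLR) = (Se / (1 - Sp), (1 - Se) / Sp)."""
    return (se / (1.0 - sp), (1.0 - se) / sp)

# Hypothetical Test1 with Se1 = 0.80, Sp1 = 0.90:
x1, y1 = roc_point(0.80, 0.90)
plr1, nlr1 = likelihood_ratios(0.80, 0.90)
# PLR1 = 0.80/0.10 = 8.0 is the slope of the line from the origin to (x1, y1);
# NLR1 = 0.20/0.90 is the slope of the line from (x1, y1) to (1, 1).
```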
Boolean Combinations "OR" and "AND" of T1 and a Random Test

[Figure: ROC space with Test1 at (x1, y1) and the line y - y1 = NLR1*(x - x1) = [(1 - y1)/(1 - x1)]*(x - x1) through (x1, y1) and (1, 1).]

Random Test: positive with probability α, negative with probability 1 - α.

Combination "T1 OR Random Test":
  Se_OR = Se1 + (1 - Se1)*α = y1 + (1 - y1)*α
  Sp_OR = Sp1*(1 - α) = (1 - x1)*(1 - α)

NLR(T1 OR Random Test) = (1 - y1)/(1 - x1) = NLR1, so the combination moves along this line as α varies.
Boolean Combinations "OR" and "AND" of T1 and a Random Test (cont.)

[Figure: ROC space with Test1 at (x1, y1) and the line y - y1 = PLR1*(x - x1) = (y1/x1)*(x - x1) through the origin and (x1, y1).]

Random Test: positive with probability α, negative with probability 1 - α.

Combination "T1 AND Random Test":
  Se_AND = Se1*α = y1*α
  Sp_AND = Sp1 + (1 - Sp1)*(1 - α) = (1 - x1) + x1*(1 - α)

PLR(T1 AND Random Test) = y1/x1 = PLR1, so the combination moves along this line as α varies.
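The two invariants above ("OR" with a random test preserves NLR; "AND" preserves PLR) can be checked numerically. A small sketch with hypothetical values Se1 = 0.7, Sp1 = 0.85, α = 0.3:

```python
def or_random(se1, sp1, alpha):
    se = se1 + (1 - se1) * alpha   # positive if T1 is + or the coin is +
    sp = sp1 * (1 - alpha)         # negative only if T1 is - and the coin is -
    return se, sp

def and_random(se1, sp1, alpha):
    se = se1 * alpha               # positive only if T1 is + and the coin is +
    sp = sp1 + (1 - sp1) * (1 - alpha)
    return se, sp

nlr = lambda se, sp: (1 - se) / sp
plr = lambda se, sp: se / (1 - sp)

se1, sp1, alpha = 0.7, 0.85, 0.3
se_or, sp_or = or_random(se1, sp1, alpha)
se_and, sp_and = and_random(se1, sp1, alpha)
assert abs(nlr(se_or, sp_or) - nlr(se1, sp1)) < 1e-9    # NLR preserved by OR
assert abs(plr(se_and, sp_and) - plr(se1, sp1)) < 1e-9  # PLR preserved by AND
```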
Comparing Medical Tests

[Figure: ROC space with Test1 at (x1, y1); the PLR1 line (through the origin and T1) and the NLR1 line (through T1 and (1, 1)) divide the space into four regions:]
• above both lines: PPV > PPV1 and NPV > NPV1
• above the NLR1 line only: PPV < PPV1, NPV > NPV1
• above the PLR1 line only: PPV > PPV1, NPV < NPV1
• below both lines: PPV < PPV1 and NPV < NPV1

More detail in: Biggerstaff B.J. (2000) Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 19:649-663.
Formal Model: Prospective study, comparison of two qualitative tests, T1 and T2, or combinations of them

All subjects (N):
           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D

Disease D+ (N1):
           T1 Pos  T1 Neg
  T2 Pos      a1      b1
  T2 Neg      c1      d1

Non-Disease D- (N0):
           T1 Pos  T1 Neg
  T2 Pos      a0      b0
  T2 Neg      c0      d0

a1 + a0 = A;  b1 + b0 = B;  c1 + c0 = C;  d1 + d0 = D;  N1 + N0 = N
Example: condition of interest - cervical disease; T1 - Pap test; T2 - biomarker; reference - colposcopy/biopsy.

All subjects (N = 7,000):
           Pap Pos  Pap Neg
  T2 Pos      43      285
  T2 Neg      71    6,601

Disease D+:
           T1 Pos  T1 Neg
  T2 Pos      13      15    (row total 28)
  T2 Neg       1
       (column total 14)

Non-Disease D-:
           T1 Pos  T1 Neg
  T2 Pos      30     270    (row total 300)
  T2 Neg      70
       (column total 100)
Verification Bias

In studies for the evaluation of diagnostic devices, sometimes the reference ("gold") standard is not applied to all study subjects.

If the process by which subjects were selected for verification depends on the results of the medical tests, then the statistical analysis of the accuracies of these medical tests without the proper corrections is biased. This bias is often referred to as verification bias (or by its variants: work-up bias, referral bias, and validation bias).
I. Ratio of True Positive Rates and Ratio of False Positive Rates

Estimates of sensitivities and specificities based only on verified results are biased. Not all subjects (or none) with both negative results were verified by the reference method; the unverified cells are shown in brackets:

All subjects (N):
           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D

Disease D+ ([N1]):
           T1 Pos  T1 Neg
  T2 Pos      a1      b1
  T2 Neg      c1     [d1]

Non-Disease D- ([N0]):
           T1 Pos  T1 Neg
  T2 Pos      a0      b0
  T2 Neg      c0     [d0]

Nevertheless, the ratio of sensitivities and the ratio of false positive rates are unbiased [2]:

  Ŝe(T2)/Ŝe(T1) = (a1 + b1)/(a1 + c1)
  [1 - Ŝp(T2)]/[1 - Ŝp(T1)] = (a0 + b0)/(a0 + c0)

[2] Schatzkin A., Connor R.J., Taylor P.R., Bunnag B. (1987) Comparing new and old screening tests when a reference procedure cannot be performed on all screeners. American Journal of Epidemiology 125(4):672-678.
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Statement of the problem:
  Se2/Se1 = y2/y1 = Ry;  (1 - Sp2)/(1 - Sp1) = x2/x1 = Rx

Can we draw conclusions about the effectiveness of Test2 if we know only the ratio of true positive rates and the ratio of false positive rates between Test1 and Test2?

For the sake of simplicity, assume Test2 has the higher theoretical sensitivity, Se2/Se1 = Ry > 1 (true parameters, not estimates).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

A) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx < 1 (decrease in false positive rates).

For any Test1, Test2 is effective (superior to Test1).

[Figure: ROC space with Test1 at (x1, y1); Test2 lies up and to the left of Test1.]
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

B) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rates); Ry >= Rx > 1.

For any Test1, Test2 is effective (superior to Test1, because both PPV and NPV of Test2 are higher than those of Test1).

It is easy to show that PLR2 = Se2/(1 - Sp2) = (Ry/Rx)*PLR1, and then PLR2 >= PLR1.

[Figure: ROC space with Test1 at (x1, y1).]
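The identity PLR2 = (Ry/Rx)*PLR1 is easy to confirm numerically; a one-line check with hypothetical values for Test1 and the two ratios:

```python
x1, y1 = 0.10, 0.60          # Test1: FPR, TPR (hypothetical)
ry, rx = 1.5, 1.2            # Se2/Se1 and (1 - Sp2)/(1 - Sp1)
x2, y2 = rx * x1, ry * y1    # Test2's ROC-space point
plr1, plr2 = y1 / x1, y2 / x2
assert abs(plr2 - (ry / rx) * plr1) < 1e-12  # PLR2 = (Ry/Rx) * PLR1
```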
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Example: condition of interest - cervical disease; T1 - Pap test; T2 - biomarker; reference - colposcopy/biopsy.

All subjects (N = 7,000):
           Pap Pos  Pap Neg
  T2 Pos      43      285
  T2 Neg      71    6,601

Disease D+:
           T1 Pos  T1 Neg
  T2 Pos      13      15    (row total 28)
  T2 Neg       1
       (column total 14)

Non-Disease D-:
           T1 Pos  T1 Neg
  T2 Pos      30     270    (row total 300)
  T2 Neg      70
       (column total 100)

  Ŝe2/Ŝe1 = 28/14 = 2.0
  (1 - Ŝp2)/(1 - Ŝp1) = 300/100 = 3.0
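The two ratio estimates follow directly from the verified cells of this example; note that the doubly negative cell d is never used:

```python
# Verified cells from the example: D+ (a1, b1, c1) and D- (a0, b0, c0).
a1, b1, c1 = 13, 15, 1
a0, b0, c0 = 30, 270, 70
ratio_se  = (a1 + b1) / (a1 + c1)   # Se2/Se1 = 28/14
ratio_fpr = (a0 + b0) / (a0 + c0)   # (1-Sp2)/(1-Sp1) = 300/100
print(ratio_se, ratio_fpr)          # 2.0 3.0
```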
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

C) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rates); Ry < Rx.

The increase in false positive rates is greater than the increase in true positive rates.

Can we draw conclusions about the effectiveness of Test2?

[Figure: ROC space with Test1 at (x1, y1) and the line of the combination "T1 OR Random Test".]
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

[Figure: ROC space with Test1 at (x1, y1) and the line of the combination "T1 OR Random Test".]

Theorem: Test2 is above the line of the combination "T1 OR Random Test" if (Rx - 1)/(Ry - 1) < PLR1/NLR1.

Example: Ry = 2 and Rx = 3, so (Rx - 1)/(Ry - 1) = (3 - 1)/(2 - 1) = 2.

The conclusion depends on the accuracy of T1: if PLR1/NLR1 > 2, then T2 is superior for confirming absence of disease (NPV↑, PPV↓); if PLR1/NLR1 < 2, then T2 is inferior overall (NPV↓, PPV↓).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

C) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rates); Ry < Rx (the increase in FPR is greater than the increase in TPR).

For situation C: in order to draw conclusions about the effectiveness of Test2, we need information about the diagnostic accuracy of Test1.
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Se2/Se1 = Ry > 1 implies Se1 <= 1/Ry; (1 - Sp2)/(1 - Sp1) = Rx > 1 implies 1 - Sp1 <= 1/Rx.

[Figure: ROC space restricted to the rectangle x1 <= 1/Rx, y1 <= 1/Ry, split by the hyperbola (Rx - 1)/(Ry - 1) = PLR1/NLR1, i.e. y1 = (Rx - 1)*x1 / [(Rx - 1)*x1 + (Ry - 1)*(1 - x1)].]

If T1 is in the green area (above the hyperbola), then T2 is superior for confirming absence of disease (higher NPV and lower PPV).

If T1 is in the red area (below the hyperbola), then T2 is inferior overall (lower NPV and lower PPV).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Summary:

If, in a clinical study comparing the accuracies of two tests, Test2 and Test1, the increase in TP rates of Test2 is anticipated to be statistically higher than the increase in FP rates, then conclusions about the effectiveness of Test2 can be made without information about the diagnostic accuracy of Test1.

In most practical situations, when the increase in FP rates of Test2 is anticipated to be higher than the increase in TP rates (or the sample size is not large enough to demonstrate that the increase in TP is statistically higher than the increase in FP), information about the diagnostic accuracy of Test1 is needed in order to draw conclusions about the effectiveness of Test2.
II. Verification Bias: Subjects Negative on Both Tests

If a random sample of the subjects with both negative test results is verified by the reference standard, then unbiased estimates of the sensitivities and specificities of Test1 and Test2 can be constructed.

All subjects (N):
           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D

Disease D+ ([N1]):
           T1 Pos  T1 Neg
  T2 Pos      a1      b1
  T2 Neg      c1     [d1]

Non-Disease D- ([N0]):
           T1 Pos  T1 Neg
  T2 Pos      a0      b0
  T2 Neg      c0     [d0]
II. Verification Bias: Bias Correction

Verification bias correction procedures:
1. Begg C.B., Greenes R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39:207-215.
2. Hawkins D.M., Garrett J.A., Stephenson B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20:1987-2001.

Multiple Imputation
• The absence of disease status for some subjects can be considered a missing-data problem.
• Multiple imputation is a Monte Carlo technique in which the missing disease status of each subject is replaced by simulated plausible values based on the observed data; each imputed dataset is analyzed separately and the diagnostic accuracies of the tests are evaluated. The results are then combined to produce estimates and confidence intervals that incorporate the uncertainty due to the missing verified disease status.
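As an illustration of the multiple-imputation idea only (a simplified sketch, not the Begg-Greenes or Hawkins et al. procedures; all counts are hypothetical): suppose that of the subjects negative on both tests, a verified random subsample of 200 contained 10 diseased subjects, and 800 subjects remain unverified. The diseased count in the cell can be multiply imputed:

```python
import random

rng = random.Random(42)

n_ver, k_pos = 200, 10   # verified random subsample of the doubly negative cell
n_unver = 800            # unverified remainder of that cell
m = 20                   # number of imputed datasets

totals = []
for _ in range(m):
    # Step 1: draw a plausible disease rate by resampling the verified
    # subsample (captures estimation uncertainty, bootstrap-style).
    p = sum(rng.random() < k_pos / n_ver for _ in range(n_ver)) / n_ver
    # Step 2: impute a disease status for each unverified subject.
    d_imputed = sum(rng.random() < p for _ in range(n_unver))
    totals.append(k_pos + d_imputed)

# Combine across imputations (the point-estimate part of Rubin's rules):
mi_total = sum(totals) / m   # expected to be near 10 + 800 * 0.05 = 50
```

In a full analysis, each imputed dataset would yield its own sensitivity and specificity estimates, combined with both within- and between-imputation variance.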
II. Verification Bias: Subjects Negative on Both Tests (cont.)

Usually, according to the study protocol, all subjects in subsets A, B, and C should have verified disease status, and verification bias concerns the subjects for whom both test results are negative.

In practice, however, not all subjects in subsets A, B, and C may be compliant with disease verification:

           T1 Pos             T1 Neg
  T2 Pos   A (70% verified)   B (50% verified)
  T2 Neg   C (30% verified)   D

Verification bias!
III. Different Types of Missingness

In order to adjust correctly for verification bias, the type of missingness should be investigated.

Missing-data mechanisms:
Missing Completely At Random (MCAR): missingness is unrelated to the values of any variables (whether disease status or other observed variables).
Missing At Random (MAR): missingness is unrelated to the disease status but may be related to the observed values of other variables.

For details, see Little R.J.A., Rubin D.B. (1987) Statistical Analysis with Missing Data. New York: John Wiley.
III. Different Types of Missingness (cont.)

Example: prospective study for prostate cancer. 5,000 men were screened with digital rectal exam (DRE) and prostate-specific antigen (PSA) assay. DRE results: Positive, Negative. PSA, a quantitative test, is dichotomized at a threshold of 4 ng/mL: Positive (PSA > 4), Negative (PSA ≤ 4). D+ = prostate cancer; D- = no prostate cancer (reference standard: biopsy).

All subjects (N = 5,000):
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)
Subjects with Verified Disease Status

D+ (positive biopsy):
           DRE+   DRE-
  PSA+       60    110
  PSA-       25    n/a

D- (negative biopsy):
           DRE+   DRE-
  PSA+       45    265
  PSA-       50    n/a

All subjects:
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)
27
• Do the subjects without biopsies differ from the subjects with
biopsies?
Propensity score = conditional probability that the subject underwent the verification of disease (biopsy in this example) given a collection of observed covariates (the quantitative value of the PSA test, Age, Race and so on).
Statistical modeling of relationship between membership in the group of verified subjects by logistic regression:
outcome – underwent verification (biopsy): yes, no predictor – PSAQuantitative, covariates.
III. Different Types of Missingness (cont.)
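The propensity model above can be sketched with a logistic regression of verification status on the quantitative PSA value. The data below are synthetic, and a tiny gradient-ascent fit stands in for a statistics package:

```python
import math, random

rng = random.Random(0)

# Synthetic subgroup: verification probability rises with the PSA value.
data = []
for _ in range(500):
    psa = rng.uniform(0.0, 4.0)
    p_verify = 1 / (1 + math.exp(-(-2.0 + 1.0 * psa)))
    data.append((psa, 1 if rng.random() < p_verify else 0))

# Fit P(verified | PSA) = logistic(b0 + b1*PSA) by plain gradient ascent.
b0, b1 = 0.0, 0.0
for _ in range(500):
    g0 = g1 = 0.0
    for psa, v in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * psa)))
        g0 += v - p
        g1 += (v - p) * psa
    b0 += 0.01 * g0 / len(data)
    b1 += 0.01 * g1 / len(data)

propensity = lambda psa: 1 / (1 + math.exp(-(b0 + b1 * psa)))
# The fitted propensity should be higher at high PSA than at low PSA.
assert propensity(3.5) > propensity(0.5)
```

In practice one would use a proper fitting routine (e.g. a maximum-likelihood logistic regression with additional covariates such as age and race); the point here is only the shape of the model.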
III. Different Types of Missingness (cont.)

All subjects (N = 5,000):
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)

For subgroup A (PSA+, DRE+), the probability that a subject has a missing biopsy does not appear to depend on either the PSA value or the observed covariates (age, race). Type of missingness: Missing Completely At Random. Similarly for subgroup B (PSA+, DRE-).
III. Different Types of Missingness (cont.)

All subjects (N = 5,000):
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)

For subgroup C (PSA-, DRE+), the probability that a subject has a missing biopsy does depend on the quantitative PSA value; the PSA value is a significant predictor of biopsy missingness in this subgroup (the larger the PSA value, the lower the probability of a missing biopsy). Type of missingness: Missing At Random.
III. Different Types of Missingness (cont.)

Adjusted counts of diseased and non-diseased subjects; bracketed values result from adjustment without proper investigation of the type of missingness:

D+:
           DRE+       DRE-
  PSA+        86       220
  PSA-   50 [83]       n/a

D-:
           DRE+       DRE-
  PSA+        64       530
  PSA-  200 [167]      n/a

Adjustment for verification without proper investigation of the type of missingness (biased estimates):
  Ŝe(PSA)/Ŝe(DRE) = 306/169 = 1.81
  [1 - Ŝp(PSA)]/[1 - Ŝp(DRE)] = 594/231 = 2.57

Adjustment for verification taking into account the different types of missingness (unbiased estimates):
  Ŝe(PSA)/Ŝe(DRE) = 306/136 = 2.25
  [1 - Ŝp(PSA)]/[1 - Ŝp(DRE)] = 594/264 = 2.25
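The two sets of ratio estimates can be recomputed from the adjusted cell counts; a short sketch (cell labels follow the earlier 2x2 notation, with DRE as Test1 and PSA as Test2):

```python
def ratios(a1, b1, c1, a0, b0, c0):
    """Ratio of sensitivities and of false positive rates (PSA vs DRE)."""
    return (round((a1 + b1) / (a1 + c1), 2),
            round((a0 + b0) / (a0 + c0), 2))

# Assuming MCAR in every subgroup (improper adjustment):
biased = ratios(86, 220, 83, 64, 530, 167)    # (1.81, 2.57)
# Modeling MAR in the PSA-/DRE+ subgroup (proper adjustment):
unbiased = ratios(86, 220, 50, 64, 530, 200)  # (2.25, 2.25)
```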
III. Different Types of Missingness (cont.)

Correct adjustment for verification bias produces estimates showing that the increase in FP rates for the new test (PSA) is about the same as the increase in TP rates, whereas incorrect adjustment suggested that the increase in FP rates was larger than the increase in TP rates.

Thus, naïve estimation of the disease risk in subgroup C, based on the assumption that the missing biopsy results were Missing Completely At Random, produces a biased estimate of the performance of the new PSA test (an underestimate of its performance).

For proper adjustment, information on the distribution of test results among the subjects not selected for verification should be available.
Summary

• In most practical situations, estimation of only the ratios of true positive and false positive rates does not allow one to draw conclusions about the effectiveness of a test.

• The absence of disease status can be considered a missing-data problem. The multiple imputation technique can be used to correct for verification bias. Information on the distribution of test results among the subjects not selected for verification should be available.

• The type of missingness should be investigated in order to obtain unbiased estimates of the performance of the medical tests. All subsets of subjects should be checked for missing disease status.

• Precision of the estimated diagnostic accuracies depends primarily on the number of verified cases available for statistical analysis.
References

1. Begg C.B., Greenes R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39:207-215.
2. Biggerstaff B.J. (2000) Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 19:649-663.
3. Hawkins D.M., Garrett J.A., Stephenson B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20:1987-2001.
4. Kondratovich M.V. (2003) Verification bias in the evaluation of diagnostic tests. Proceedings of the 2003 Joint Statistical Meeting, Biopharmaceutical Section, San Francisco, CA.
5. Ransohoff D.F., Feinstein A.R. (1978) Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine 299:926-930.
6. Schatzkin A., Connor R.J., Taylor P.R., Bunnag B. (1987) Comparing new and old screening tests when a reference procedure cannot be performed on all screeners. American Journal of Epidemiology 125(4):672-678.
7. Zhou X. (1994) Effect of verification bias on positive and negative predictive values. Statistics in Medicine 13:1737-1745.
8. Zhou X. (1998) Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research 7:337-353.
9. http://www.fda.gov/cdrh/pdf/p930027s004b.pdf