Comparing Diagnostic Accuracies of Two Tests in
Studies with Verification Bias
Marina Kondratovich, Ph.D.
Division of Biostatistics,
Center for Devices and Radiological Health,
U.S. Food and Drug Administration.
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.
September, 2005
Outline
Introduction: examples, diagnostic accuracy, verification bias
I. Ratio of true positive rates and ratio of false positive rates
II. Multiple imputation
III. Types of missingness in subsets
Summary
Comparison of two qualitative tests, T1 and T2, or combinations of them

           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D
                          (total N)

Examples:
• Cervical cancer: T1 - Pap test (categorical values); T2 - HPV test (qualitative test); reference method - colposcopy/biopsy.
• Prostate cancer: T1 - DRE (qualitative test); T2 - PSA (quantitative test with cutoff of 4 ng/mL); reference method - biopsy.
• Abnormal cells on a Pap slide: T1 - manual reading of a Pap slide; T2 - computer-aided reading of a Pap slide; reference method - reading of the slide by an Adjudication Committee.
Diagnostic Accuracy of a Medical Test

[Figure: ROC space, Se (y-axis) vs 1 - Sp (x-axis), with Test1 at the point (x1, y1); θ1 is the angle of the line from the origin to (x1, y1), θ2 the angle of the line from (x1, y1) to (1, 1).]

Pair (Sensitivity = TPR, Specificity = TNR) as the point (x1, y1), where
  x1 = FPR = 1 - Sp1
  y1 = TPR = Se1

Pair of likelihood ratios:
  PLR1 = Se1/(1 - Sp1) = y1/x1 = tangent of θ1 (slope of line); related to PPV
  NLR1 = (1 - Se1)/Sp1 = (1 - y1)/(1 - x1) = tangent of θ2 (slope of line); related to NPV
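The ROC-space point and the likelihood-ratio pair defined above can be computed directly. A minimal sketch in Python, using Se1 = 0.80 and Sp1 = 0.90 as hypothetical values:

```python
def roc_point(se, sp):
    """ROC-space coordinates: (x, y) = (FPR, TPR) = (1 - Sp, Se)."""
    return (1.0 - sp, se)

def likelihood_ratios(se, sp):
    """(PLR, NLR) = (Se / (1 - Sp), (1 - Se) / Sp)."""
    return (se / (1.0 - sp), (1.0 - se) / sp)

# Hypothetical Test1 with Se1 = 0.80, Sp1 = 0.90:
x1, y1 = roc_point(0.80, 0.90)
plr1, nlr1 = likelihood_ratios(0.80, 0.90)
# PLR1 = 0.80/0.10 = 8.0 is the slope of the line from the origin to (x1, y1);
# NLR1 = 0.20/0.90 is the slope of the line from (x1, y1) to (1, 1).
```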
Boolean Combinations "OR" and "AND" of T1 and a Random Test

[Figure: ROC space with Test1 at (x1, y1) and the line y - y1 = NLR1*(x - x1) = [(1 - y1)/(1 - x1)]*(x - x1) through (x1, y1) and (1, 1).]

Random Test: positive with probability α, negative with probability 1 - α.

Combination "T1 OR Random Test":
  Se_OR = Se1 + (1 - Se1)*α = y1 + (1 - y1)*α
  Sp_OR = Sp1*(1 - α) = (1 - x1)*(1 - α)

NLR(T1 OR Random Test) = (1 - y1)/(1 - x1) = NLR1, so the combination moves along this line as α varies.
Boolean Combinations "OR" and "AND" of T1 and a Random Test (cont.)

[Figure: ROC space with Test1 at (x1, y1) and the line y - y1 = PLR1*(x - x1) = (y1/x1)*(x - x1) through the origin and (x1, y1).]

Random Test: positive with probability α, negative with probability 1 - α.

Combination "T1 AND Random Test":
  Se_AND = Se1*α = y1*α
  Sp_AND = Sp1 + (1 - Sp1)*(1 - α) = (1 - x1) + x1*(1 - α)

PLR(T1 AND Random Test) = y1/x1 = PLR1, so the combination moves along this line as α varies.
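The two invariants above ("OR" with a random test preserves NLR; "AND" preserves PLR) can be checked numerically. A small sketch with hypothetical values Se1 = 0.7, Sp1 = 0.85, α = 0.3:

```python
def or_random(se1, sp1, alpha):
    se = se1 + (1 - se1) * alpha   # positive if T1 is + or the coin is +
    sp = sp1 * (1 - alpha)         # negative only if T1 is - and the coin is -
    return se, sp

def and_random(se1, sp1, alpha):
    se = se1 * alpha               # positive only if T1 is + and the coin is +
    sp = sp1 + (1 - sp1) * (1 - alpha)
    return se, sp

nlr = lambda se, sp: (1 - se) / sp
plr = lambda se, sp: se / (1 - sp)

se1, sp1, alpha = 0.7, 0.85, 0.3
se_or, sp_or = or_random(se1, sp1, alpha)
se_and, sp_and = and_random(se1, sp1, alpha)
assert abs(nlr(se_or, sp_or) - nlr(se1, sp1)) < 1e-9    # NLR preserved by OR
assert abs(plr(se_and, sp_and) - plr(se1, sp1)) < 1e-9  # PLR preserved by AND
```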
Comparing Medical Tests

[Figure: ROC space with Test1 at (x1, y1); the PLR1 line (through the origin and T1) and the NLR1 line (through T1 and (1, 1)) divide the space into four regions:]
• above both lines: PPV > PPV1 and NPV > NPV1
• above the NLR1 line only: PPV < PPV1, NPV > NPV1
• above the PLR1 line only: PPV > PPV1, NPV < NPV1
• below both lines: PPV < PPV1 and NPV < NPV1

More detail in: Biggerstaff B.J. (2000) Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 19:649-663.
Formal Model: Prospective study, comparison of two qualitative tests, T1 and T2, or combinations of them

All subjects (N):
           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D

Disease D+ (N1):
           T1 Pos  T1 Neg
  T2 Pos      a1      b1
  T2 Neg      c1      d1

Non-Disease D- (N0):
           T1 Pos  T1 Neg
  T2 Pos      a0      b0
  T2 Neg      c0      d0

a1 + a0 = A;  b1 + b0 = B;  c1 + c0 = C;  d1 + d0 = D;  N1 + N0 = N
Example: condition of interest - cervical disease; T1 - Pap test; T2 - biomarker; reference - colposcopy/biopsy.

All subjects (N = 7,000):
           Pap Pos  Pap Neg
  T2 Pos      43      285
  T2 Neg      71    6,601

Disease D+:
           T1 Pos  T1 Neg
  T2 Pos      13      15    (row total 28)
  T2 Neg       1
       (column total 14)

Non-Disease D-:
           T1 Pos  T1 Neg
  T2 Pos      30     270    (row total 300)
  T2 Neg      70
       (column total 100)
Verification Bias

In studies for the evaluation of diagnostic devices, sometimes the reference ("gold") standard is not applied to all study subjects.

If the process by which subjects were selected for verification depends on the results of the medical tests, then the statistical analysis of the accuracies of these medical tests without the proper corrections is biased. This bias is often referred to as verification bias (or by its variants: work-up bias, referral bias, and validation bias).
I. Ratio of True Positive Rates and Ratio of False Positive Rates

Estimates of sensitivities and specificities based only on verified results are biased. Not all subjects (or none) with both negative results were verified by the reference method; the unverified cells are shown in brackets:

All subjects (N):
           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D

Disease D+ ([N1]):
           T1 Pos  T1 Neg
  T2 Pos      a1      b1
  T2 Neg      c1     [d1]

Non-Disease D- ([N0]):
           T1 Pos  T1 Neg
  T2 Pos      a0      b0
  T2 Neg      c0     [d0]

Nevertheless, the ratio of sensitivities and the ratio of false positive rates are unbiased [2]:

  Ŝe(T2)/Ŝe(T1) = (a1 + b1)/(a1 + c1)
  [1 - Ŝp(T2)]/[1 - Ŝp(T1)] = (a0 + b0)/(a0 + c0)

[2] Schatzkin A., Connor R.J., Taylor P.R., Bunnag B. (1987) Comparing new and old screening tests when a reference procedure cannot be performed on all screeners. American Journal of Epidemiology 125(4):672-678.
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Statement of the problem:
  Se2/Se1 = y2/y1 = Ry;  (1 - Sp2)/(1 - Sp1) = x2/x1 = Rx

Can we draw conclusions about the effectiveness of Test2 if we know only the ratio of true positive rates and the ratio of false positive rates between Test1 and Test2?

For the sake of simplicity, assume Test2 has the higher theoretical sensitivity, Se2/Se1 = Ry > 1 (true parameters, not estimates).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

A) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx < 1 (decrease in false positive rates).

For any Test1, Test2 is effective (superior to Test1).

[Figure: ROC space with Test1 at (x1, y1); Test2 lies up and to the left of Test1.]
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

B) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rates); Ry >= Rx > 1.

For any Test1, Test2 is effective (superior to Test1, because both PPV and NPV of Test2 are higher than those of Test1).

It is easy to show that PLR2 = Se2/(1 - Sp2) = (Ry/Rx)*PLR1, and then PLR2 >= PLR1.

[Figure: ROC space with Test1 at (x1, y1).]
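The identity PLR2 = (Ry/Rx)*PLR1 is easy to confirm numerically; a one-line check with hypothetical values for Test1 and the two ratios:

```python
x1, y1 = 0.10, 0.60          # Test1: FPR, TPR (hypothetical)
ry, rx = 1.5, 1.2            # Se2/Se1 and (1 - Sp2)/(1 - Sp1)
x2, y2 = rx * x1, ry * y1    # Test2's ROC-space point
plr1, plr2 = y1 / x1, y2 / x2
assert abs(plr2 - (ry / rx) * plr1) < 1e-12  # PLR2 = (Ry/Rx) * PLR1
```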
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Example: condition of interest - cervical disease; T1 - Pap test; T2 - biomarker; reference - colposcopy/biopsy.

All subjects (N = 7,000):
           Pap Pos  Pap Neg
  T2 Pos      43      285
  T2 Neg      71    6,601

Disease D+:
           T1 Pos  T1 Neg
  T2 Pos      13      15    (row total 28)
  T2 Neg       1
       (column total 14)

Non-Disease D-:
           T1 Pos  T1 Neg
  T2 Pos      30     270    (row total 300)
  T2 Neg      70
       (column total 100)

  Ŝe2/Ŝe1 = 28/14 = 2.0
  (1 - Ŝp2)/(1 - Ŝp1) = 300/100 = 3.0
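The two ratio estimates follow directly from the verified cells of this example; note that the doubly negative cell d is never used:

```python
# Verified cells from the example: D+ (a1, b1, c1) and D- (a0, b0, c0).
a1, b1, c1 = 13, 15, 1
a0, b0, c0 = 30, 270, 70
ratio_se  = (a1 + b1) / (a1 + c1)   # Se2/Se1 = 28/14
ratio_fpr = (a0 + b0) / (a0 + c0)   # (1-Sp2)/(1-Sp1) = 300/100
print(ratio_se, ratio_fpr)          # 2.0 3.0
```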
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

C) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rates); Ry < Rx.

The increase in false positive rates is greater than the increase in true positive rates.

Can we draw conclusions about the effectiveness of Test2?

[Figure: ROC space with Test1 at (x1, y1) and the line of the combination "T1 OR Random Test".]
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

[Figure: ROC space with Test1 at (x1, y1) and the line of the combination "T1 OR Random Test".]

Theorem: Test2 is above the line of the combination "T1 OR Random Test" if (Rx - 1)/(Ry - 1) < PLR1/NLR1.

Example: Ry = 2 and Rx = 3, so (Rx - 1)/(Ry - 1) = (3 - 1)/(2 - 1) = 2.

The conclusion depends on the accuracy of T1: if PLR1/NLR1 > 2, then T2 is superior for confirming absence of disease (NPV↑, PPV↓); if PLR1/NLR1 < 2, then T2 is inferior overall (NPV↓, PPV↓).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

C) Se2/Se1 = Ry > 1 (increase in sensitivity); (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rates); Ry < Rx (the increase in FPR is greater than the increase in TPR).

For situation C: in order to draw conclusions about the effectiveness of Test2, we need information about the diagnostic accuracy of Test1.
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Se2/Se1 = Ry > 1 implies Se1 <= 1/Ry; (1 - Sp2)/(1 - Sp1) = Rx > 1 implies 1 - Sp1 <= 1/Rx.

[Figure: ROC space restricted to the rectangle x1 <= 1/Rx, y1 <= 1/Ry, split by the hyperbola (Rx - 1)/(Ry - 1) = PLR1/NLR1, i.e. y1 = (Rx - 1)*x1 / [(Rx - 1)*x1 + (Ry - 1)*(1 - x1)].]

If T1 is in the green area (above the hyperbola), then T2 is superior for confirming absence of disease (higher NPV and lower PPV).

If T1 is in the red area (below the hyperbola), then T2 is inferior overall (lower NPV and lower PPV).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)

Summary:

If, in a clinical study comparing the accuracies of two tests, Test2 and Test1, the increase in TP rates of Test2 is anticipated to be statistically higher than the increase in FP rates, then conclusions about the effectiveness of Test2 can be made without information about the diagnostic accuracy of Test1.

In most practical situations, when the increase in FP rates of Test2 is anticipated to be higher than the increase in TP rates (or the sample size is not large enough to demonstrate that the increase in TP is statistically higher than the increase in FP), information about the diagnostic accuracy of Test1 is needed in order to draw conclusions about the effectiveness of Test2.
II. Verification Bias: Subjects Negative on Both Tests

If a random sample of the subjects with both negative test results is verified by the reference standard, then unbiased estimates of the sensitivities and specificities of Test1 and Test2 can be constructed.

All subjects (N):
           T1 Pos  T1 Neg
  T2 Pos      A       B
  T2 Neg      C       D

Disease D+ ([N1]):
           T1 Pos  T1 Neg
  T2 Pos      a1      b1
  T2 Neg      c1     [d1]

Non-Disease D- ([N0]):
           T1 Pos  T1 Neg
  T2 Pos      a0      b0
  T2 Neg      c0     [d0]
II. Verification Bias: Bias Correction

Verification bias correction procedures:
1. Begg C.B., Greenes R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39:207-215.
2. Hawkins D.M., Garrett J.A., Stephenson B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20:1987-2001.

Multiple Imputation
• The absence of disease status for some subjects can be considered a missing-data problem.
• Multiple imputation is a Monte Carlo technique in which the missing disease status of each subject is replaced by simulated plausible values based on the observed data; each imputed dataset is analyzed separately and the diagnostic accuracies of the tests are evaluated. The results are then combined to produce estimates and confidence intervals that incorporate the uncertainty due to the missing verified disease status.
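As an illustration of the multiple-imputation idea only (a simplified sketch, not the Begg-Greenes or Hawkins et al. procedures; all counts are hypothetical): suppose that of the subjects negative on both tests, a verified random subsample of 200 contained 10 diseased subjects, and 800 subjects remain unverified. The diseased count in the cell can be multiply imputed:

```python
import random

rng = random.Random(42)

n_ver, k_pos = 200, 10   # verified random subsample of the doubly negative cell
n_unver = 800            # unverified remainder of that cell
m = 20                   # number of imputed datasets

totals = []
for _ in range(m):
    # Step 1: draw a plausible disease rate by resampling the verified
    # subsample (captures estimation uncertainty, bootstrap-style).
    p = sum(rng.random() < k_pos / n_ver for _ in range(n_ver)) / n_ver
    # Step 2: impute a disease status for each unverified subject.
    d_imputed = sum(rng.random() < p for _ in range(n_unver))
    totals.append(k_pos + d_imputed)

# Combine across imputations (the point-estimate part of Rubin's rules):
mi_total = sum(totals) / m   # expected to be near 10 + 800 * 0.05 = 50
```

In a full analysis, each imputed dataset would yield its own sensitivity and specificity estimates, combined with both within- and between-imputation variance.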
II. Verification Bias: Subjects Negative on Both Tests (cont.)

Usually, according to the study protocol, all subjects in subsets A, B, and C should have verified disease status, and verification bias concerns the subjects for whom both test results are negative.

In practice, however, not all subjects in subsets A, B, and C may be compliant with disease verification:

           T1 Pos             T1 Neg
  T2 Pos   A (70% verified)   B (50% verified)
  T2 Neg   C (30% verified)   D

Verification bias!
III. Different Types of Missingness

In order to adjust correctly for verification bias, the type of missingness should be investigated.

Missing-data mechanisms:
Missing Completely At Random (MCAR): missingness is unrelated to the values of any variables (whether disease status or other observed variables).
Missing At Random (MAR): missingness is unrelated to the disease status but may be related to the observed values of other variables.

For details, see Little R.J.A., Rubin D.B. (1987) Statistical Analysis with Missing Data. New York: John Wiley.
III. Different Types of Missingness (cont.)

Example: prospective study for prostate cancer. 5,000 men were screened with digital rectal exam (DRE) and prostate-specific antigen (PSA) assay. DRE results: Positive, Negative. PSA, a quantitative test, is dichotomized at a threshold of 4 ng/mL: Positive (PSA > 4), Negative (PSA ≤ 4). D+ = prostate cancer; D- = no prostate cancer (reference standard: biopsy).

All subjects (N = 5,000):
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)
Subjects with Verified Disease Status

D+ (positive biopsy):
           DRE+   DRE-
  PSA+       60    110
  PSA-       25    n/a

D- (negative biopsy):
           DRE+   DRE-
  PSA+       45    265
  PSA-       50    n/a

All subjects:
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)
27
• Do the subjects without biopsies differ from the subjects with
biopsies?
Propensity score = conditional probability that the subject underwent the verification of disease (biopsy in this example) given a collection of observed covariates (the quantitative value of the PSA test, Age, Race and so on).
Statistical modeling of relationship between membership in the group of verified subjects by logistic regression:
outcome – underwent verification (biopsy): yes, no predictor – PSAQuantitative, covariates.
III. Different Types of Missingness (cont.)
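The propensity model above can be sketched with a logistic regression of verification status on the quantitative PSA value. The data below are synthetic, and a tiny gradient-ascent fit stands in for a statistics package:

```python
import math, random

rng = random.Random(0)

# Synthetic subgroup: verification probability rises with the PSA value.
data = []
for _ in range(500):
    psa = rng.uniform(0.0, 4.0)
    p_verify = 1 / (1 + math.exp(-(-2.0 + 1.0 * psa)))
    data.append((psa, 1 if rng.random() < p_verify else 0))

# Fit P(verified | PSA) = logistic(b0 + b1*PSA) by plain gradient ascent.
b0, b1 = 0.0, 0.0
for _ in range(500):
    g0 = g1 = 0.0
    for psa, v in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * psa)))
        g0 += v - p
        g1 += (v - p) * psa
    b0 += 0.01 * g0 / len(data)
    b1 += 0.01 * g1 / len(data)

propensity = lambda psa: 1 / (1 + math.exp(-(b0 + b1 * psa)))
# The fitted propensity should be higher at high PSA than at low PSA.
assert propensity(3.5) > propensity(0.5)
```

In practice one would use a proper fitting routine (e.g. a maximum-likelihood logistic regression with additional covariates such as age and race); the point here is only the shape of the model.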
III. Different Types of Missingness (cont.)

All subjects (N = 5,000):
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)

For subgroup A (PSA+, DRE+), the probability that a subject has a missing biopsy does not appear to depend on either the PSA value or the observed covariates (age, race). Type of missingness: Missing Completely At Random. Similarly for subgroup B (PSA+, DRE-).
III. Different Types of Missingness (cont.)

All subjects (N = 5,000):
           DRE+                      DRE-
  PSA+   150 (105 biopsies, 70%)   750 (375 biopsies, 50%)
  PSA-   250 (75 biopsies, 30%)    3,850 (no biopsies)

For subgroup C (PSA-, DRE+), the probability that a subject has a missing biopsy does depend on the quantitative PSA value; the PSA value is a significant predictor of biopsy missingness in this subgroup (the larger the PSA value, the lower the probability of a missing biopsy). Type of missingness: Missing At Random.
III. Different Types of Missingness (cont.)

Adjusted counts of diseased and non-diseased subjects; bracketed values result from adjustment without proper investigation of the type of missingness:

D+:
           DRE+       DRE-
  PSA+        86       220
  PSA-   50 [83]       n/a

D-:
           DRE+       DRE-
  PSA+        64       530
  PSA-  200 [167]      n/a

Adjustment for verification without proper investigation of the type of missingness (biased estimates):
  Ŝe(PSA)/Ŝe(DRE) = 306/169 = 1.81
  [1 - Ŝp(PSA)]/[1 - Ŝp(DRE)] = 594/231 = 2.57

Adjustment for verification taking into account the different types of missingness (unbiased estimates):
  Ŝe(PSA)/Ŝe(DRE) = 306/136 = 2.25
  [1 - Ŝp(PSA)]/[1 - Ŝp(DRE)] = 594/264 = 2.25
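The two sets of ratio estimates can be recomputed from the adjusted cell counts; a short sketch (cell labels follow the earlier 2x2 notation, with DRE as Test1 and PSA as Test2):

```python
def ratios(a1, b1, c1, a0, b0, c0):
    """Ratio of sensitivities and of false positive rates (PSA vs DRE)."""
    return (round((a1 + b1) / (a1 + c1), 2),
            round((a0 + b0) / (a0 + c0), 2))

# Assuming MCAR in every subgroup (improper adjustment):
biased = ratios(86, 220, 83, 64, 530, 167)    # (1.81, 2.57)
# Modeling MAR in the PSA-/DRE+ subgroup (proper adjustment):
unbiased = ratios(86, 220, 50, 64, 530, 200)  # (2.25, 2.25)
```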
III. Different Types of Missingness (cont.)

Correct adjustment for verification bias produces estimates showing that the increase in FP rates for the new test (PSA) is about the same as the increase in TP rates, whereas incorrect adjustment suggested that the increase in FP rates was larger than the increase in TP rates.

Thus, naïve estimation of the disease risk in subgroup C, based on the assumption that the missing biopsy results were Missing Completely At Random, produces a biased estimate of the performance of the new PSA test (an underestimate of its performance).

For proper adjustment, information on the distribution of test results among the subjects not selected for verification should be available.
Summary

• In most practical situations, estimation of only the ratios of true positive and false positive rates does not allow one to draw conclusions about the effectiveness of a test.

• The absence of disease status can be considered a missing-data problem. The multiple imputation technique can be used to correct for verification bias. Information on the distribution of test results among the subjects not selected for verification should be available.

• The type of missingness should be investigated in order to obtain unbiased estimates of the performance of the medical tests. All subsets of subjects should be checked for missing disease status.

• Precision of the estimated diagnostic accuracies depends primarily on the number of verified cases available for statistical analysis.
References

1. Begg C.B., Greenes R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39:207-215.
2. Biggerstaff B.J. (2000) Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 19:649-663.
3. Hawkins D.M., Garrett J.A., Stephenson B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20:1987-2001.
4. Kondratovich M.V. (2003) Verification bias in the evaluation of diagnostic tests. Proceedings of the 2003 Joint Statistical Meeting, Biopharmaceutical Section, San Francisco, CA.
5. Ransohoff D.F., Feinstein A.R. (1978) Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine 299:926-930.
6. Schatzkin A., Connor R.J., Taylor P.R., Bunnag B. (1987) Comparing new and old screening tests when a reference procedure cannot be performed on all screeners. American Journal of Epidemiology 125(4):672-678.
7. Zhou X. (1994) Effect of verification bias on positive and negative predictive values. Statistics in Medicine 13:1737-1745.
8. Zhou X. (1998) Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research 7:337-353.
9. http://www.fda.gov/cdrh/pdf/p930027s004b.pdf