STATISTICS IN MEDICINE
Statist. Med. 2001; 20:907–915

Simultaneous comparison of sensitivity and specificity of two tests in the paired design: a straightforward graphical approach

Robert G. Newcombe∗

University of Wales College of Medicine, Heath Park, Cardiff, CF14 4XN, U.K.

SUMMARY

Often the performances of two binary diagnostic or screening tests are compared by applying them to the same set of subjects, some of whom are affected, some unaffected. The McNemar test, and corresponding interval estimation methods, may be used to compare the sensitivity of the two tests, but this disregards both any observed difference in specificity and its imprecision due to sampling variation. The suggested approach is to display point and interval estimates for a weighted mean $f$ of the differences in sensitivity and specificity between the two tests. The mixing parameter $\lambda$, which is allowed to range from 0 to 1, represents the prevalence in the population to which application is envisaged, together with the relative seriousness of false positives and false negatives. The confidence interval for $f$ is obtained by a simple extension of a closed-form method for the paired difference of proportions, which has favourable coverage properties and is based on the Wilson single proportion score method. A plot of $f$ against $\lambda$ is readily obtained using a Minitab macro. Copyright © 2001 John Wiley & Sons, Ltd.

1. INTRODUCTION

The basic measures used to characterize the performance of a diagnostic or screening test are its sensitivity and specificity. Generally the unit of application is the individual subject, though similar principles apply more widely in much the same way, for example to single teeth or to manufactured components. It is assumed that the true state of the subject is a binary attribute, affected or unaffected, determined according to some ‘gold standard’ criterion.

The proposed test likewise yields a binary result, either positive (suggesting affected status) or negative (suggesting unaffected). Then the sensitivity is the proportion of truly affected subjects who are classed as positive by the test. Conversely the specificity is the proportion of truly unaffected subjects who are classed as negative by the test.

Thus in the paired design in which two independent tests T1 and T2 are evaluated in relation to the gold standard G, we may compare T1 and T2 for sensitivity. In the notation of Table I, the sensitivities of the two tests are $\pi_{1\cdot}$ and $\pi_{\cdot 1}$, dot notation indicating summation with respect to the relevant suffix. The difference in sensitivity between the two tests is $\pi_{1\cdot} - \pi_{\cdot 1} = \pi_{12} - \pi_{21}$.

∗ Correspondence to: Robert G. Newcombe, University of Wales College of Medicine, Heath Park, Cardiff, CF14 4XN, U.K.

Copyright © 2001 John Wiley & Sons, Ltd. Accepted May 1999


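Table I. Notation for comparison of two independent tests in the paired design.

True proportions

Affected                       Test 2
                    +ve                −ve                Total
Test 1    +ve       $\pi_{11}$         $\pi_{12}$         $\pi_{1\cdot}$
          −ve       $\pi_{21}$         $\pi_{22}$         $\pi_{2\cdot}$
          Total     $\pi_{\cdot 1}$    $\pi_{\cdot 2}$    1

Difference in sensitivity: $\theta_1 = \pi_{1\cdot} - \pi_{\cdot 1} = \pi_{12} - \pi_{21}$

Unaffected                     Test 2
                    +ve                −ve                Total
Test 1    +ve       $\phi_{11}$        $\phi_{12}$        $\phi_{1\cdot}$
          −ve       $\phi_{21}$        $\phi_{22}$        $\phi_{2\cdot}$
          Total     $\phi_{\cdot 1}$   $\phi_{\cdot 2}$   1

Difference in specificity: $\theta_2 = \phi_{2\cdot} - \phi_{\cdot 2} = \phi_{21} - \phi_{12}$

Observed frequencies

Affected                       Test 2
                    +ve                −ve                Total
Test 1    +ve       $a_{11}$           $a_{12}$           $a_{1\cdot}$
          −ve       $a_{21}$           $a_{22}$           $a_{2\cdot}$
          Total     $a_{\cdot 1}$      $a_{\cdot 2}$      $M$

Unaffected                     Test 2
                    +ve                −ve                Total
Test 1    +ve       $b_{11}$           $b_{12}$           $b_{1\cdot}$
          −ve       $b_{21}$           $b_{22}$           $b_{2\cdot}$
          Total     $b_{\cdot 1}$      $b_{\cdot 2}$      $N - M$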

This difference is denoted by $\theta_1$. The familiar McNemar method [1] may then be used to test $H_0: \theta_1 = 0$. The test is based on the discordant cell frequencies $a_{12}$ and $a_{21}$. A p-value may be obtained from an accumulation of precisely calculated tail probabilities, on either an ‘exact’ or mid-p [2, 3] basis. Alternatively an asymptotic 1 d.f. $X^2 = (a_{12} - a_{21})^2/(a_{12} + a_{21})$, or equivalent $z$ statistic, may be calculated; a continuity correction may be incorporated.

Traditional methods of setting a confidence interval for $\theta_1$ are seriously flawed, and much more appropriate ones are developed in a previous article [4]. These may be adapted [4] to permit equivalence testing: a one-sided test of $H_0: \theta_1 = \delta$ versus $H_1: \theta_1 < \delta$ may be used, or two one-sided tests for $\pm\delta$, for some specified $\delta$. All the above methods can also be applied, independently, to the difference in specificity, $\theta_2 = \phi_{2\cdot} - \phi_{\cdot 2} = \phi_{21} - \phi_{12}$ (with $\phi_{ij}$ the true proportions among unaffected subjects, as in Table I).

Lu and Bean [5] also considered the issue of testing equivalence of sensitivity. They commented that equivalence should focus on both sensitivity and specificity, and suggested that equivalence in specificity should be assessed by applying the same methods to a series of normal subjects. They noted that the relative importance of sensitivity and specificity varies according to clinical context. The purpose of this article is to develop a simple approach that permits comparison of sensitivity and specificity simultaneously.
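As a rough illustration of the asymptotic McNemar statistic mentioned above, the following Python sketch computes $X^2$ from the two discordant frequencies and converts it to a two-sided p-value using the 1 d.f. chi-square tail. The function name and the discordant counts are hypothetical and chosen only for illustration; the paper does not report the paired cell frequencies needed to reproduce its own analyses here. The exact and mid-p variants are not covered.

```python
import math


def mcnemar_chi2(a12, a21, continuity=False):
    """Asymptotic 1 d.f. McNemar statistic and two-sided p-value.

    a12, a21: discordant cell frequencies (test 1 +ve / test 2 -ve, and
    the reverse) among the affected subjects.
    """
    n_discordant = a12 + a21
    if n_discordant == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    diff = abs(a12 - a21)
    if continuity:
        # Usual continuity correction: shrink |a12 - a21| by 1, not below 0.
        diff = max(diff - 1, 0)
    x2 = diff ** 2 / n_discordant
    # Upper tail of chi-square with 1 d.f.: P(X2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(x2 / 2))
    return x2, p_value


# Hypothetical discordant counts, not taken from the paper.
x2, p = mcnemar_chi2(a12=8, a21=3)
print(f"X2 = {x2:.3f}, p = {p:.3f}")
```

The same function applies unchanged to the unaffected subjects by passing the discordant frequencies $b_{21}$ and $b_{12}$.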

2. EXAMPLES

Faecal occult blood tests (FOBTs) are potentially useful in screening asymptomatic populations to facilitate detection of subclinical colorectal neoplasia. High sensitivity and specificity are a prerequisite to usefulness for a screening programme, but are not readily achieved. The tests work by detecting substances present in small quantities of blood that may be released from lesions. Bleeding may arise not only from carcinomas but also from adenomatous polyps, which are not strictly malignant but have the potential for malignant transformation. An FOBT may also yield a false positive result due to residues of certain dietary items, even though subjects are instructed to avoid these in preparation for the test. Low sensitivity arises primarily because not all early lesions bleed continuously.

It is convenient (though not ideal) to assess the performance of FOBTs by studying series of patients whose clinical presentation indicates more invasive methods are required. Hope et al. [6] applied three FOBTs, Monohaem (MH), Hemoccult II (HOII) and BM-Test Colon Albumin (BMCA) to a series of patients who subsequently underwent colonoscopy, which is here regarded as definitive. Results of MH and HOII were available for 160 patients, of whom 24 were affected; BMCA results were available for 108 of them, including 16 affected, as shown in Table II.

Table III gives results for the ability of two tests, the inactive urinary kallikrein:creatinine ratio (IUK:Cr) and the angiotensin sensitivity test (AST), to predict pre-eclampsia [7].

Table II. Comparison of three faecal occult blood tests: 160 subjects had tests MH and HOII, 108 also had BMCA.

Test        Sensitivity        Specificity

Subjects with MH and HOII results (n = 160)
MH          14/24   58%        131/136   96%
HOII         9/24   37%        118/136   87%

Subjects who also had BMCA results (n = 108)
MH           7/16   44%         87/92    95%
HOII         4/16   25%         81/92    88%
BMCA         4/16   25%         82/92    89%

Table III. Comparison of two tests to predict pre-eclampsia.

Test        Sensitivity        Specificity

Outcome: any pre-eclampsia
IUK:Cr      42/63   67%        294/395   74%
AST         14/63   22%        334/395   85%

Outcome: proteinuric pre-eclampsia
IUK:Cr      16/20   80%        311/438   71%
AST          5/20   25%        368/438   84%

3. METHOD

We set up a simple display which weighs the clinical costs of false negatives and false positives. Economic costs of the two test procedures may alter greatly with the passage of time and are not readily incorporated. Define a loss function taking the value $c_1$ in the event of a false negative result, $c_2$ for a false positive, and 0 for a correctly classified subject. This may be regarded as representing misclassification regret; the loss associated with a true positive is set at zero, even though this is a serious outcome, because this result should lead to optimal management. For any condition sufficiently serious to warrant considering a screening programme, it is reasonable to postulate $c_1 > c_2 > 0$. Let $\rho$ denote the prevalence of the condition in any population to which application of test 1 or 2 is envisaged; this may be very different to $M/N$, the prevalence in the training set, and will often be low. Then, writing $\pi$ and $\phi$ for the true proportions among affected and unaffected subjects respectively (Table I), the expected loss due to imperfect performance of test T1 in relation to the gold standard G is

$$E_1 = \rho(1 - \pi_{1\cdot})c_1 + (1 - \rho)(1 - \phi_{2\cdot})c_2$$

The corresponding loss for test T2 is

$$E_2 = \rho(1 - \pi_{\cdot 1})c_1 + (1 - \rho)(1 - \phi_{\cdot 2})c_2$$

Then

$$E_2 - E_1 = \rho c_1 \theta_1 + (1 - \rho) c_2 \theta_2$$

Thus we consider a simple weighted mean of $\theta_1$ and $\theta_2$

$$f = \lambda\theta_1 + (1 - \lambda)\theta_2$$

where

$$\frac{\lambda}{1 - \lambda} = \frac{c_1 \rho}{c_2 (1 - \rho)}; \quad \text{that is} \quad \lambda = \frac{1}{1 + \dfrac{c_2 (1 - \rho)}{c_1 \rho}}$$
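To make the definition of the mixing parameter concrete, the sketch below computes $\lambda$ from a loss ratio and a prevalence, and checks numerically that $E_2 - E_1$ is simply $f$ rescaled by the total cost weight $\rho c_1 + (1-\rho)c_2$. The function names, the costs and the 2 per cent prevalence are assumptions made purely for illustration; the point estimates of $\theta_1$ and $\theta_2$ are taken from Table II, with MH assumed to be test 1 and HOII test 2.

```python
def mixing_parameter(c1, c2, prevalence):
    """lambda = 1 / (1 + c2 (1 - rho) / (c1 rho)), with rho the prevalence."""
    if c1 <= 0 or c2 < 0 or not 0 < prevalence <= 1:
        raise ValueError("need c1 > 0, c2 >= 0 and 0 < prevalence <= 1")
    return 1.0 / (1.0 + c2 * (1.0 - prevalence) / (c1 * prevalence))


def weighted_difference(lam, theta1, theta2):
    """f = lambda * theta1 + (1 - lambda) * theta2."""
    return lam * theta1 + (1.0 - lam) * theta2


def loss_difference(c1, c2, prevalence, theta1, theta2):
    """E2 - E1 = rho c1 theta1 + (1 - rho) c2 theta2."""
    return prevalence * c1 * theta1 + (1.0 - prevalence) * c2 * theta2


# Hypothetical inputs: a false negative is taken to be 10 times as serious
# as a false positive, and the target population has 2 per cent prevalence.
c1, c2, rho = 10.0, 1.0, 0.02

# Point estimates from Table II (160 subjects); MH assumed to be test 1.
theta1 = 14 / 24 - 9 / 24        # difference in sensitivity
theta2 = 131 / 136 - 118 / 136   # difference in specificity

lam = mixing_parameter(c1, c2, rho)
f = weighted_difference(lam, theta1, theta2)
scale = rho * c1 + (1.0 - rho) * c2
print(f"lambda = {lam:.4f}, f = {f:.4f}")
print(f"(E2 - E1) / scale = {loss_difference(c1, c2, rho, theta1, theta2) / scale:.4f}")
```

Because only the ratio $c_2/c_1$ enters $\lambda$, multiplying both costs by the same constant leaves $\lambda$ and $f$ unchanged.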

The mixing parameter $\lambda \in [0, 1]$ represents an appropriate weighting of sensitivity and specificity, taking into account both the relative seriousness of the two types of errors and the prevalence. It depends on and represents the balance between $c_2/c_1$ and $\rho$, enabling the person applying the findings to appraise the implications for the intended application. If the penalty associated with a false negative is the dominant consideration, then $c_2 \ll c_1$. As $c_2/c_1 \to 0$, with $\rho$ held non-zero, $\lambda \to 1$ and $f$ reduces to $\theta_1$. Conversely, if the prevalence is sufficiently low, the penalty associated with false positives may be the dominant consideration: as $\rho \to 0$, with $c_2/c_1$ held non-zero, $\lambda \to 0$ and $f$ reduces to $\theta_2$. The linear function $f$ eliminates dependence on the arbitrary unit of the cost scale, and appropriately expresses the comparison between tests 1 and 2 for any chosen $\lambda$, directly on a proportion scale close to that used for sensitivity and specificity themselves.

We then calculate $100(1-\alpha)$ per cent confidence intervals for $\theta_i$ by an appropriate method. References [8], [9] and [4] evaluate the properties of existing interval estimation methods for proportions and their differences, develop new improved methods for the unpaired and paired difference cases, and use extensive simulations to evaluate coverage properties. Briefly, the simplest confidence interval method for the single proportion, which imputes an empirical estimate of the asymptotic standard error, has deficient coverage and often yields an interval that does not make sense. Wilson’s simple modification [10] of letting the variance depend on the theoretical instead of the empirical value is a great improvement; it is a theoretically satisfactory ‘score’ method [11] which avoids aberrations, has favourable coverage properties [8, 12], and is highly tractable and flexible.

The deficiencies of the simple method for the single proportion largely carry over into similar deficiencies when a similar approach is used for differences, and by implication other linear functions; indeed, the first satisfactory methods for the paired difference case [13, 4] were only published in 1998. A simple squaring and adding approach can readily be used to combine two Wilson intervals to produce a good, simple interval method for the difference between two independent proportions (method 10 of reference [9]). This process is analogous to the result

$$\operatorname{var}(w_1 X_1 + w_2 X_2) = w_1^2 \operatorname{var} X_1 + w_2^2 \operatorname{var} X_2$$

for independent random variables $X_1$ and $X_2$, here with $w_1 = +1$ and $w_2 = -1$. Thus, suppose that $\bar{X}_1$ and $\bar{X}_2$ are means of independent random samples from Gaussian distributions with means $\mu_1$ and $\mu_2$ and common known variance $\sigma^2$. Then if $(l_i, u_i)$ is a $100(1-\alpha)$ per cent confidence interval for $\mu_i$ ($i = 1, 2$), the $100(1-\alpha)$ per cent confidence interval for $\mu_1 - \mu_2$ can be expressed as

$$\bar{X}_1 - \bar{X}_2 - \sqrt{(\bar{X}_1 - l_1)^2 + (u_2 - \bar{X}_2)^2} \quad \text{to} \quad \bar{X}_1 - \bar{X}_2 + \sqrt{(u_1 - \bar{X}_1)^2 + (\bar{X}_2 - l_2)^2}$$
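The squaring and adding construction above, applied to Wilson limits for two independent proportions, can be sketched in a few lines of Python. This is a minimal sketch of the unpaired case only: it uses the standard closed form of the Wilson score interval without continuity correction, and the function names and the counts 56/70 and 48/80 are illustrative assumptions, not data from this paper. The paired adjustment of reference [4] is not implemented here.

```python
import math
from statistics import NormalDist


def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval for a single proportion x/n (no continuity correction)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def unpaired_difference_interval(a, m, b, n, alpha=0.05):
    """Square-and-add interval for a/m - b/n from two independent samples."""
    p1, p2 = a / m, b / n
    l1, u1 = wilson_interval(a, m, alpha)
    l2, u2 = wilson_interval(b, n, alpha)
    d = p1 - p2
    # Lower limit pairs the lower limit of p1 with the upper limit of p2,
    # and vice versa for the upper limit, as in the Gaussian expression.
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


# Illustrative counts only (not from this paper).
lower, upper = unpaired_difference_interval(56, 70, 48, 80)
print(f"{lower:.4f} to {upper:.4f}")  # prints: 0.0524 to 0.3339
```

Note that the resulting interval is not symmetric about the point estimate 0.2, because the Wilson limits are themselves asymmetric.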

The interval for the difference between two proportions based on unpaired data, $a/m - b/n$, is obtained by using $a/m$ instead of $\bar{X}_1$ and $b/n$ instead of $\bar{X}_2$. This method then extends to the paired case by incorporating an adjustment for non-independence, yielding method 10 of reference [4]. These methods for the unpaired and paired difference cases are both shown to have favourable coverage properties and are free of aberrations.

Then, for $i = 1$ and 2, let $(L_i, U_i)$ denote a $100(1-\alpha)$ per cent confidence interval for $\theta_i$, calculated by method 10 of reference [4]. For any $\lambda \in [0, 1]$, we may apply the above process again, this time with $w_1 = \lambda$, $w_2 = 1 - \lambda$, to obtain $100(1-\alpha)$ per cent confidence limits for $f$ as

$$\hat{f} - \sqrt{\lambda^2 (\hat{\theta}_1 - L_1)^2 + (1 - \lambda)^2 (\hat{\theta}_2 - L_2)^2}$$

and

$$\hat{f} + \sqrt{\lambda^2 (U_1 - \hat{\theta}_1)^2 + (1 - \lambda)^2 (U_2 - \hat{\theta}_2)^2}$$

the positive value of each square root being taken. This reduces to $(L_1, U_1)$ at $\lambda = 1$ and $(L_2, U_2)$ at $\lambda = 0$.

A plot of $f$ together with these limits, for $\lambda$ ranging from 0 to 1, permits appraisal of the degree of preference for one test over the other, for any assumed prevalence and loss ratio. The reader can choose a value for $\lambda$ to reflect both the prevalence (whether an actual local value or an assumed one) and their perceived relative seriousness of the two types of errors, and can discern how altering either of these would affect the preference for one test over the other. It is easily shown that $\partial^2 L/\partial\lambda^2 < 0 < \partial^2 U/\partial\lambda^2$, so the region enclosed by the plot is always concave. Usually $M < N/2$, in which case the interval for the difference in sensitivity $\theta_1$ (plotted at $\lambda = 1$) is normally wider than the interval for the difference in specificity $\theta_2$ (plotted at $\lambda = 0$). All computations are of closed form. A simple Minitab 10.5 macro was used to produce Figures 1 to 4 as overlay plots, together with corresponding numerical output. Point and interval estimates for $\theta_1$ and $\theta_2$ are applied to $\lambda = 0, 0.01, 0.02, \ldots, 1$: the flexibility of the Minitab worksheet structure lends itself very well to this process.
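The grid calculation described above does not depend on Minitab. As a sketch under stated assumptions, the Python fragment below evaluates $\hat{f}$ and its limits at $\lambda = 0, 0.01, \ldots, 1$, taking the point estimates and the intervals $(L_i, U_i)$ as given inputs. The numerical inputs are hypothetical placeholders, not results from the paper, and the function name is likewise an assumption; in practice $(L_i, U_i)$ would come from method 10 of reference [4].

```python
import math


def f_interval(lam, theta1, ci1, theta2, ci2):
    """Point estimate and confidence limits for f at mixing parameter lam.

    ci1 = (L1, U1) and ci2 = (L2, U2) are intervals for theta1 and theta2.
    """
    lower1, upper1 = ci1
    lower2, upper2 = ci2
    w1, w2 = lam, 1.0 - lam
    f_hat = w1 * theta1 + w2 * theta2
    lower = f_hat - math.sqrt((w1 * (theta1 - lower1)) ** 2
                              + (w2 * (theta2 - lower2)) ** 2)
    upper = f_hat + math.sqrt((w1 * (upper1 - theta1)) ** 2
                              + (w2 * (upper2 - theta2)) ** 2)
    return f_hat, lower, upper


# Hypothetical estimates and intervals, for illustration only.
theta1, ci1 = 0.20, (-0.05, 0.42)   # difference in sensitivity
theta2, ci2 = 0.10, (0.03, 0.17)    # difference in specificity

grid = [i / 100 for i in range(101)]
curve = [(lam, *f_interval(lam, theta1, ci1, theta2, ci2)) for lam in grid]

# Values of lambda for which the whole interval lies above zero.
above_zero = [lam for lam, _, lower, _ in curve if lower > 0]
print(f"lower limit exceeds 0 for lambda up to {max(above_zero):.2f}")
```

Each row of `curve` holds $(\lambda, \hat{f}, \text{lower}, \text{upper})$, which is all that an overlay plot of the three curves requires.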


Figure 1. Plot of $f$ by $\lambda$ comparing faecal occult blood tests MH and HOII.

Figure 2. Plot of $f$ by $\lambda$ comparing faecal occult blood tests MH and BMCA.

Figure 3. Plot of $f$ by $\lambda$ comparing faecal occult blood tests HOII and BMCA.

4. RESULTS

For the faecal occult blood test data (Table II), Figures 1 to 3 are plots of $f$ against $\lambda$ comparing each pair of tests in turn, using a $1 - \alpha = 0.95$ confidence level. Figure 1 shows a statistically significant superiority in specificity ($\lambda = 0$) of MH over HOII, though not in sensitivity ($\lambda = 1$), even though $\hat{\theta}_1 > \hat{\theta}_2$. The interval for $\theta_2$ is narrower than that for $\theta_1$, because $M/N$ is well below 0.5, and moreover the estimated specificities $\hat{\phi}_{2\cdot}$ and $\hat{\phi}_{\cdot 2}$ are relatively close to 1. Because of concavity, the whole of the lower limit curve lies above 0 precisely if both $L_1$ and $L_2$ do. This would be a rather stringent criterion for superiority, arising with frequency $(\alpha/2)^2$ when in reality $\theta_1 = \theta_2 = 0$ (to which should be added a further $(\alpha/2)^2$ chance of obtaining $U_1 < 0$ and $U_2 < 0$). The pattern shown in Figure 1 indicates considerable evidence that MH outperforms HOII; for any $\lambda \le 0.55$, the entire interval lies above 0.

Figure 2 shows there is less firm evidence for superiority of MH over BMCA, based on the reduced sample size. There is no $\lambda$ for which the confidence interval for $f$ excludes negative values, though the lower limit comes very close to 0.

In Figure 3, HOII and BMCA apparently have nearly identical sensitivity and specificity, but $f$ is subject to considerable imprecision, particularly at high $\lambda$. Using rejection of both $f = +0.1$ and $f = -0.1$ at a one-sided 0.025 level as a criterion of equivalence, then only for $0.001 \le \lambda \le 0.23$ can the results be regarded as demonstrating equivalence.

Figure 4. Plot of $f$ by $\lambda$ comparing the inactive urinary kallikrein:creatinine ratio and the angiotensin sensitivity test for prediction of pre-eclampsia.

For the prediction of pre-eclampsia, Figure 4 shows that the much greater sensitivity of IUK:Cr (Table III) outweighs the lower specificity provided $\lambda > 0.29$, though for very low $\lambda \le 0.11$, it is inferior to AST. A very similar plot is obtained when attention is restricted to proteinuric pre-eclampsia [14].
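The thresholds quoted for Figure 4 concern the confidence limits, which require the paired cell counts and so cannot be recomputed from Table III alone. The point estimate can be checked, however. The sketch below, which assumes IUK:Cr is test 1 and AST is test 2, uses the ‘any pre-eclampsia’ rows of Table III to find the value of $\lambda$ at which $\hat{f}$ changes sign; one would expect it to fall between the two quoted thresholds, 0.11 and 0.29.

```python
from fractions import Fraction

# Table III, outcome "any pre-eclampsia"; IUK:Cr assumed to be test 1.
theta1 = Fraction(42, 63) - Fraction(14, 63)      # difference in sensitivity
theta2 = Fraction(294, 395) - Fraction(334, 395)  # difference in specificity


def f_hat(lam):
    """Point estimate of f = lam * theta1 + (1 - lam) * theta2."""
    return lam * theta1 + (1 - lam) * theta2


# f_hat is linear in lambda, so its single zero solves
# lam * theta1 + (1 - lam) * theta2 = 0.
lam_zero = -theta2 / (theta1 - theta2)

print(f"theta1 = {float(theta1):.4f}, theta2 = {float(theta2):.4f}")
print(f"f_hat changes sign at lambda = {float(lam_zero):.4f}")
```

Exact rational arithmetic is used here only to avoid rounding questions in such a small check; ordinary floats would serve equally well.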

5. DISCUSSION

A naive approach to the comparison of two tests focuses on the overall accuracy, $(a_{1\cdot} + b_{2\cdot})/N$ for test 1, $(a_{\cdot 1} + b_{\cdot 2})/N$ for test 2. This approach is flawed on two counts. It fails to incorporate weightings for the relative seriousness of the two types of errors and the prevalence, and it presents both tests in an unduly favourable light: by failing to take into account the degree of agreement expected by chance, it incurs the problems which (in the rather different context of test reliability) Scott’s $\pi$ [15] and Cohen’s $\kappa$ [16] were designed to obviate.

Consideration of the relative merits of two tests tends to draw attention away from the nature of the ‘gold standard’ criterion used. This is one that is intended to be definitive, though too expensive or invasive for general use; in the extreme case, the information may only become available post mortem. A test procedure such as histopathology that is regarded as a gold standard may nevertheless not be perfectly reproducible, though the implications of this are not pursued further here.
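Returning to the overall accuracy measure criticised at the start of this section, a short check makes the first flaw explicit. Since $a_{1\cdot} - a_{\cdot 1} = M\hat{\theta}_1$ and $b_{2\cdot} - b_{\cdot 2} = (N - M)\hat{\theta}_2$, the difference in overall accuracy equals $\hat{f}$ evaluated at $\lambda = M/N$, that is, with equal costs and the training-set prevalence. The sketch below verifies this with the MH and HOII counts from Table II (160 subjects), assuming MH is test 1; the variable names are illustrative.

```python
from fractions import Fraction

# Table II, 160 subjects with both MH (test 1, assumed) and HOII (test 2).
M, N = 24, 160                 # affected subjects, all subjects
true_pos_1, true_neg_1 = 14, 131
true_pos_2, true_neg_2 = 9, 118

accuracy_1 = Fraction(true_pos_1 + true_neg_1, N)
accuracy_2 = Fraction(true_pos_2 + true_neg_2, N)

theta1 = Fraction(true_pos_1 - true_pos_2, M)        # difference in sensitivity
theta2 = Fraction(true_neg_1 - true_neg_2, N - M)    # difference in specificity

# Overall accuracy implicitly fixes lambda at the training-set prevalence.
lam = Fraction(M, N)
f_at_training_prevalence = lam * theta1 + (1 - lam) * theta2

print(float(accuracy_1 - accuracy_2), float(f_at_training_prevalence))
```

Both printed values are 0.1125, so comparing overall accuracy amounts to reading the plot of $f$ at the single point $\lambda = 0.15$.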


To compare two proposed tests T1 and T2 in relation to a gold standard G, two study designs are available. In the unpaired design $N_1$ subjects are given tests T1 and G, and $N_2$ are given tests T2 and G. In the paired design, as here, $N$ subjects undergo all three tests. Either an unselected series of $N$ subjects is studied, or else $M$ subjects who are known to be affected and $N - M$ who are known not to be affected are studied. The paired design has the obvious advantage that tests T1 and T2 are compared directly on the same subjects, and conversely the unpaired design is profligate in its use of G. An unpaired comparison is particularly unsatisfactory if retrospective in nature, as test G may well be applied quite differently in two independently performed studies. The unpaired design can only be recommended if for some reason tests T1 and T2 cannot be applied to the same individual or unit.

We also assume tests T1 and T2 are applied independently. In particular, the situation in which tests T1 and T2 are obtained by dichotomizing a quantitative or ordinal scale at two different points is a quite different issue. In that situation, the trade-off between sensitivity and specificity may be expressed by the receiver operating characteristic (ROC) curve, the area under which (AUROC) is linearly related to the familiar Mann–Whitney U statistic. Mee [17] has developed an interval estimate which is an improvement on that widely used in the decision-making literature [18, 19]. This application of AUROC methods usually fails to incorporate weightings such as those represented by $\lambda$ and $1 - \lambda$.

Hitherto it has only been possible to set a comparison of specificity alongside a comparison of sensitivity. The method presented here goes beyond this and produces an intelligible display linking the two, and permitting interpretation in terms of both statistical significance, for either $H_0: f = 0$ or $H_0: f = \delta \ne 0$, and clinical importance.

ACKNOWLEDGEMENTS

I thank Dr R.L. Hope and Miss Philippa Kyle for supplying original data, and Professor Alan Agresti, Dr Barry Nix, Ms Nicole Blackman and Dr Paul Seed for helpful comments.

REFERENCES

1. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947; 12:153–157.

2. Lancaster HO. The combination of probabilities arising from data in discrete distributions. Biometrika 1949; 36:370–382.

3. Stone M. The role of significance testing. Some data with a message. Biometrika 1969; 56:485–493.

4. Newcombe RG. Improved confidence intervals for the difference between binomial proportions based on paired data. Statistics in Medicine 1998; 17:2635–2650.

5. Lu Y, Bean JA. On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Statistics in Medicine 1995; 14:1831–1839.

6. Hope RL, Chu G, Hope AH, Newcombe RG, Gillespie PE, Williams SJ. A comparison of three faecal occult blood tests in the detection of colorectal neoplasia. Gut 1996; 39:722–725.

7. Kyle PM, Campbell S, Buckley D, Kissane J, de Swiet M, Albano J, Millar JG, Redman CWG. A comparison of the inactive urinary kallikrein:creatinine ratio and the angiotensin sensitivity test for the prediction of pre-eclampsia. British Journal of Obstetrics and Gynaecology 1996; 103:981–987.

8. Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 1998; 17:857–872.

9. Newcombe RG. Interval estimation for the difference between independent proportions. A comparative evaluation of eleven methods. Statistics in Medicine 1998; 17:873–890.

10. Wilson EB. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 1927; 22:209–212.

11. Cox DR, Hinkley DV. Theoretical Statistics. Chapman and Hall: London, 1974.


12. Agresti A, Coull BA. Approximate is better than ‘exact’ for interval estimation of binomial proportions. American Statistician 1998; 52:119–126.

13. Tango T. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine 1998; 17:891–908.

14. Kyle PM, Redman CWG, de Swiet M, Millar JG. A comparison of the inactive urinary kallikrein:creatinine ratio and the angiotensin sensitivity test for the prediction of pre-eclampsia. British Journal of Obstetrics and Gynaecology 1997; 104:971.

15. Scott WA. Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly 1955; 19:321–325.

16. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20:37–46.

17. Mee RW. Confidence intervals for probabilities and tolerance regions based on a generalisation of the Mann–Whitney statistic. Journal of the American Statistical Association 1990; 85:793–800.

18. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.

19. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983; 148:839–843.
