Chemometrics and Intelligent Laboratory Systems 25 (1994) 313-323

Comparing the predictive accuracy of models using a simple randomization test

Hilko van der Voet, Agricultural Mathematics Group (GLW-DLO), P.O. Box 100, 6700 AC Wageningen, The Netherlands

Received 23 March 1994; accepted 16 August 1994

Abstract

A simple randomization t-test is proposed for testing the equality of performance of two prediction methods. The application of the test is shown to prevent unjustified conclusions about method superiority. Previous approaches to the problem of comparing predictive methods are discussed, and the proposed test is compared to other tests for paired data in a small simulation study. It is shown that the test can also be applied for classification problems where the predicted entity is qualitative rather than quantitative.

1. Introduction

A primary purpose of a model is to predict certain traits of interest in the modelled system. This applies equally well to statistical models, e.g. a partial least squares model predicting moisture in cheese from near-infrared spectra, as to mechanistic models, e.g. complex dynamic crop models predicting maize yield in relation to climatic conditions. It is even true of the simplest model of all, the mean of a set of measurements. This is often interpreted, albeit implicitly, as the value to be expected for future observations under the same circumstances.

Accuracy of predictions is therefore a central theme when comparing different models for the same situation. Such models may differ because they use different data for input, or they may use the same data but differ radically in model structure, or they may be just minor variations within the same model family, for example models with a different number of components in partial least squares regression.

The predictive ability of any model can be judged from the distribution of prediction errors obtained when the model is used to predict the response of independent cases. One characteristic of this distribution, the mean squared error of prediction (MSEP), is often used as a simple criterion for the predictive ability of a model. If y denotes the trait of interest to be predicted by the model, and ŷ the prediction from the model, then MSEP is defined by

MSEP = E(y - \hat{y})^2    (1)

where E denotes the expectation over a target population of individual cases. In this paper the situation is considered where a representative and independent sample of size n from this target population is available for evaluation, that is both reference values y_i and predictions ŷ_i are known for i = 1, ..., n. MSEP is then estimated by

\widehat{MSEP} = (1/n) \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2)
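As a minimal illustration, the estimate in Eq. (2) amounts to a few lines of Matlab; y and yhat are hypothetical vectors holding the reference values and the predictions for the evaluation cases:

% y    : reference values of the n evaluation cases (hypothetical example data)
% yhat : corresponding predictions from the model under study
y       = [0.10; 0.15; 0.12; 0.20; 0.18];
yhat    = [0.11; 0.13; 0.15; 0.19; 0.16];
MSEPhat = mean((y - yhat).^2)    % Eq. (2): mean squared prediction error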



The problem addressed in this paper is the comparison of two models using the distribution of prediction errors as a measure of predictive accuracy. The need for formal tests to make such comparisons is illustrated with some examples. A randomization t-test based on Monte Carlo simulations with the available evaluation data is proposed to test the significance of differences between these distributions. This test is compared with previous approaches to the problem of comparing predictive accuracies, and with other general tests for paired data.

The common practice of calculating MSEP for a score of models and then just selecting the model with the lowest value has been called derogatorily the 'European song contest method' [1]. In this spirit the present paper is concerned with the question: "May we have the votes from Monte Carlo, please?".

2. Method

Randomization tests (or permutation tests, as they are sometimes called) are an old subject in statistics [2], but were not very popular until recently because of their computing-intensive nature. Edgington [3] and Manly [4] have provided readable accounts of these methods. In this section I describe the application of randomization tests to the prediction-method-comparison problem. After introducing notation some general points are discussed first: the need for an exchangeable evaluation set, the null hypothesis to be tested, and the interpretation of non-significant and significant results. Then the test procedure is explained and also described in an algorithmic fashion.

Let y_i be the reference value of case i in an evaluation set S (i = 1, ..., n), and let ŷ_Ai and ŷ_Bi be predictions for y_i from two competing models A and B, respectively. Prediction errors are defined by

e_{Ai} = y_i - \hat{y}_{Ai}, \quad e_{Bi} = y_i - \hat{y}_{Bi}    (3)

and the predictive accuracy of both models for the target population is estimated as

\widehat{MSEP}_A = (1/n) \sum_{i=1}^{n} e_{Ai}^2    (4a)

\widehat{MSEP}_B = (1/n) \sum_{i=1}^{n} e_{Bi}^2    (4b)

A comparison between models A and B will be made using the difference in M̂SEP as a test statistic. This statistic may also be calculated directly from the differences per case

d_i = e_{Ai}^2 - e_{Bi}^2    (5)

as

T = \widehat{MSEP}_A - \widehat{MSEP}_B = \sum_i d_i / n = \bar{d}    (6)

The composition of evaluation set S deserves special attention. It would be preferable if the individual cases in S were independently drawn from the target population. However, this is often not realized in practice. One should then be prepared to assume that all cases in S and any new case drawn independently from the target population are exchangeable in the following sense: if one was somehow given this set of n + 1 prediction errors without the corresponding case identifications, then each permutation of case identifications would be regarded as equally probable [5]. This subjectivistic assumption is non-trivial: it excludes for example the possibility of selecting cases for the evaluation set just because it is known or expected that they have lower (or higher) prediction errors. Note that the exchangeability assumption also excludes the use of training set cases for evaluating the predictive model (training set cases generally are expected to have lower prediction errors). In all cases where the exchangeability assumption is not met, the results of evaluation are valid only for the set S itself, and any generalization to a target population must be based on non-statistical arguments.

We may think of several hypotheses to test. Intuitively, H_0: MSEP_A = MSEP_B or, equivalently, H_0: E(d) = 0, might seem attractive as a null hypothesis. However, this hypothesis cannot be tested by a randomization test because the logic of the test procedure (see below) needs not only zero expectation for d, but also a symmetric distribution. Under this more stringent null hypothesis of a symmetric distribution around zero each value of |d_i| is derived from d_i or -d_i with equal probability, and this is the basis for the test procedure outlined below. This null hypothesis may sound rather technical, but happily it is automatically entailed by the more natural H_0: squared prediction errors from both models have equal distributions.

It may be objected that neither of the null hypotheses, be it equal MSEP or equal distribution of squared errors, represents a state of affairs to be expected in reality when two predictive models are compared. In general, equality only occurs by special construction. For example, predicting the outcome of a series of fair coin tosses by method A (always predict head, or notationally 1) and method B (predict by an alternating sequence head-tail-head-tail-..., or 101010...) is an example of a situation where the null hypothesis of a symmetric distribution around zero is actually true. For an example of a construction where d has exactly zero expectation but has an asymmetric distribution see the type 2 simulations below. In practice, however, we are not interested in such artificial examples, and strict equality of the true MSEP is unlikely. Comparing predictive methods resembles in this respect the ordinary procedure of comparing varieties in agricultural research. Here as well the true difference in yield of two varieties will almost certainly be non-zero. This does not diminish the usefulness of statistical testing, as the purpose of any statistical test is to compare models of reality with each other, usually a simple model that states equality of certain features against a more complex model. Without evidence to the contrary we prefer the simple model (e.g. equal predictive ability for two methods of prediction), also when we know perfectly well that the world is more complex. When a comparison of two predictive models does not show a significant difference we conclude therefore that the results could have been generated by models having equal MSEP or equal distribution of prediction errors, not that they actually are generated in this way.

The additional constraint of symmetry for the distribution of the case scores d was also used in the first published randomization test, by Fisher [2]. Therefore this type of test is now commonly called Fisher's randomization test (see e.g. Ref. [3], pp. 18 and 19, or Ref. [4], pp. 18 and 43-49). The hypothesis implies that both models have equal mean predictive accuracy (MSEP_A = MSEP_B), but also that the probability for gross errors is equal for both methods. The alternative hypothesis therefore includes different kinds of situations. Apart from the usual situation MSEP_A > MSEP_B or MSEP_A < MSEP_B, we may also have MSEP_A = MSEP_B with for example a symmetric distribution for e_Ai and a skewed distribution for e_Bi. Therefore, rejection of the null hypothesis of equal distributions should be followed by a critical examination of the type of departure. Graphical methods are well suited for this, for instance a scatter plot of the points (e_Ai, e_Bi), or plots of the points (y_i, e_Ai) and (y_i, e_Bi). If distributions of prediction errors are very unequal, but MSEP values do not differ much, then it remains to be decided on non-statistical grounds which prediction method should be preferred in practice, the one that avoids gross errors, or the one with usually better performance but with incidental gross errors.

With a properly constructed evaluation set S the randomization test for the hypothesis that the squared prediction errors e_Ai² and e_Bi² come from equal distributions now proceeds as follows. Under the null hypothesis the labelling of prediction errors for each case i with A or B is immaterial, and this provides the basis for the randomization procedure. In principle 2^n randomization sets can be generated from the actual data by interchanging labels A and B for some cases. A suitable test statistic T is calculated for all sets. Considering differences in M̂SEP as the most relevant alternative to be detected, it is sensible to construct a test statistic which is sensitive to these deviations. The one-sample t-test statistic applied to the differences d_i = e_Ai² - e_Bi² is therefore a sensible choice, because the mean difference d̄ = Σd_i/n is equal to the difference of the M̂SEP values. This statistic can be written as

T = \bar{d} / \sqrt{\mathrm{var}(d_i)/n}    (7)

Consequently the test proposed here is often called a randomization t-test. As usual with randomization tests many equivalent test statistics exist, e.g. statistic (6) already introduced above, or T = M̂SEP_A/M̂SEP_B. The statistic (6) is convenient numerically (see below) and will be used here, whereas the latter choice mimics in its form the F-statistic for the comparison of variances in two independent samples. For any choice of T the calculations give an empirical distribution of T values. Under the null hypothesis the actual realization of the test statistic in the evaluation data (T_obs) is not expected to be extreme in the empirical distribution, whereas a properly chosen test statistic will attain extreme values under the alternative. In practice the number 2^n is often too large for actual computation, but a not too small random sample of randomization trials will give a sufficiently good approximation of the randomization distribution of T. For a one-sided alternative hypothesis MSEP_A > MSEP_B the test therefore may proceed as follows (see also the Appendix):


1. Calculate d_i = e_Ai² - e_Bi², i = 1, ..., n.
2. Compute T = M̂SEP_A - M̂SEP_B = d̄ = (1/n) Σd_i for the actual evaluation data (T_obs).
3. Iterate steps 3a-3b m times:
   a. Attach random signs to d_i, i = 1, ..., n.
   b. Calculate T = d̄.
4. Calculate the significance level p = k/(m + 1), where m is the number of randomization trials, and k is the rank of T_obs among the randomization values of T when ranked from high to low.

For a one-sided alternative hypothesis MSEP_A > MSEP_B, p is therefore the empirical upper tail probability in the randomization distribution of T, i.e. the fraction of trials with T ≥ T_obs, where the trial using the actual evaluation data is included in both numerator and denominator. For a test sensitive to the alternative MSEP_A < MSEP_B the ranking should be from low to high, and for a test sensitive to the two-sided alternative MSEP_A ≠ MSEP_B the test statistic should be adapted to |T| for T = M̂SEP_A - M̂SEP_B, and to max(T, T⁻¹) for T = M̂SEP_A/M̂SEP_B. Note that the significance level from a randomization test can never be lower than 1/(m + 1). For practical use, where a significance level of 0.05 is often used for interpretation, m = 19 is therefore the minimum number of randomization trials needed, but m = 99 or m = 199 is then a more reasonable choice, permitting p values down to p = 0.01 or 0.005. Jöckel [6] suggests that with m = 199 the power is at least 85% of the power of the exact test, i.e. the test when all 2^n permutations would be used.
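A sketch of steps 1-4 in Matlab, assuming eA and eB are vectors with the prediction errors of methods A and B on the evaluation cases (the two-sided variant is given in the Appendix):

% eA, eB : prediction errors of methods A and B for the n evaluation cases
d    = eA.^2 - eB.^2;                % step 1: differences of squared errors
Tobs = mean(d);                      % step 2: observed statistic, Eq. (6)
m    = 199;                          % number of randomization trials
k    = 1;                            % T_obs itself is included in the rank
for j = 1:m
    s = 2*round(rand(size(d))) - 1;  % step 3a: attach random signs
    T = mean(s .* d);                % step 3b: statistic for this randomization
    k = k + (T >= Tobs);             % count trials at least as extreme as T_obs
end
p = k/(m + 1)                        % step 4: one-sided significance level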

The implementation of the proposed randomization test is simple in many programming environments. A description in Matlab code [7] is given in the Appendix. For this paper the statistical program Genstat [8] was used.

3. Comparison with previous approaches

Wallach and Goffinet [9] end their paper about model evaluation using MSEP (which they write with the estimated model parameters as an explicit argument, to stress that predictive accuracy is conditional on these estimates) by remarking that the results presented do not make it possible to test whether two models have significantly different values of MSEP, and that it would obviously be of interest to define such tests. In one of their applications [10] they come close to a solution when using bootstrap procedures to apply a bias correction for MSEP estimates based on the same data as the model. Bootstrap procedures, like randomization procedures, depend on the independence and exchangeability of the individual evaluation cases. However, they stop short of presenting a test statistic and give only a point estimate of the (bias adjusted) difference in MSEP between two models.

In chemometrics the problem of comparing the predictive accuracy of models has received attention in the context of selecting the number of components in partial least squares (PLS) models. Osten [11] claims that the 'absolute minimum in PRESS' criterion performs poorer than several other criteria for selecting the number of components. The poorer predictive ability is shown in his Tables 2(b) and 3(b), where he presents standard errors of estimate for an evaluation set (n = 11) together with a variability measure obtained by bootstrapping the training set (n = 23): e.g. 4.5 ± 3.1 for the 'absolute minimum in PRESS' criterion against 3.4 ± 1.7 for the best alternative criterion. The large variabilities seem to question the significance of Osten's conclusion with respect to predictive ability, and a formal test to compare the alternative criteria seems useful.

Haaland and Thomas [12] proposed an F-statistic for comparing the predictive abilities of two calibration methods. For the situation where no independent evaluation data are available they suggested PRESS_A/PRESS_B as a criterion, to be compared with an F-distribution with n and n degrees of freedom, where PRESS is the leave-one-out cross-validation version of the summed squared prediction errors Σ(y_i - ŷ_i)². However, they admit that in practice the assumptions of the test are not met, especially the assumption of independence of prediction errors between models. In the likely event that prediction errors are positively correlated between methods, the F-test is conservative. This is equally true for the corresponding criterion M̂SEP_A/M̂SEP_B in the situation where an independent evaluation set is available.
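For concreteness, the approximate F-test on the ratio of M̂SEP values (as used later for Table 1) might be coded as below; this sketch assumes the Matlab Statistics Toolbox function fcdf is available, and MSEPA, MSEPB and n are illustrative variable names:

% MSEPA, MSEPB : estimated MSEP of the two methods; n : number of evaluation cases
F = max(MSEPA, MSEPB)/min(MSEPA, MSEPB);  % ratio of larger to smaller MSEP estimate
p = 2*(1 - fcdf(F, n, n))                 % approximate two-sided p value (F with n and n df)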

Several authors (Wold [13], Eastment and Krzanowski [14]) have constructed criteria to guide the choice of the optimal number of components in principal component analysis. Osten [11] adapted the criterion of Eastment and Krzanowski for latent variable regression techniques. This criterion can be written as [(PRESS_k - PRESS_{k+1})/(df_k - df_{k+1})] / [PRESS_{k+1}/df_{k+1}], where ordinarily df_k = (n - k)·p and p is the number of parameters fitted per dimension. However, unlike Wold and Eastment and Krzanowski, Osten proposes the use of an F-distribution for significance testing. No justification is given, and the falsity of this distributional assumption is easily seen from the fact that his 'F-to-test' becomes negative whenever an additional component increases PRESS. Again, this conclusion carries over to the analogous situation where prediction errors are estimated from independent evaluation data.

Wakeling and Morris [15] recognized that "most of the methods which have been used to date are somewhat arbitrary and data specific", and identified the need for a more formal test. They proposed the cross-validated r-squared, r²_cv = 1 - PRESS/Σ_{i=1}^{n}(y_i - ȳ)², as a criterion, and derived tables of critical values from extensive Monte Carlo simulations using matrices of mean centred uniform random numbers. In the presence of correlations between the predicting variables the test is conservative. Wakeling and Morris also consider a randomization test, where empirical distributions of r²_cv for several numbers of components k are obtained by repeatedly permuting the values of y. Note that all tests of Wakeling and Morris test the null hypothesis that a specified model has no predictive ability, against the alternative that it does. They are therefore not tests for the comparison of competing models.

4. Comparison with other general tests for paired data

The proposed randomization t-test is a general distribution-free test for the equality of two distributions using paired data. It does not consider the nature of the terms in the differences d_i = e_Ai² - e_Bi² (squared prediction errors). Therefore it may be asked whether other general tests for paired data can be used with equal success for MSEP comparisons. Three such alternative tests are considered here. First, the parametric one-sample t-test may be used. This test assumes a normal distribution for the d_i, but is known to be robust against violations of this assumption. Asymptotically, the t-test is even distribution-free, provided only that the influence of each individual case tends to zero. Secondly, an alternative based on the ranking of the d_i is Wilcoxon's signed-rank test. This test assumes symmetric distributions with mean zero for the d_i, and tests against the alternative that these distributions are slanted towards either positive or negative values [16]. Note that the assumption of symmetry for d_i is automatically fulfilled under the null hypothesis, where e_Ai and e_Bi have equal though possibly asymmetric distributions. A third alternative might be the sign test, which can be used to test the same hypothesis as Wilcoxon's signed-rank test, but uses only the signs of the d_i. Alternatively, the sign test can be seen as a test of the hypothesis P(|e_Ai| > |e_Bi|) = 0.5. Lehmann [16] compared these three tests, and concludes that in case of normality the efficiency loss of the sign test relative to the t-test may be considerable, whereas the efficiency of the Wilcoxon signed-rank test relative to the t-test is typically about 0.95. In the case of non-normal distributions the latter relative efficiency may be considerably above 1, at least asymptotically. Gross errors (statistically best modelled as the likely outcomes of heavy-tailed distributions) seem to be the most important reason for a low efficiency of the t-test. For chemical data heavy-tailed distributions are found quite generally [17].
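If the Matlab Statistics Toolbox is available, the three alternatives can be applied directly to the differences of squared prediction errors; a brief sketch, with d defined as in Eq. (5):

% d : differences of squared prediction errors, d_i = e_Ai^2 - e_Bi^2
[h, p_t] = ttest(d);   % parametric one-sample t-test of E(d) = 0
p_w = signrank(d);     % Wilcoxon signed-rank test
p_s = signtest(d);     % sign test (uses only the signs of the d_i)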

5. Examples of application

The comparison of the prediction accuracy of three methods will illustrate the usefulness of the test approach. The three methods have been used to predict the caffeine content from NIR data for a data set consisting of 134 samples of coffee, some decaffeinated, others prepared by mixing decaffeinated and ordinary coffee. All samples were analyzed by a reference method and by near-infrared spectroscopy. A random selection of 45 coffee samples was set aside to serve as evaluation data. The remaining 89 coffee samples were used to obtain a predictive model using three calibration methods: MLR1: multiple linear regression using four wavelengths in the NIR spectrum (2260, 2220, 2300 and 1388 nm); MLR2: multiple linear regression after deletion of six training samples in order to improve the correlation, and using four wavelengths (2260, 2212, 2300 and 1388 nm); PLS: partial least squares regression using the full spectrum (1100-2500 nm) sampled at 4 nm intervals. One PLS component was used in the predictive model. All MLR and PLS computations were performed using the NSAS software package [18] and involved some NIR expert judgment which need not concern us here.


Table 1
Example: application of the proposed randomization test on the coffee data

Prediction   M̂SEP      Significance values for comparisons with MLR1, obtained by:
method                  t         rand.-t   Wilcoxon   sign      appr.-F
MLR1         0.00151    -         -         -          -         -
PLS          0.00243    <0.001    0.005     <0.001     <0.001    0.06
MLR2         0.00384    0.25      0.36      0.40       0.76      0.001

All tests are two-sided. Randomization t-test: 199 trials. Approximate F-test: test statistic is ratio of larger to smaller M̂SEP.

Table 1 lists the results obtained on the evaluation data. On first inspection of the M̂SEP values MLR1 may seem to perform much better than MLR2, with PLS in an intermediate position. According to the approximate F-test on the ratio of M̂SEP values the difference between MLR1 and MLR2 is highly significant (p = 0.001), whereas the difference between PLS and MLR1 is only indicative (p = 0.06). It should be borne in mind that the assumptions for the F-test are not fulfilled.

Indeed, the other tests reveal a quite different view. The significance pattern conflicts with intuitive expectations: MLR1 (lowest M̂SEP) differs very significantly from PLS (intermediate M̂SEP), but no significant difference with MLR2 (highest M̂SEP) is found. In this case the explanation for this behaviour of the test is easily seen from the distribution of the absolute prediction errors (Table 2). The high M̂SEP value of MLR2 is caused for a large part by two extremely bad predictions. Without these two cases the M̂SEP for MLR2 is 0.00142, which is even lower than the corresponding value 0.00151 for MLR1. Clearly, high prediction errors for only two observations cannot account for a significant difference in prediction accuracy, which explains the non-significant result of the sign and randomization t-test. In the comparison between MLR1 and PLS the difference seems to be caused by a much larger part of the observations. Indeed, in the large majority of the evaluation cases (36 out of 45) the absolute prediction error of PLS was larger than for MLR1, which is in line with the obtained significance in the sign and randomization t-test.

The general message from this example is of course that the inspection of a list of M̂SEP values may easily lead to very misleading conclusions about prediction performance and method comparison. The use of a proper test may help to discern repeatable superiority from incidental differences.

A second example of the application of the proposed test is in selecting the dimensionality of a PLS model. The data set consists of 174 samples of peas, analyzed by a reference method for the percentage of alcohol-insoluble matter and by near-infrared spectroscopy. PLS models with 0, 1, ..., 10 components were calibrated using the statistical program Genstat [8] on a random subset of 116 samples, and these 11 models were used to predict the remaining 58 evaluation samples. The minimal M̂SEP was obtained using nine components, with local minima at zero and seven components (see Table 3).

Table 2
Distribution of absolute prediction errors for the 45 evaluation samples in the coffee data example

Method   Magnitude of absolute prediction error
         0-0.05   0.05-0.10   0.10-0.15   0.15-0.20   0.20-0.25   0.25-0.30
MLR1     37       8           0           0           0           0
PLS      26       19          0           0           0           0
MLR2     35       8           0           1           0           1


Table 3
Example: application of the proposed randomization test on the pea data for selecting PLS model dimensionality

Components   0      1      2      3      4      5      6      7      8      9      10
M̂SEP         9.22   9.89   9.47   7.83   6.68   2.66   1.78   1.56   1.59   1.53   1.64
p value      0.005  0.005  0.005  0.005  0.005  0.015  0.190  0.780  0.670  -      0.510

Two-sided tests, 199 iterations. p values are for comparisons with the nine-component PLS model.

Application of the two-sided randomization test showed no significant differences between the nine-component model and the models using six to eight components, whereas models with zero to five components turned out to have significantly worse predictive ability. Therefore, parsimony might suggest the use of a PLS model with six components.
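A hedged sketch of this model-selection use: if the squared prediction errors of the candidate models on the evaluation set are stored column-wise in a matrix E2 (a hypothetical layout, one column per number of components), each candidate can be compared with the minimum-M̂SEP model by the two-sided randomization test:

% E2(i,k) : squared prediction error of evaluation case i for candidate model k
[~, ref] = min(mean(E2));                 % candidate with minimal MSEP estimate
m = 199;
p = ones(1, size(E2, 2));
for k = 1:size(E2, 2)
    d    = E2(:, k) - E2(:, ref);         % paired differences with the reference model
    Tobs = mean(d);
    cnt  = 1;
    for j = 1:m
        s   = 2*round(rand(size(d))) - 1;
        cnt = cnt + (abs(mean(s .* d)) >= abs(Tobs));   % two-sided count
    end
    p(k) = cnt/(m + 1);                   % p value for candidate k (p = 1 for the reference itself)
end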

6. Simulations

A small simulation study was performed to compare the four tests for paired data in the context of MSEP comparison. Two types of simulation have been performed. In the first type predictions involved a random element, such as would be the case when different sets of chemical measurements, possibly with unequal amounts of error, were to be used for prediction using the same statistical method. In the second type of simulation two different prediction methods were used on the same simulated data.

Reference values for calibration, y_i, i = 1, ..., n, and for evaluation, y_i, i = n + 1, ..., 2n, were generated using the model

y_i = \beta x_i + \epsilon_i    (8)

where the design points x_i were equal for both sets, and were chosen uniformly on the interval [0, n] for type 1 simulations, or containing some outlying x values for type 2 simulations. The random components ε_i were independently drawn from N(0, σ²), i.e. a normal distribution with mean 0 and variance σ².

In simulations of type 1 the x_i were assumed to be unobservable. Two simple prediction methods were calibrated on the first set (i = 1, ..., n) using simple linear regression of y_i on observables z_Ai and z_Bi, respectively. Values for the observable z_Ai were generated by adding random deviations to the unobservable x_i:

z_{Ai} = x_i + \delta_{Ai}    (9)

with random contributions δ_Ai independently drawn from N(0, σ_A²). The simulation model for the observable z_Bi of method B was

z_{Bi} = x_i + \delta_{Bi},   i = 1, ..., n    (10)

z_{Bi} = \gamma_B x_i + \delta_{Bi},   i = n + 1, ..., 2n    (11)

where γ_B represents a systematic deviation for the evaluation cases when γ_B ≠ 1, and where δ_Bi is either a random draw from N(0, σ_B²), or is obtained from a mixture of the normal distributions N(0, σ_B²) and N(0, σ_B1²) with a mixing probability π_B1 for the second distribution. A mixture distribution with σ_B1² > σ_B² is used to obtain a heavy-tailed distribution.

The regression coefficients obtained from the calibration data {y_i, z_Ai} and {y_i, z_Bi}, i = 1, ..., n, were used to generate predictions ŷ_Ai and ŷ_Bi from the observables z_Ai and z_Bi, i = n + 1, ..., 2n, respectively.
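One type 1 simulation run could be sketched as follows (parameter values taken from the second row of Table 4; all variable names are illustrative):

n = 50; x = (1:n)'; beta = 1; sigma = 1;     % design and model parameters (Table 4 footnote)
sigA = 1; gamB = 1; sigB = 1.75;             % second row of Table 4 (extra noise in z_B, no mixture)
ycal = beta*x + sigma*randn(n,1);            % calibration reference values, Eq. (8)
yval = beta*x + sigma*randn(n,1);            % evaluation reference values
zAcal = x + sigA*randn(n,1);   zAval = x + sigA*randn(n,1);        % Eq. (9)
zBcal = x + sigB*randn(n,1);   zBval = gamB*x + sigB*randn(n,1);   % Eqs. (10) and (11)
cA = polyfit(zAcal, ycal, 1);  cB = polyfit(zBcal, ycal, 1);       % calibrate both methods by simple regression
yhatA = polyval(cA, zAval);    yhatB = polyval(cB, zBval);         % predictions for the evaluation cases
eA = yval - yhatA;             eB = yval - yhatB;                  % prediction errors, Eq. (3)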

In the second type of simulations the x_i were considered to be observable, but now two different prediction methods were calibrated on the data {y_i, x_i}, i = 1, ..., n. Method A predicted each evaluation case with the mean of the calibration data; method B made use of the information in x_i, using the regression coefficients a and b from a simple linear regression of y_i on x_i, i = 1, ..., n. It is easy to calculate exactly which method has superior MSEP over the design points. According to standard regression theory the expected MSEP at each design point is for method A the sum of squared bias and variance. For method B, which employs the correct model, bias is absent but the variance is somewhat increased. MSEP values over all design points are obtained by averaging:

MSEP_A = (1/n) \sum_{i=1}^{n} [\beta^2 (x_i - \bar{x})^2 + \sigma^2 (1 + 1/n)] = \sigma^2 (1 + 1/n) + \beta^2 SS_x / n    (12)


Table 4
Results of type 1 simulations

Simulation parameters            Fraction of significant results (p < 0.05)
σ_A   γ_B    σ_B    σ_B1  π_B1   Parametric t-test   Randomization t-test   Sign test   Wilcoxon signed-rank test
1     1      1      -     -      0.057               0.049                  0.038       0.059
1     1      1.75   -     -      0.734               0.730                  0.341       0.632
1     0.95   1      -     -      0.770               0.762                  0.467       0.718
1     1      1      10    0.2    0.730               0.936                  0.565       0.782
1     1      1      10    0.1    0.264               0.587                  0.239       0.412

Two-sided tests with significance level α = 0.05, 199 iterations in each randomization t-test, 1000 simulations. n = 50, x_i = {1, ..., 50}, β = 1, σ = 1.

Table 5
Results of type 2 simulations

β                  Fraction of significant results (p < 0.05)
                   Parametric t-test   Randomization t-test   Sign test   Wilcoxon signed-rank test
0.01651 (= β_0)    0.074               0.092                  0.051       0.091
0.1651 (= 10β_0)   0.503               0.859                  0.503       0.814

Two-sided tests with significance level α = 0.05, 99 iterations in each randomization t-test, 1000 simulations. n = 25, x_i = {1, ..., 23, 50, 50}, σ = 1.

For method B the corresponding average is

MSEP_B = (1/n) \sum_{i=1}^{n} \sigma^2 [1 + 1/n + (x_i - \bar{x})^2 / SS_x] = \sigma^2 (1 + 2/n)    (13)

where SS_x = \sum (x_i - \bar{x})^2. Equating these MSEP values it is found that methods A and B have equal MSEP when β is equal to the standard error of its least squares estimator b:

\beta = \sigma_b = \sigma / \sqrt{SS_x} = \beta_0    (14)

In the simulations of type 2 the values β = β_0 and β = 10β_0 were used. In the latter case method B is expected to outperform method A.
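The value β_0 = 0.01651 used in Table 5 follows directly from Eq. (14) and the design given in the footnote of that table; a quick Matlab check:

x     = [1:23, 50, 50]';               % design points of the type 2 simulations (Table 5)
SSx   = sum((x - mean(x)).^2);         % SS_x as defined after Eq. (13)
beta0 = 1/sqrt(SSx)                    % Eq. (14) with sigma = 1; approximately 0.0165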

The results are shown in Tables 4 and 5. For a proper interpretation remember that the standard error of each fraction P in the table is √[P(1 - P)/1000], e.g. 0.007 for P = 0.057, and 0.013 for P = 0.770. The first simulation of Table 4 implements the null hypothesis of equally distributed prediction errors. The fraction of significant results therefore estimates the size α of the tests. All four tests have values near the theoretical value 0.050. This confirms the theoretical properties of the distribution-free tests, and is a tribute to the robustness of the parametric t-test to the non-normal distribution of d_i. The second simulation shows the power of the tests under the alternative hypothesis that method B performs worse (MSEP_B > MSEP_A). Method B has lower predictive accuracy because the information contained in x is now contaminated with more noise in the observable z_B. The sign test and the Wilcoxon test are seen to be less efficient than both t-tests. The third simulation shows that the presence of a systematic error proportional to x in the predictions of method B is detected with about equal power by the two t-tests, while the Wilcoxon test is somewhat less efficient. The sign test is again seen to be the least powerful. In the fourth and fifth simulation both the parametric t-test and the sign test have less power to discover the gross errors made by method B. The reasons however are different: for the sign test it is inefficient use of the data (only the signs are used), for the parametric t-test it is the failure of the normality assumption. Also the Wilcoxon test is not very efficient in comparison to the randomization t-test.

Table 5 shows the results of the type 2 simulations. The first simulation of Table 5 shows the behaviour of the tests when two methods have equal MSEP, but unequal distributions. Only the sign test is seen to have the right size. The other tests give significant results in more than 5% of the cases. The fraction of significant results has in this case to be interpreted as the power for the alternative of specific unequal distributions. Note, however, that no effort has been made here to construct test statistics which are especially sensitive for this alternative. In the second simulation the information in x outweighs the random error. The difference between methods A and B is detected more often by the randomization t-test and the Wilcoxon test than by the parametric t-test and the sign test. Note also that the Wilcoxon signed-rank test is here again not quite as powerful as the randomization t-test. The reason is again, as for the sign test, incomplete use of the data (only the signed ranks are used and not the actual values).

7. Extensions

The proposed randomization t-test for the comparison of the predictive accuracy of two methods is useful as such, but may also serve as the simplest example of a more general approach to method comparisons. In the example the emphasis was on the comparison of statistical methods applied to the same data. However, there is no reason to restrict the use to this situation: also different chemical methods, e.g. different screening methods, may be compared with respect to their ability to reproduce reference values. This use was in fact already implicit in the type 1 simulations.

The randomization t-test can be extended in several ways. First, when more than two prediction methods are to be tested simultaneously an obvious extension is the randomization analysis of variance, see e.g. Refs. [3,4]. When more response variables are predicted simultaneously, as e.g. in PLS2 or multivariate linear regression, method comparison may involve randomization tests for multivariate analysis of variance [3].

It was seen in the simulations that the parametric t-test performs quite well as long as there are no gross errors. An alternative to the use of the randomization t-test may therefore be to apply some sort of robust test, e.g. using trimmed data.

In this paper the prediction of quantitative responses was discussed. However, many chemometrical problems are of the classification type, e.g. the classification of samples as positive or negative in the inspection for unwanted substances, or the origin classification of food products or environmental contaminations. It is then often of interest to compare the performance of several proposed classification methods. One of the most common criteria used for this purpose is the proportion of erroneously classified cases, the error rate (or its complement, the non-error rate). However, the comparison of classification methods is very insensitive when only the class assignments of the competing methods are used. In most cases only a small part of all evaluation cases is classified differently by the two methods. A method comparison can be done using the sign test, but will then be based only on these differently classified cases. Van der Voet and Doornbos [19] compared classification methods with respect to their predictive performance on real data, and noted that a difference of 5 or less in the number of cases misclassified by two methods can never be significant at the 95% confidence level by the two-sided sign test, regardless of the size of the evaluation set or the pattern of misclassification. A typical example is the comparison of two methods which misclassify 2 and 4 cases, respectively, out of 100 evaluation cases, so that a sign test comparison can be based on only 6 out of all 100 cases at most. Clearly the (non-)error rate is not very well suited for a detailed method comparison using tests. However, an alternative test procedure can be found if probabilistic classification methods are used. Probabilistic methods (standard discriminant analysis being a prime example) assign posterior probabilities to the evaluation cases for each class under consideration. It is then possible to construct performance scores from these probabilities, which, like M̂SEP, are means over the individual cases. Consequently the tests of this paper can also be applied to the comparison of probabilistic classification methods (see for example Van der Voet and Doornbos [19], who used the parametric t-test). Some care has to be taken in selecting the performance measure. If c(i) denotes the real class membership of evaluation case i, and if P_{i,c(i)} is the probability assigned to this class by the classification method, then the average probability assigned to the proper class, (1/n) Σ P_{i,c(i)}, may look an appealing criterion. However, it can be shown that this is not a proper scoring rule, in the sense that it gives undue preference to methods with too extreme probabilities [20,21]. Better performance criteria are for example the averaged log-likelihood, (1/n) Σ ln(P_{i,c(i)}), or the quadratic score, (1/n) Σ [(1 - P_{i,c(i)})² + Σ_{k≠c(i)} P_{ik}²] [22]. The individual components of these criteria can be used in the same way as the squared prediction errors in the paired comparison tests.
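As an illustration of these scores, suppose P is an n-by-g matrix of posterior probabilities (one row per evaluation case, one column per class) and c holds the true class index of each case; both names are hypothetical. The averaged log-likelihood and the quadratic score are then:

% P : n-by-g matrix of posterior class probabilities; c : true class index per case
n   = size(P, 1);
Pc  = P(sub2ind(size(P), (1:n)', c(:)));           % P_{i,c(i)}: probability assigned to the true class
loglik = mean(log(Pc))                             % averaged log-likelihood
quad   = mean((1 - Pc).^2 + sum(P.^2, 2) - Pc.^2)  % quadratic score: (1-P_c)^2 plus sum of squared other-class probabilities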

8. Conclusions

Some simple conclusions can be drawn. First, judging the predictive performance of competing methods by the ordering of M̂SEP values only can be very misleading. It is possible to test the hypothesis that two methods have equal predictive ability as expressed in the distribution of prediction errors. The randomization t-test seems a good choice, as the results in this paper show. The parametric t-test will often perform equally well, but is sensitive to the presence of gross errors. The sign test is much less efficient, and should only be used if the hypothesis P(|e_Ai| < |e_Bi|) = 0.5 (without consideration of the absolute magnitude of the errors) is to be tested. The Wilcoxon signed-rank test has no clear place other than as a more accessible but less powerful alternative to the randomization t-test.

Rejection of the equal-distribution hypothesis is possible in the absence of any difference in MSEP. In that case other parameters of the prediction error distributions are unequal, and it remains to be decided on non-statistical grounds which prediction method should be preferred in practice, e.g. the one that avoids gross errors, or the one with usually better performance but with incidental gross errors. Graphical methods are advised to study the pattern of prediction errors.

A final conclusion is that a similar test approach can be used for classification problems.

Acknowledgements

I wish to express my gratitude to Durk Doornbos, who initiated me both in analytical chemistry and in statistics. Thanks are also due to Willem Drost (TNO) for use of the coffee data, and to Peter Groenendijk and Rob Frankhuizen (RIKILT-DLO) for the analysis of these data with NSAS, the latter of whom also provided the pea data. Cajo ter Braak (GLW-DLO), Michiel Jansen (GLW-DLO) and Age Smilde (University of Amsterdam) were very stimulating in their comments on this work.

Appendix. Matlab code for the proposed randomization t-test comparing the predictive accuracy of two methods

In the Matlab code below y, yhatA and yhatB are vectors of equal length containing the actual values and predictions from models A and B for n evaluation cases. niter is the number of randomization trials. rand is a function generating random uniform numbers. This implementation is for a two-sided test; for a one-sided test the taking of absolute values (function abs) should be deleted.

% y, yhatA, yhatB : reference values and predictions from models A and B
eA = y - yhatA;
eB = y - yhatB;
d  = eA.^2 - eB.^2;                  % differences of squared prediction errors
meandiff = mean(d);
n = length(d);
niter = 199;                         % number of randomization trials
count = 0;
for k = 1:niter
    randomsign = 2*round(rand(size(d))) - 1;   % random +1/-1 sign for each case
    signeddiff = randomsign .* d;
    meansigneddiff = mean(signeddiff);
    count = count + (abs(meansigneddiff) >= abs(meandiff));
end
pvalue = (count + 1)/(niter + 1)

References

[1] A.G.M. Steerneman, personal communication, 1994.
[2] R.A. Fisher, The Design of Experiments, Oliver and Boyd, Edinburgh, 1935.
[3] E.S. Edgington, Randomization Tests, Marcel Dekker, New York, 2nd edn., 1987.
[4] B.F.J. Manly, Randomization and Monte Carlo Methods in Biology, Chapman and Hall, London, 1991.
[5] M. Stone and P. Jonathan, Statistical thinking and technique for QSAR and related studies. Part I: General theory, Journal of Chemometrics, 7 (1993) 455-475.
[6] K.-H. Jöckel, Monte Carlo techniques and hypothesis testing, in P.R. Nelson, E.J. Dudewicz, A. Öztürk and E.C. van der Meulen (Editors), The Frontiers of Statistical Computation, Simulation, and Modeling, Volume 1 of the Proceedings of the ICOSCO-I Conference, American Sciences Press, Syracuse, NY, 1991, pp. 21-42.
[7] Matlab User's Guide for Microsoft Windows, MathWorks, Natick, MA, 1993.
[8] Genstat 5 Committee, Genstat 5 Reference Manual, Clarendon Press, Oxford, 1987.
[9] D. Wallach and B. Goffinet, Mean squared error of prediction in models for studying ecological and agronomic systems, Biometrics, 43 (1987) 561-573.
[10] D. Wallach and B. Goffinet, Mean squared error of prediction as a criterion for evaluating and comparing system models, Ecological Modelling, 44 (1989) 299-306.
[11] D.W. Osten, Selection of optimal regression models via cross-validation, Journal of Chemometrics, 2 (1988) 39-48.
[12] D.M. Haaland and E.V. Thomas, Partial least-squares methods for spectral analyses. 1. Relation to other quantitative calibration methods and the extraction of qualitative information, Analytical Chemistry, 60 (1988) 1193-1202.
[13] S. Wold, Cross-validatory estimation of the number of components in factor and principal components analysis, Technometrics, 20 (1978) 397-405.
[14] H.T. Eastment and W.J. Krzanowski, Cross-validatory choice of the number of components from a principal component analysis, Technometrics, 24 (1982) 73-77.
[15] I.N. Wakeling and J.J. Morris, A test of significance for partial least squares regression, Journal of Chemometrics, 7 (1993) 291-304.
[16] E.L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks, Holden-Day, San Francisco, CA, 1975.
[17] M.A.J. van Montfort, Statistical Remarks on Round Robin Data of IPE and ISE, Technical Note 92-02, Department of Mathematics, Wageningen Agricultural University, Wageningen, 1992.
[18] NSAS Manual for Near-Infrared Spectral Analysis Software, Version 3.25, NIRSystems, Silver Spring, MD, USA, 1989.
[19] H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 2. Practical evaluation of SIMCA, ALLOC and CLASSY on three data sets, Analytica Chimica Acta, 161 (1984) 125-134.
[20] R.L. Winkler, Scoring rules and the evaluation of probability assessors, Journal of the American Statistical Association, 64 (1969) 1073-1078.
[21] H. van der Voet, P.M.J. Coenegracht and J.B. Hemel, The evaluation of probabilistic classification methods. Part 1. A Monte Carlo study with ALLOC, Analytica Chimica Acta, 191 (1986) 47-62.
[22] J. Hilden, J.D.F. Habbema and B. Bjerregaard, The measurement of performance in probabilistic diagnosis. III. Methods based on continuous functions of the diagnostic probabilities, Methods of Information in Medicine, 17 (1978) 238-246.