

International Journal of Radiation Oncology • Biology • Physics (www.redjournal.org)

Physics Contribution

Statistical Validation of Normal Tissue Complication Probability Models

Cheng-Jian Xu, PhD,* Arjen van der Schaaf, PhD,* Aart A. van't Veld, PhD,* Johannes A. Langendijk, MD, PhD,* and Cornelis Schilstra, PhD*,†

*Department of Radiation Oncology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands; and †Radiotherapy Institute Friesland, Leeuwarden, The Netherlands

Received Oct 31, 2011, and in revised form Jan 13, 2012. Accepted for publication Feb 9, 2012

Summary

Normal tissue complication probability (NTCP) models must be statistically validated before clinical use. In our study we used repeated double cross-validation and permutation tests to validate NTCP models for xerostomia after radiation therapy treatment of head-and-neck cancer. Repeated double cross-validation showed the variability of prediction performance and variable selection. The statistical significance of the model can be defined when compared with the equivalent performance measures of permuted data models.

Reprint requests to: Cheng-Jian Xu, PhD, University Medical Center Groningen, Department of Radiation Oncology, P.O. Box 30.001, 9700 RB Groningen, The Netherlands. Tel: (31) 50-361-0039; Fax: (31) 50-361-1692; E-mail: [email protected]

Supported by the ALLEGRO (eArLy and Late hEalth risks to normal/healthy tissues from the use of existing and emerGing techniques for RadiatiOn therapy) project. Conflict of interest: none.

Int J Radiation Oncol Biol Phys, Vol. 84, No. 1, pp. e123-e129, 2012. 0360-3016/$ - see front matter © 2012 Elsevier Inc. All rights reserved. doi:10.1016/j.ijrobp.2012.02.022

Purpose: To investigate the applicability and value of double cross-validation and permutation tests as established statistical approaches in the validation of normal tissue complication probability (NTCP) models.
Methods and Materials: A penalized regression method, LASSO (least absolute shrinkage and selection operator), was used to build NTCP models for xerostomia after radiation therapy treatment of head-and-neck cancer. Model assessment was based on the likelihood function and the area under the receiver operating characteristic curve.
Results: Repeated double cross-validation showed the uncertainty and instability of the NTCP models and indicated that the statistical significance of model performance can be obtained by permutation testing.
Conclusion: Repeated double cross-validation and permutation tests are recommended to validate NTCP models before clinical use. © 2012 Elsevier Inc.

Introduction

Normal tissue complication probability (NTCP) modeling (1) in radiation therapy aims to describe the relationship between dose distribution parameters, whether or not corrected for other variables, and the probability of side effects. In this regard, NTCP models can support early clinical decision making, be used for the evaluation and optimization of radiation therapy treatment planning, and help to explore the mechanism of complications (2). In recent years, statistical learning methods have increasingly been used (3-8) to create NTCP models. However, the prediction models obtained by statistical learning cannot always be trusted, which makes statistical validation crucially important before such models can be recommended for general clinical use.

An NTCP model is trained on a limited clinical database at some point in time, with the objective of correctly predicting side effects in patients who will be treated in the future. At the time of construction, it is impossible to know exactly how accurately the model will predict side effects in new patients, because these new patients are not yet available. Therefore, model performance is generally estimated using available databases.

The NTCP performance estimate is subject to several requirements. First, to avoid an overly optimistic performance estimate, these validation databases should be independent: they must not have been used to create the NTCP model. Second, the performance estimate should take the uncertainty of the models into consideration. The database used to build the NTCP model is a sample from the entire population, so the NTCP model that is created from the database is only one possible realization. Other samples from the same population will result in different NTCP models. For example, different variables and/or different regression coefficients could be selected.

These first 2 requirements can be fulfilled by cross-validation (9, 10). The purpose of cross-validation is to choose the optimal model parameters and to determine the true prediction performance of a statistical model. Cross-validation uses the available data minus a specified part (eg, 1/kth part of the total data set) to fit the model, whereas the 1/kth part that was left out is used to test the model. However, the prediction performance based on such a single cross-validation loop is biased and often too optimistic, because the model selection is based on the same set that is used to determine the prediction performance. Hence, another separate test set is required to determine the unbiased prediction performance. This third requirement can be fulfilled by double cross-validation (9, 11, 12). This procedure uses the available data efficiently; all samples are used for both model building and validation by using a double resampling loop. Moreover, the stability of model performance can be assessed by the repeated double cross-validation procedure, whereby the resampled data sets are different and the cross-validated performance estimate incorporates the variability of the model performance.

Various criteria have been proposed to quantify prediction performance, including the likelihood and the area under the receiver operating characteristic curve (AUC). However, the values of these measures do not indicate statistical significance, so good prediction performance could result from chance alone. As the number of candidate predictors increases, the probability of good performance from chance alone becomes even higher. Therefore, a permutation test can be used to quantify the significance of the prediction performance of NTCP models (12, 13).

In a permutation test, the labels of patients (with/without complications) are randomly reassigned to create an uninformative data set with the same size as the original set. Building and testing NTCP models on many permutations of the data yields a distribution of the prediction performance under the assumption that the data are uninformative. Subsequently, the statistical significance can be tested by comparison with this distribution. These approaches represent established statistical techniques, but to our knowledge they have not previously been used in radiation therapy science, and their applicability in NTCP model validation still needs to be assessed.

The aim of our study was to determine how to apply the above statistical methods to the validation problem of NTCP models. The procedure was conducted on a data set concerning xerostomia after definitive radiation therapy of head-and-neck cancer.

Methods and Materials

Head-and-neck cancer patient data

The data set contained 185 patients who were all treated with primary 3-dimensional conformal radiation therapy for head-and-neck tumors, some in combination with chemotherapy. Xerostomia was assessed at 6 months after completion of treatment using the Radiation Therapy Oncology Group (RTOG) late radiation morbidity scoring system (14). The primary endpoint was defined as RTOG grade ≥2 xerostomia; 106 patients were assessed as having xerostomia.

For each patient, 21 candidate variables were initially included in the variable selection procedure. The variables consisted of 5 clinical and 16 dosimetric factors. The 5 clinical variables were chemotherapy, gender, age, treatment center, and baseline xerostomia score.

To avoid collinearity problems in NTCP modeling, we excluded highly correlated variables and selected only the mean doses as candidate variables. The dosimetric factors were the mean dose given to the organs at risk and their volumes. These organs were the lower lip, the soft palate, the contralateral and ipsilateral parotid gland, the contralateral and ipsilateral sublingual gland, and the contralateral and ipsilateral submandibular gland.

Statistical analysis

Logistic regression analysis

The complication outcome $y$ is binary: $y = 1$ means occurrence of a complication, and $y = 0$ means no complication after treatment. The logistic regression model is defined as

$$\mathrm{NTCP} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \dots + \beta_p X_{1p})}} \qquad (1)$$

where $p$ is the number of variables, and NTCP is the probability of the first patient having a complication. $X_{11}, X_{12}, \dots, X_{1p}$ represent the different variables and $\beta_0, \beta_1, \dots, \beta_p$ the corresponding regression coefficients.
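As an illustration, Eq. (1) can be evaluated directly once the coefficients are known. The coefficients and predictor values below are arbitrary placeholders, not values fitted to the study data:

```python
import math

def ntcp_logistic(x, beta0, beta):
    """Evaluate the logistic NTCP model of Eq. (1) for one patient:
    x and beta are equal-length sequences of predictor values and
    regression coefficients."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative numbers only (eg, age 60 years, baseline score 1):
p = ntcp_logistic([60.0, 1.0], beta0=-4.0, beta=[0.05, 0.9])
```

Because the linear predictor is passed through the logistic function, the result is always a probability between 0 and 1.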

Model assessment criterion

Two criteria were used to evaluate the prediction performance. One is the logarithm of the likelihood, which is defined as

$$\ln L = \sum_i \left[ (1 - r_i)\ln(1 - P_i) + r_i \ln(P_i) \right] \qquad (2)$$

Here $P_i$ is the NTCP for patient $i$, and $r_i = 1$ when the $i$th patient has shown the complication and $r_i = 0$ otherwise. The other model assessment criterion is the AUC. The likelihood shows the differences between the measured results and the model predictions, and thus characterizes the accuracy of the model predictions, whereas the AUC focuses on the discriminative ability of the model.
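Both criteria are easy to compute from predicted probabilities. A minimal sketch (the AUC is computed here from its pairwise-ranking definition rather than with any particular library; the toy outcomes and predictions are invented):

```python
import math

def log_likelihood(r, p):
    """Eq. (2): r[i] is the observed outcome (0/1), p[i] the predicted NTCP."""
    return sum((1 - ri) * math.log(1 - pi) + ri * math.log(pi)
               for ri, pi in zip(r, p))

def auc(r, p):
    """AUC as the probability that a random patient with the complication
    receives a higher predicted NTCP than one without (ties count 1/2)."""
    pos = [pi for ri, pi in zip(r, p) if ri == 1]
    neg = [pi for ri, pi in zip(r, p) if ri == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

r = [1, 0, 1, 0]          # toy observed outcomes
p = [0.8, 0.3, 0.6, 0.4]  # toy predicted NTCP values
```

In this toy example every patient with the complication outranks every patient without it, so the AUC is 1.0, while the log-likelihood is negative as it must be for probabilities below 1.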

Fig. 1. Schematic representation of the double 10-fold cross-validation scheme that was applied on a xerostomia data set of 185 patients.


Least absolute shrinkage and selection operator (LASSO)

The LASSO method (15) was selected in this study for model fitting because it often shows superior prediction performance (8) while handling a larger number of candidate variables. LASSO limits the absolute magnitude of the regression coefficients; the variable selection problem in LASSO amounts to finding the elements of $\beta$ that equal zero. Estimates are chosen to minimize

$$\sum_{i=1}^{n} \left( y_i - \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip})}} \right)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le s \qquad (3)$$

Here $n$ is the number of patients in the data set, and $s$ is a tuning parameter that constrains how many predictor variables will be selected in the final model (the penalty weight $\lambda$ and the constraint bound $s$ are equivalent parameterizations of the same restriction). When $s$ is large enough, the constraint has no effect and the solution equals that of ordinary logistic regression. In this study, the value of $s$ was chosen by the inner single cross-validation.
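The study tuned $s$ with the glmnet routine in MATLAB; as a rough scikit-learn analogue (an illustrative sketch on synthetic data, not the study's implementation), `LogisticRegressionCV` with an L1 penalty selects the regularization strength by an inner cross-validation and zeroes out unselected coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(185, 21))  # synthetic stand-in: 185 patients, 21 variables
y = (X[:, 2] + X[:, 20] + rng.normal(size=185) > 0).astype(int)

# L1 (LASSO) penalty; the strength (the role of s in Eq. 3) is chosen
# by an inner cross-validation over a grid of 10 candidate values.
model = LogisticRegressionCV(Cs=10, cv=9, penalty="l1",
                             solver="liblinear", scoring="neg_log_loss")
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])  # variables with nonzero coefficients
```

The nonzero entries of `model.coef_` play the role of the selected variables in the fitted NTCP model.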

Software

Data analysis was performed on a Pentium 4 desktop computer. All programs for calculation were coded in MATLAB 7.10 for Windows (MathWorks, Natick, MA). LASSO was implemented in the MATLAB glmnet package, and computations were performed with the glmnet routine (16).

Statistical model validation: Double k-fold cross-validation

Single cross-validation is often used in building NTCP models (4). With single cross-validation, the total data set is divided into 2 parts: the training set and the validation set. The validation set is then used both to develop the NTCP model and to test model performance. It should be noted that the same samples (in the validation set) are thus also used to find the best overall model parameter; therefore they are not completely independent, which is required for a proper validation. In contrast, in double cross-validation the total data set is divided into a training set, a validation set, and a test set. A model can be developed and optimized by using the training and validation sets. The test set is then only used to test the model performance. By repeating the procedure so that each sample appears exactly once in the test set, the prediction performance will be representative for new samples.

In our analysis we used k = 10 for double k-fold cross-validation. First, the original data set was divided into 10 parts. Next, 1 part was retained as the test set, and the remaining 9 subsamples were used as training data and as a validation set. A 9-fold cross-validation was then performed on the remaining 9 parts to optimize the LASSO model in a nested validation scheme. After this, the validation sets were used to determine the optimal s in Eq. 3, and the test set was used to find the prediction performance of the resulting model. The double cross-validation with LASSO is expressed in the following pseudocode:

Divide the data set into k parts
For i = 1 to k
    For j = 1 to k-1
        Build NTCP models by LASSO with different s
    End
    Find an optimal s
    Build NTCP model by LASSO with the optimal s
End
Obtain prediction performance
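The pseudocode above can be sketched in runnable form. The fragment below is an illustrative Python/scikit-learn analogue of the procedure (the study itself used glmnet in MATLAB), with synthetic data standing in for the patient set; the inner cross-validation inside `LogisticRegressionCV` tunes the LASSO penalty, and the outer loop only scores the tuned model on held-out patients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def double_cv_auc(X, y, k=10, seed=0):
    """Double k-fold CV: inner CV tunes the L1 penalty, the outer loop
    produces an independent prediction for every patient exactly once."""
    outer = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    y_pred = np.empty(len(y), dtype=float)
    for train, test in outer.split(X, y):
        m = LogisticRegressionCV(Cs=10, cv=k - 1, penalty="l1",
                                 solver="liblinear")
        m.fit(X[train], y[train])
        y_pred[test] = m.predict_proba(X[test])[:, 1]
    return roc_auc_score(y, y_pred)

rng = np.random.default_rng(1)
X = rng.normal(size=(185, 21))  # synthetic stand-in for the xerostomia data
y = (X[:, 2] + X[:, 20] + rng.normal(size=185) > 0).astype(int)
auc_outer = double_cv_auc(X, y)
```

Because each outer test fold never touches the tuning step, `auc_outer` is an unbiased estimate of the performance of the whole model-building procedure.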

Figure 1 shows that about 1/10 of the 185 patients (≈18) were used in the outer loop as an independent test set; the test set was always independent from the set used for model building, whereas the remaining 167 patients were used in the inner loop to develop and optimize the model. The outer loop was repeated 10 times, so each sample was used for prediction. The test sets yielded an estimation of the prediction performance of the models that were obtained from the inner loop. The inner loop was thus used to fit the model, and the outer loop was used to obtain the prediction performance. In the inner loop, a 9-fold cross-validation was applied, where 8/9 of the 167 patients were used for training and 1/9 (≈19) were used for validation. To obtain the best models from the training, the inner loop was repeated 9 times.

Because the k-fold cross-validation estimate is a random variable that depends on the partition of the data set, repeating k-fold cross-validation multiple times using different splits provides a large number of NTCP models. All these models have different prediction performance and variable selection. Therefore, the variability of the NTCP models can be shown by the repeated double cross-validation.

Permutation test

Essentially, the permutation test procedure measures the likelihood of obtaining the observed accuracy by chance. A model is deemed invalid if there is an unacceptably high probability that random pairings of predicted and observed values will yield a better goodness-of-fit measure than was obtained from the original prediction sequence. This is arguably the most basic of all possible significance tests of a model's predictive ability.

In the permutation test, the endpoints of the patients (with or without complications) were exchanged: they were randomly assigned to different individuals. An NTCP model was then recalculated using the "wrong" endpoints. The rationale is that with the wrongly labeled endpoints the recalculated NTCP model should not yield a good prediction. Finally, we tested whether the original prediction performance was significantly different from the performance of the models based on the permuted data. The prediction performance obtained from the original data should be outside the 95% or 99% confidence bounds of the distribution from the permuted data.
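A permutation test along these lines can be sketched as follows. The data, model, and permutation count are illustrative stand-ins for the study's setup, and the P value uses the slightly conservative "add one" convention for the at-least-as-extreme count:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def cv_auc(X, y):
    """Cross-validated AUC of a plain logistic model (illustrative)."""
    p = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=10, method="predict_proba")[:, 1]
    return roc_auc_score(y, p)

rng = np.random.default_rng(2)
X = rng.normal(size=(185, 5))                       # synthetic predictors
y = (X[:, 0] + rng.normal(size=185) > 0).astype(int)  # informative outcome

observed = cv_auc(X, y)
# Refit and rescore on randomly relabeled ("wrong") endpoints many times;
# permuting y preserves the number of events automatically.
perm_aucs = [cv_auc(X, rng.permutation(y)) for _ in range(200)]
p_value = (1 + sum(a >= observed for a in perm_aucs)) / (1 + len(perm_aucs))
```

With informative data the observed AUC sits far in the tail of the permutation distribution, so the P value approaches its lower bound of 1/(number of permutations + 1).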

Fig. 2. Box plot of prediction performance for different k-fold choices (3-, 5-, and 10-fold) of double cross-validation. (a) Prediction performance evaluated by likelihood. (b) Prediction performance evaluated by area under the receiver operating characteristic curve (AUC).


Results

The prediction likelihoods resulting from the different partition schemes are given in Fig. 2a, showing the accuracy of the model prediction of xerostomia. In the box plot of Fig. 2, the first and third quartiles are indicated by the edges of the box area. The extreme values (within 1.5 times the interquartile range from the upper or lower quartile) are represented by the ends of the lines extending from the interquartile range. Points at a larger distance from the median than 1.5 times the interquartile range are plotted individually. In Fig. 2, different k-fold cross-validations have been used. When k-fold was used as the outer loop, (k-1)-fold was applied as the inner loop. For example, if 10-fold cross-validation was used in the outer loop, 9/10 of the samples were used for training and optimizing the model, and 1/10 were used for testing. In the inner loop, 9-fold cross-validation was used to optimize s (Eq. 3) to find the best LASSO model. As shown in Fig. 2a, different k-fold cross-validation schemes yielded very similar prediction results, whereas 3-fold cross-validation yielded a slightly lower prediction likelihood than 5-fold and 10-fold cross-validation. Compared with 3-fold and 5-fold cross-validation, the prediction likelihood of 10-fold cross-validation was relatively stable, because the 3-fold and 5-fold approaches used more data for testing. No general conclusion can be drawn about which cross-validation approach should be used, but in practice the 10-fold or 5-fold approaches are used more often. In this study, 10-fold cross-validation was used as the default for modeling.

Table 1. Different normal tissue complication probability models obtained by double cross-validation

No.     | Formula of model
Model 1 | -4.5839 + 0.0354 * V3 - 0.0007 * V7 + 0.0293 * V8 + 0.0167 * V10 + 0.0797 * V11 - 0.0343 * V15 + 0.0060 * V16 + 0.9023 * V21
Model 2 | -4.6925 + 0.1900 * V2 + 0.0288 * V3 - 0.0005 * V7 + 0.0223 * V8 + 0.0293 * V10 + 0.2448 * V11 - 0.0097 * V14 - 0.0484 * V15 + 0.0129 * V16 + 0.9185 * V21
Model 3 | -3.9777 + 0.0238 * V3 + 0.0237 * V8 + 0.0139 * V10 + 0.0147 * V15 + 0.0115 * V16 + 0.8692 * V21

Definition of variables: V1 = chemotherapy; V2 = gender; V3 = age; V4 = medical center; V5 = volume of soft palate; V6 = mean dose to soft palate; V7 = volume of contralateral parotid gland; V8 = mean dose to contralateral parotid gland; V9 = volume of ipsilateral parotid gland; V10 = mean dose to ipsilateral parotid gland; V11 = volume of contralateral sublingual gland; V12 = mean dose to contralateral sublingual gland; V13 = volume of ipsilateral sublingual gland; V14 = mean dose to ipsilateral sublingual gland; V15 = volume of contralateral submandibular gland; V16 = mean dose to contralateral submandibular gland; V17 = volume of ipsilateral submandibular gland; V18 = mean dose to ipsilateral submandibular gland; V19 = volume of the lower lip; V20 = mean dose to the lower lip; V21 = baseline xerostomia score.

The AUC was also selected as another criterion to evaluate the model performance, because it describes the discrimination ability of the NTCP models. Figure 2b again shows that 3-fold cross-validation prediction resulted in a somewhat lower value and contained more variation than 5-fold and 10-fold cross-validation.

Furthermore, Fig. 2 illustrates the varying prediction performance of the model with the variation of sampling. Table 1 shows 3 examples of the 1000 NTCP models obtained by repeating 10-fold double cross-validation. The variation of sampling can be clearly seen in the 3 models. The median prediction performance of 10-fold double cross-validation is 0.605 (likelihood) and 0.825 (AUC). Compared with the model obtained from single 10-fold cross-validation listed in Table 2, we found that single 10-fold cross-validation tended to overestimate the prediction performance (likelihood 0.63 vs 0.605, and AUC 0.86 vs 0.825).


Table 2. Normal tissue complication probability model obtained by single cross-validation

Model                             | Formula of model                                                                                    | Likelihood | AUC
Model 1 (single cross-validation) | -4.1088 + 0.0268 * V3 + 0.0257 * V8 + 0.0168 * V10 - 0.0208 * V15 + 0.0088 * V16 + 0.8560 * V21     | 0.63       | 0.86

Abbreviation: AUC = area under the receiver operating characteristic curve. Definition of variables: same as Table 1.

Fig. 3. Frequencies of selection (occurrence out of 1000 models) of the 21 predictor variables by LASSO (least absolute shrinkage and selection operator) with repeated double cross-validation. Frequent occurrence of a variable indicates that it is a common feature in modeling, whereas infrequent occurrence indicates that it is an idiosyncratic feature of a specific model.


After 10-fold cross-validation was repeated 100 times, 1000 NTCP models were obtained. In each model, different variables were selected. We calculated the frequency of selection of these 21 variables, as shown in Fig. 3. Six variables have a chance ≥80% of being selected: age, baseline xerostomia score, mean dose to the contralateral parotid gland, mean dose to the ipsilateral parotid gland, mean dose to the contralateral submandibular gland, and volume of the contralateral submandibular gland. During each modeling procedure, the common features and the idiosyncratic features of the data set were modeled together. However, only the common features were frequently selected across different modeling procedures, and these features are more important. Other features, such as the mean dose to the soft palate (V6), which was selected only approximately 200 times in 1000 models, are regarded as idiosyncratic features and are less important in the NTCP model.
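This selection-frequency computation can be sketched on synthetic data as follows (a scaled-down, illustrative stand-in for the 100 x 10-fold repetition described above, again using scikit-learn's L1-penalized logistic regression in place of glmnet):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

def selection_frequencies(X, y, repeats=3, k=10, seed=0):
    """Fraction of fitted models in which each variable receives
    a nonzero LASSO coefficient, across repeated k-fold fits."""
    counts = np.zeros(X.shape[1])
    for rep in range(repeats):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed + rep)
        for train, _ in cv.split(X, y):
            m = LogisticRegressionCV(Cs=5, cv=3, penalty="l1",
                                     solver="liblinear")
            m.fit(X[train], y[train])
            counts += (m.coef_[0] != 0)
    return counts / (repeats * k)

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))                         # synthetic data
y = (2 * X[:, 0] + rng.normal(size=120) > 0).astype(int)
freq = selection_frequencies(X, y)
```

The strongly informative variable (index 0 in this synthetic example) should be selected in essentially every model, mirroring the "common feature" behavior described in the text.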

Figure 4 illustrates the actual prediction performance of the NTCP model and the distribution obtained with 1000 permutations. The endpoints of the patients (with or without xerostomia) were randomly exchanged (ie, they were randomly assigned to different individuals). However, in each permutation, the number of events was kept constant. With the permuted endpoints, we also built NTCP models and calculated their prediction performance. The prediction performance of all these models resulted in the distribution shown in Fig. 4. The significance of the established NTCP models was tested by comparison with these permutation-based NTCP models. The average original prediction performance is shown as a red cross in Fig. 4, and for both measures there is a clear distinction between the permutation distribution and the red-cross performance. The P value of the permutation testing is <.001, which indicates that the NTCP models obtained by LASSO with 10-fold double cross-validation are statistically significant.

Discussion

It is still common practice in radiation oncology and other medical fields to build only a single model and report model performance without addressing model uncertainty. However, as shown in Table 1, differently subsampled data can produce completely different NTCP models with similar prediction performance. Statistical analysis essentially uses the distribution from samples to predict the distribution of an entire population, because whole-population data are not usually available. Therefore, it is important to evaluate the uncertainty of models by sampling, instead of simply presenting a single model and treating it as a universal or unique model. Double cross-validation is a good candidate for model uncertainty evaluation.

In single cross-validation, 1/10 of the data is used both for model optimization and for evaluating prediction performance. Compared with the prediction performance obtained from double cross-validation, we found that the prediction value of single cross-validation is more optimistic, because single cross-validation does not use a separate prediction set; the prediction set is used both for model optimization and for prediction. This is why the prediction likelihood (0.63) and AUC (0.86) are higher than in double cross-validation, where the median likelihood was 0.605 and the median AUC was 0.825. This indicates that modeling by single cross-validation tends to overestimate prediction performance. For testing prediction performance fairly, we therefore recommend double cross-validation in NTCP modeling.

It may seem unclear which model is being validated with double cross-validation, because the internal loop returns different optimized models for different training sets. In this case, the variability of the model is taken into account in estimating the performance with double cross-validation. Consequently, in double cross-validation, the entire model optimization procedure is validated.

Another question that arises when using double cross-validation is how to translate the results into a final NTCP model. No consensus currently exists on how to choose the overall model based on the models resulting from the inner loop. Therefore, instead of having 1 final model, multiple predictions can be obtained from the many different models that were developed during the cross-validation procedure. Instead of having a single prediction of a complication, it is interesting to know the variability in the prediction when many related models are used. The average of multiple models may be a better estimate than a single prediction. This idea refers to aggregating predictors (17). The aggregation of many predictors can have a smaller variance than a single predictor (18).
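Aggregating the fold models by averaging their predictions can be sketched as follows (synthetic data; `X_new` is a hypothetical stand-in for future patients):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(4)
X = rng.normal(size=(185, 5))                         # synthetic training data
y = (X[:, 0] + rng.normal(size=185) > 0).astype(int)
X_new = rng.normal(size=(20, 5))                      # simulated future patients

# Average the predicted NTCP over the models fitted in each
# cross-validation fold instead of committing to one final model.
preds = []
for train, _ in StratifiedKFold(n_splits=10, shuffle=True,
                                random_state=0).split(X, y):
    m = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    preds.append(m.predict_proba(X_new)[:, 1])
ntcp_aggregated = np.mean(preds, axis=0)
```

The spread of `preds` across folds also gives a direct picture of the prediction variability discussed above, and the averaged prediction is the bagging-style aggregate of references (17, 18).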

Fig. 4. Prediction results based on cross-validation prediction of the original labeling compared with the permuted data, assessed using (a) likelihood and (b) area under the receiver operating characteristic curve (AUC). The average prediction performance across all models is shown as a red cross at the far right side of each plot.


In a permutation test, the endpoints of patient outcome (with or without complication) are repeatedly removed and randomly reassigned to samples to create an uninformative data set of the same size as the data set under study. The following question then arises: how many permutations are needed? For very small data sets, it may be feasible to perform an exhaustive permutation test in which all possible permutations are considered. However, the number of possible permutations quickly increases, even for moderate class sizes. As an alternative, a test can be performed with only a subset of all permutations. The number of permutations determines the accuracy and the lower bound of the P value; with 100 permutations, the lowest possible P value is .01. Because the variance of the performance in permutations can be very large, a large number of permutations is needed to obtain a reliable result. In the current study, the number of permutations was set to 1000.

In this study, only xerostomia was used as an endpoint to demonstrate the validation procedure. However, this validation procedure can also be applied to other radiation oncology data with other complications. Another consideration is that only internal statistical validation of an NTCP model was discussed, as a first important step for applying NTCP in practice. Further validation, including external validation, biological validation, and clinical validation, will also be necessary for testing the clinical utility of NTCP models.

Conclusion

Modeling by LASSO with double cross-validation shows that radiation-induced xerostomia can be predicted by the mean dose to the contralateral and ipsilateral parotid glands and the contralateral submandibular gland, the volume of the contralateral submandibular gland, age, and baseline xerostomia score. Repeated double cross-validation shows the variability of prediction performance and variable selection. The applied performance measures (likelihood, AUC) can define the statistical significance of the model when compared with the equivalent performance measures of permuted-data models.

References

1. Marks LB, Yorke ED, Jackson A, et al. Use of normal tissue complication probability models in the clinic. Int J Radiat Oncol Biol Phys 2010;76:S10-S19.
2. Deasy JO, El Naqa I. Image-based modeling of normal tissue complication probability for radiation therapy. Cancer Treat Res 2008;139:211-252.
3. Dehing-Oberije C, De Ruysscher D, Petit S, et al. Development, external validation and clinical usefulness of a practical prediction model for radiation-induced dysphagia in lung cancer patients. Radiother Oncol 2010;97:455-461.
4. El Naqa I, Bradley J, Blanco AI, et al. Multivariable modeling of radiotherapy outcomes, including dose-volume and clinical factors. Int J Radiat Oncol Biol Phys 2006;64:1275-1286.
5. Das SK, Zhou S, Zhang J, et al. Predicting lung radiotherapy-induced pneumonitis using a model combining parametric Lyman probit with nonparametric decision trees. Int J Radiat Oncol Biol Phys 2007;68:1212-1221.
6. Egelmeer AGTM, Velazquez ER, de Jong JMA, et al. Development and validation of a nomogram for prediction of survival and local control in laryngeal carcinoma patients treated with radiotherapy alone: a cohort study based on 994 patients. Radiother Oncol 2011;100:108-115.
7. De Ruyck K, Sabbe N, Oberije C, et al. Development of a multicomponent prediction model for acute esophagitis in lung cancer patients receiving chemoradiotherapy. Int J Radiat Oncol Biol Phys 2011;81:537-544.
8. Xu CJ, van der Schaaf A, Schilstra C, et al. Impact of statistical learning methods on the predictive power of multivariate normal tissue complication probability models. Int J Radiat Oncol Biol Phys 2012;82:e677-e684.
9. Rubingh CM, Bijlsma S, Derks EPPA, et al. Assessing the performance of statistical validation tools for megavariate metabolomics data. Metabolomics 2006;2:53-61.
10. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer; 2009.
11. Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B 1974;36:111-147.
12. Westerhuis JA, Hoefsloot HCJ, Smit S, et al. Assessment of PLSDA cross validation. Metabolomics 2008;4:81-89.
13. Good PI. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer; 2000.
14. Langendijk JA, Doornaert P, Verdonck-de Leeuw IM, et al. Impact of late treatment-related toxicity on quality of life among patients with head and neck cancer treated with radiotherapy. J Clin Oncol 2008;22:3770-3776.
15. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996;58:267-288.
16. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1-22.
17. Breiman L. Bagging predictors. Mach Learn 1996;24:123-140.
18. Xu CJ, Hoefsloot HCJ, Smilde AK. To aggregate or not to aggregate high-dimensional classifiers. BMC Bioinformatics 2011;12:153.