

JOURNAL OF CHEMOMETRICS. J. Chemometrics 2004; 18: 92–102. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.852

Properties of prediction sorting

Ingunn Berget1,2* and Tormod Næs1,3
1 MATFORSK, Oslovn 1, N-1430 Ås, Norway
2 Department of Animal and Aquacultural Sciences, Agricultural University of Norway, Norway
3 Department of Mathematics, Statistics Division, University of Oslo, Oslo, Norway

Received 29 May 2003; Revised 20 March 2004; Accepted 20 March 2004

One of the major sources of unwanted variation in an industrial process is the raw material quality. However, if the raw materials are sorted into more homogeneous groups before production, each group can be treated differently. In this way the raw materials can be better utilized and the stability of the end product may be improved. Prediction sorting is a methodology for doing this. The procedure is founded on the fuzzy c-means algorithm, where the distance in the objective function is based on the predicted end product quality. Usually empirical models such as linear regression are used for predicting the end product quality. By using simulations and bootstrapping, this paper investigates how the uncertainties connected with empirical models affect the optimization of the splitting and the corresponding process variables. The results indicate that the practical consequences of uncertainties in regression coefficients are small. Copyright © 2004 John Wiley & Sons, Ltd.

KEYWORDS: raw material variability; sorting; robustness; fuzzy clustering

1. INTRODUCTION

Raw materials of different quality will in practice often need different process settings to obtain the same end product quality. Therefore variable raw material quality is a potential problem in commercial industry. One approach to this problem is to sort the raw materials into more homogeneous categories and process them with optimal process settings for each category [1–4]. In this way the raw materials can be better utilized. In addition, there will be less need for frequent process adjustment, because only one category will be processed at a time.

Figure 1 shows an overview of this methodology. The basic methodology is presented in References [1,2], the methodology is extended to the multiresponse situation in Reference [3] and modifications of the methodology for sorting raw materials into different product categories are proposed in Reference [4]. The cluster analysis used to identify the raw material categories and the corresponding process settings is a fuzzy cluster algorithm based on FCM (fuzzy c-means) [5]. Other algorithms may also be envisioned; see e.g. Reference [6].

The inputs to the clustering method in Figure 1 are: a model for predicting the end product quality; the target value for the end product; data for the raw materials to be sorted; and the number of categories into which the raw materials need to be sorted. Usually empirical models must be applied to predict the end product quality; in most situations, polynomial models are adequate for this purpose. The outputs are membership values that are used to split the raw materials into different categories, and optimal process settings for each category. Both the division of the raw materials and the process settings for each category are optimized in order to make the average predicted end product quality as close to the target as possible for all groups. The sorting procedure is therefore referred to as prediction sorting. This type of sorting has been shown to reduce the average loss substantially [1–4], but, since the methodology is new, there is a need for a better understanding of the method. The purpose of the present paper is to contribute to the acquisition of this type of insight.

In particular, we investigate the robustness of the clustering and of the optimization of the process variables to random variation in the estimated regression coefficients and to model choice. In addition, convergence properties of the methodology are discussed. The target, the raw materials to be sorted and the number of groups are considered to be fixed. We confine ourselves to problems with only one response and two groups.

The outline of the paper is as follows. In Section 2 a brief description of the methodology is given. In Section 3 a simulation study with constructed models is presented. The aim of this study is to compare the robustness properties of different models, with a special focus on first-order models versus models with second-order and interaction terms. In Section 4 it is shown how bootstrapping can be used to assess the uncertainty of the optimal process settings in real problems. Simulations are used to evaluate the proposed bootstrap procedure. In Section 5 the problem of model selection is illustrated with data from a baking experiment, and the bootstrap methodology from Section 4 is used to estimate the standard deviation of the optimal process variable.

*Correspondence to: I. Berget, MATFORSK, Oslovn 1, N-1430 Ås, Norway. E-mail: [email protected]

2. INTRODUCTION TO PREDICTION SORTING

2.1. Sorting procedure

It is assumed that a model for predicting end product quality (y) from raw material (z) and process (x) variables is available. This model is written as $\hat{y} = f(\mathbf{x}, \mathbf{z})$ and is assumed to be a polynomial in x and z (see e.g. Equations (7) and (10)–(13)). Here bold letters indicate vectors, such that $\mathbf{x} = [x_1, x_2, \ldots, x_{k_x}]$ and $\mathbf{z} = [z_1, z_2, \ldots, z_{k_z}]$. Note, however, that x and z could also be univariate, in which case italics will be used, i.e. $x$ and $z$.

The general sorting procedure described below aims at finding the best way of partitioning a fixed set of raw materials with N objects into C different categories with possibly different process settings. The partitioning is optimized with the purpose of minimizing the loss, which is the average squared deviation between the predicted end product quality and a predefined target value (T). When each group of raw materials is processed with different process settings, the distance between object i and category j can be defined as

$$d_{ij}^2 = (f(\mathbf{z}_i, \mathbf{x}_j^*) - T)^2 = (\hat{y}_{ij} - T)^2, \quad i = 1, \ldots, N; \; j = 1, \ldots, C \quad (1)$$

Here $\hat{y}$ is the predicted end product quality and $d_{ij}^2$ is the loss from object i when it is processed with the settings for the jth category.

The optimal partitioning of a fixed set of raw materials, Z, is obtained by minimizing

$$J(m, \mathbf{Z}) = \sum_{j=1}^{C} \sum_{i=1}^{N} u_{ij}^m \, d_{ij}^2 \quad \text{subject to} \quad \sum_{j=1}^{C} u_{ij} = 1 \quad (2)$$

Here $u_{ij}$ is the membership of object i in category j, which is a relative measure of how much object i belongs to the jth cluster. The membership values lie between zero and one. The parameter m determines the fuzziness of the clustering. Usually m is equal to two for fuzzy clustering [7]. If m is set equal to one, on the other hand, a crisp clustering with $u_{ij}$ equal to either one or zero is obtained.

The objective function J is minimized with respect to $\mathbf{D} = \{d_{ij}^2\}$ for given $\mathbf{U} = \{u_{ij}\}$ and vice versa. The algorithm for minimizing J is as follows.

1. Initialize U either randomly or according to prior knowledge.
2. Optimize x for given U and calculate the distances in Equation (1). The optimal x values are found from

$$\mathbf{x}_j = \arg\min_{\mathbf{x}^*} \left( \sum_{i=1}^{N} u_{ij}^m \, (f(\mathbf{z}_i, \mathbf{x}^*) - T)^2 \right) \quad (3)$$

3. Update the membership values according to

$$u_{ij} = \left( \sum_{t=1}^{C} \left( d_{ij}^2 / d_{it}^2 \right)^{1/(m-1)} \right)^{-1} \quad (4)$$

4. Check convergence. A convenient stopping criterion is to compare the criterion function with the value obtained in the previous iteration: in this case, stop if $|J_{\text{new}} - J_{\text{old}}| < \varepsilon$, else go to step 2. However, other stopping criteria can also be applied.

When the procedure has converged, the raw material categories are defined by allocating each object to the cluster where it has the largest membership value. For more details about the methodology see References [1–4].
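To make the four steps above concrete, they can be sketched in code. This is only an illustrative sketch, not the authors' implementation: the model f, the target T and the data in the usage example are hypothetical, step 2 is solved by a simple grid search over the feasible settings (any optimizer could be substituted) and the initialization uses the "prior knowledge" option of step 1, restricted to the C = 2 case studied in this paper.

```python
import numpy as np

def prediction_sorting(z, f, T, m=2.0, x_grid=None, eps=1e-9, max_iter=100):
    """Sort raw materials z into two categories (the paper's C = 2 case),
    each with its own process setting x, so that the predicted quality
    f(z, x) is as close as possible to the target T."""
    if x_grid is None:
        x_grid = np.linspace(-1.0, 1.0, 201)   # feasible process settings
    z = np.asarray(z, dtype=float)
    N, C = len(z), 2
    # Step 1: prior-knowledge initialization, tilting low-z objects to group 0
    U = np.full((N, C), 0.5)
    low = z <= np.median(z)
    U[low, 0], U[low, 1] = 0.9, 0.1
    U[~low, 0], U[~low, 1] = 0.1, 0.9
    # Losses for every (object, candidate setting) pair, reused in step 2
    losses = (f(z[:, None], x_grid[None, :]) - T) ** 2
    J_old = np.inf
    for _ in range(max_iter):
        # Step 2: per category, pick x minimizing sum_i u_ij^m (f(z_i, x) - T)^2
        x_opt = np.empty(C)
        d2 = np.empty((N, C))
        for j in range(C):
            x_opt[j] = x_grid[np.argmin((U[:, j] ** m) @ losses)]   # Equation (3)
            d2[:, j] = (f(z, x_opt[j]) - T) ** 2                    # Equation (1)
        d2 = np.maximum(d2, 1e-12)    # avoid division by zero in the update
        # Step 3: membership update, Equation (4)
        U = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))).sum(axis=2)
        # Step 4: stop when the criterion of Equation (2) stabilizes
        J = ((U ** m) * d2).sum()
        if abs(J - J_old) < eps:
            break
        J_old = J
    # Crisp categories: allocate each object to its largest membership
    return U.argmax(axis=1), U, x_opt
```

With a purely linear model such as f(z, x) = z + x and T = 0, the sketch splits a symmetric Z at its mean, consistent with the behaviour reported for the linear models in Section 3.3.2.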

Note that z is measured on a continuous scale and that there may not necessarily be any natural grouping in the data to be sorted. This is different from the regular use of clustering, where the focus is on finding groups in the data. The data in Z can be measurements for the raw materials to be sorted, or Z can consist of simulated data points reflecting the expected distribution of the incoming raw materials.

In practice, the sorting methodology could either work as a calibration method to construct a classification rule, or the membership values could be calculated for new samples directly. In the latter case, $d_{ij}$ is calculated with the optimal settings obtained from sorting of historical data, or of artificial data generated inside the region of interest from a distribution reflecting the expected distribution of the incoming raw materials.

2.2. Cluster validity and evaluation

Throughout this paper we will use the average loss for the evaluation of the clustering. This is directly interpretable since it focuses on the end product quality. The expected loss obtained with sorting will be compared with the loss obtained without sorting and is estimated as

$$\hat{E}(\text{loss}) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}(i) - T)^2 \quad (5)$$

where $\hat{y}(i)$ is the predicted end product quality of object i when it is assigned to the category where it has the highest membership value. Without sorting, all objects are allocated to one category.

Figure 1. Overview of the sorting methodology.

When it is natural to compare the loss for different models, we use the so-called gain, which is calculated as a percentage of the loss without sorting, i.e. as

$$G = 100 \left( 1 - \frac{\hat{E}(\text{loss})_{\text{sorted}}}{\hat{E}(\text{loss})_{\text{not sorted}}} \right) \quad (6)$$

Note that other aspects, e.g. the cost associated with sorting, may also be relevant to take into account when a concrete decision has to be made.
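Equations (5) and (6) amount to a few lines of code. The predicted values in this small numerical check are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

def expected_loss(y_hat, T):
    """Equation (5): average squared deviation of predictions from the target."""
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y_hat - T) ** 2))

def gain(y_hat_sorted, y_hat_unsorted, T):
    """Equation (6): loss reduction with sorting, as a percentage of the
    loss obtained without sorting."""
    return 100.0 * (1.0 - expected_loss(y_hat_sorted, T)
                    / expected_loss(y_hat_unsorted, T))
```

For example, with target T = 2, sorted predictions [1, 3] (loss 1) and unsorted predictions [0, 4] (loss 4), the gain is 75%.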

3. SIMULATION STUDIES

A simulation study was carried out in order to study the robustness properties of the discussed methodology. We wanted to compare the robustness properties of different types of models; however, we confine ourselves to models that are univariate in both x and z.

3.1. Constructed models

In each simulation, data were generated according to the statistical model

$$y = f(x, z) + e = y_0 + e \quad (7)$$

The random error e is normally distributed with variance $\sigma^2$. The term $y_0$ denotes the true end product quality, which would be equal to the measured quality y if there were no noise present in the data. Nine different models were tested using the general second-order model expressed as

$$y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \beta_4 x^2 + \beta_5 z^2 + e = y_0 + e \quad (8)$$

In each model the coefficients $\mathbf{b} = [\beta_0, \beta_1, \ldots, \beta_5]^{\mathrm{T}}$ are set in such a way as to cover a wide range of possibilities. The nine different models are presented in Figure 2.

It is assumed that the interval $[-1.0, 1.0]$ defines the region of interest for both variables and that the regression coefficients $\mathbf{b} = [\beta_0, \beta_1, \ldots, \beta_5]^{\mathrm{T}}$ are unknown and must be estimated from data. The regression coefficients are estimated by least squares (LS).

Figure 2 shows both the equations for the models and the response surface for the loss function, which is $L = (f(x, z) - T)^2$. For each model the target T is equal to $\beta_0$, which is the y value at the centre of the simulated design. The models are numbered 1–9. The first three models include only first-order terms in both x and z (i.e. $\beta_3 = \beta_4 = \beta_5 = 0$), whereas the rest are full second-order models. The models beneath each other have the same coefficients for the first-order terms and increasing coefficients for the second-order terms. In the first column the effects of x and z are of the same size. In the second column the effect of x is larger than that of z, whereas in the third column the effect of z is much larger than that of x. To exaggerate effects, the non-linear terms are made larger than they will typically be in most practical situations.

Figure 2. Contour plots of the loss function for the constructed models. The thick lines show the minimum loss for all z. Note that in model 3 the minimum line is on the borders of the investigated region, for low and high z. This is also the case for low z in model 6. In models 6 and 9 there exist no x values which give zero loss for high z.

3.2. Simulations

For each model, data were generated according to a replicated, full factorial design in x and z. The levels were $\{-1, 0, 1\}$ for both variables, and the whole design was replicated three times. This resulted in $3 \times 3 \times 3 = 27$ observations for each data set. The true y value ($y_0$) was calculated for all points in the design. Then normally distributed error was added to $y_0$ according to the statistical model described by Equation (7), with $\sigma = \{1, 2, 4, 6, 8, 10\}$. In total, 100 independent data sets with 27 observations were generated for each model at each level of $\sigma$. For each data set the LS estimate of $\mathbf{b}$ was obtained. For models 1–3 the regression coefficients were estimated by assuming both the full model in Equation (8) and the model with $\beta_3 = \beta_4 = \beta_5 = 0$.

A summary of the $R^2$ between the predicted and the observed y is given in Table I. The smallest $R^2$ obtained is approximately 0.75. We consider that for models with lower $R^2$ (larger $\sigma$) the sorting algorithm will not be relevant because of poor prediction ability.

The sorting procedure was carried out for both the true and the estimated models. The N data points ('batches') to be sorted were set to be $\mathbf{Z} = [-1.00, -0.99, \ldots, 0.99, 1.00]$, i.e. points uniformly distributed over the region of interest. The sorting algorithm was applied to sort these points into $C = 2$ groups. In the optimization of the process variables, x was constrained to be within $[-1.0, 1.0]$.
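The data-generating and estimation steps above can be sketched as follows. The design matrix and least squares fit follow Equation (8); the coefficient vector `b_true` is a hypothetical placeholder, since the actual Figure 2 coefficients are not listed in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Replicated 3 x 3 full factorial design in x and z, three replicates -> 27 runs
levels = np.array([-1.0, 0.0, 1.0])
xg, zg = np.meshgrid(levels, levels)
x = np.tile(xg.ravel(), 3)
z = np.tile(zg.ravel(), 3)

# True second-order model of Equation (8); these coefficients are hypothetical
b_true = np.array([80.0, 4.0, 4.0, 1.0, -2.0, -2.0])
D = np.column_stack([np.ones_like(x), x, z, x * z, x ** 2, z ** 2])
y0 = D @ b_true                                   # noise-free response
y = y0 + rng.normal(scale=4.0, size=y0.shape)     # sigma = 4, one studied level

# Least squares estimate of the full model
b_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
```

Repeating this 100 times per model and noise level, and running the sorting on each `b_hat`, reproduces the structure of the simulation study.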

3.3. Results

3.3.1. Convergence properties

When the sorting algorithm was applied to the true models, it converged in 5–15 iterations, and there was no sensitivity to different initializations. When applied to the estimated models, it converged in fewer than 40 iterations for all simulations. The first-order models converged faster than the second-order models.

Generally the procedure was insensitive to the initialization of U. However, it was observed that for some data sets of models 3, 5 and 9 a few initializations led to $u_{ij} = 0.5$ for all i and j. This complete fuzzy solution cannot be used to partition Z; nevertheless, it is always a stable solution, because it gives $d_{i1} = d_{i2}$ for all i. Hence there will be no change when u and d are updated in steps 2 and 3 of the algorithm. Although no sensitivity to different initializations has been detected in previous work, or for the other models, these observations show that the sorting procedure may not be completely insensitive to the starting point. Therefore different initializations should always be tested before the final conclusions are drawn.

In approximately 5% of the simulations of model 6 the complete fuzzy solution was obtained not only for a few but for all initializations of U when there were constraints on x. However, when the constraints were removed, a partitioning of Z could be obtained in all simulations. This shows that the partitioning is affected by active constraints on the process variables and that with such constraints a partitioning may be difficult to obtain, because the process variables that minimize the loss for a group of raw materials are outside the feasible region.

3.3.2. Membership values

The membership values for the true models are shown in Figure 3.

For the linear models the splitting point equals zero, which is the mean of Z. When there are no active constraints on x, there is no variation in the membership values. With model 3 there are active constraints, hence the membership values are variable. If the full model in Equation (8) is assumed, the membership values are variable for all three linear models. These results are confirmed in unpublished results and are expected to hold for all linear models when the points in Z are symmetrically distributed.

For the non-linear models, i.e. the models where second-order and interaction terms are added, the splitting point is shifted away from zero. In extreme cases, second-order terms may lead to several splitting points (model 8, Figure 3). For all the non-linear models the membership values varied between different simulations. Generally, points close to the splitting point, or other points with membership values close to 0.5, e.g. the end points, had more variable membership values than the rest. Moreover, the variability of the membership values was smaller for models in the third row than for those in the second row. This means that high second-order terms may stabilize the membership values.

Although the membership values varied between the simulations for most models, this variation had in most cases little influence on the splitting point, as can be seen in Figure 4.

The misclassification rates, estimated as the fraction of points classified differently than with the true model in each simulation, are less than 5% for all models except models 5 and 6. These two models also have, together with model 2, higher standard deviations of the splitting point (Figure 4). For the other models, however, the small standard deviation, the small bias between the true and the average splitting point and the small error rate indicate that for this type of data the splitting can be considered robust towards noise in the data used to estimate the prediction model.

Table I. Summary of R² results for simulated data. Maximum and minimum taken over models. Average taken over all simulations.

                                σ=1      σ=2      σ=4      σ=6      σ=8      σ=10
Maximum average R²              0.9988   0.9947   0.9808   0.9575   0.9263   0.8901
Minimum average R²              0.9972   0.9886   0.9572   0.9093   0.8480   0.7913
Maximum standard deviation
of R²                           0.0008   0.0038   0.0154   0.0321   0.0490   0.0617


3.3.3. Robustness of optimal x

When there were no active constraints on x, the optimal x in each group was approximately normally distributed (Kolmogorov–Smirnov test). However, for the non-linear models the variance is different for the two groups. The mean values and standard deviations are presented in Figure 5. The standard deviation of the optimal x increases as the error variance $\sigma$ increases. This is most evident when the full model is assumed when estimating model 3, and for one group with model 6. Also for model 3 there is a bias in the optimal x when $\sigma$ increases. The reason for this is that, as the error variance increases, the constraints on x become more relaxed. The difference between the optimal x for the two groups depended on the linear effect of z relative to that of x. The difference is smallest when the linear effect of z is small compared with that of x (models in the second column), and largest when the linear effect of z is large compared with that of x (models in the third column). Note that for models 2 and 5 the values of average optimal x ± one standard deviation for the two groups overlap when $\sigma$ increases.

3.3.4. Gain

Sorting reduces the loss in all cases. The absolute loss is approximately the same for the models in the first and the third column. The loss for the models in the second column is nearly 40 times smaller. The gain is approximately 75% for the models in the first two columns and 50% for those in the last column (Figure 6). If both gain and absolute loss are considered, the potential of the sorting methodology is best for situations similar to the models in the first column, i.e. when the effects of process and raw material variables are of the same size.

As can be seen in Figure 6, the average gain is generally close to the gain obtained with the true models. In model 3, however, the gain is underestimated when the noise level increases. Again this is related to the fact that the constraints on x are active. Figure 6 also shows that the gain is most variable when there is a large effect of the raw material variables compared with that of the process variables, especially for large $\sigma$. Looking at the standard deviations, it is clear that in some situations the gain is sensitive to noise in the data used to estimate the regression coefficients, whereas in other cases such noise has no influence on the expected gain.

3.4. Summary

In the presence of normally distributed errors in the data used to estimate the prediction model, the membership values, optimal x values and gain estimated by the prediction sorting algorithm have random variation. In most of the cases investigated, the effects on the final splitting point and the misclassification rate are negligible. The variances of the optimal x are different for the various groups when there are second-order or interaction terms in the prediction model. Generally the variance of the optimal x values is small, except for cases where the linear effect of x is large compared with that of z.

On average the gain is between 50 and 75% for the studied models. However, the variance of the gain is quite large for models where there is a large effect of z compared with that of x. This means that in such situations it may be difficult to obtain a good estimate of the expected gain.

Figure 3. Membership curves for the true models. The y-axis represents the membership values and the x-axis represents the values of the z variable. Full and dotted lines represent membership values for groups 1 and 2 respectively.

Active constraints on x increase the sensitivity to random error. The reason for this is that the constraints that are active for the true model may or may not be active for the estimated model. If both the upper and lower constraints are active, as in model 3, the raw material variability is too large to obtain good process settings for either of the groups. However, sorting will improve the product quality because it allows for different settings for each subgroup. If there is only one active constraint for both groups, the raw material quality is generally too poor for the feasible process settings. In any case, in the presence of active constraints, sorting should not be applied without precaution.

Models 2 and 5 are not appropriate for sorting because of the small difference between the optimal x values relative to their standard deviation (Figure 5). This is also to be expected for other models where the effect of x is large compared with that of z. The results also show that sorting should be used with precaution when there are active constraints on x for one or both of the groups (models 3 and 6).

4. BOOTSTRAPPING FOR STUDYING ROBUSTNESS OF PROCESS VARIABLES

4.1. Bootstrap procedure

A possible bootstrap procedure to be used for investigating the robustness properties of the sorting methodology in real problems is described below.

Figure 4. Average splitting point and standard deviation for all simulations. The straight line marks the splitting point obtained with the true model.

Assume that a polynomial model has been fitted to the available data. Then generate B bootstrap samples by sampling from the residuals of the original fitted model. Next, estimate the model parameters for each of the B bootstrap samples and carry out the sorting for all estimated models. This leads to B different estimates of the membership values and of the corresponding optimal process settings within each category. From these bootstrap estimates, the mean and variance of the process variables can be estimated for each group. If desired, confidence regions or intervals can be constructed from the bootstrap results [8–14]. In addition, plots of all the solutions will be informative with respect to the sensitivity of the obtained solution. In practice, it would also be natural to compare the variability in the optimal settings with the natural noise in the process variables.
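A minimal sketch of this residual bootstrap, assuming a least squares polynomial fit; the function name `bootstrap_optimal_x` and the user-supplied `sorting` callback (which maps estimated coefficients to the optimal setting per category) are hypothetical names, not from the paper.

```python
import numpy as np

def bootstrap_optimal_x(D, y, sorting, B=100, seed=0):
    """Residual bootstrap for the optimal process settings.

    D: design matrix of the fitted polynomial model
    y: observed responses
    sorting: callback taking estimated coefficients and returning an
             array with the optimal x for each category
    Returns the bootstrap mean and standard deviation per category.
    """
    rng = np.random.default_rng(seed)
    # Fit the original model and extract its residuals
    b_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    fitted = D @ b_hat
    resid = y - fitted
    x_boot = []
    for _ in range(B):
        # Resample residuals with replacement and refit the model
        y_b = fitted + rng.choice(resid, size=len(y), replace=True)
        b_b, *_ = np.linalg.lstsq(D, y_b, rcond=None)
        # Sort with the re-estimated model
        x_boot.append(sorting(b_b))
    x_boot = np.array(x_boot)                 # shape (B, C)
    return x_boot.mean(axis=0), x_boot.std(axis=0, ddof=1)
```

In practice `sorting` would run the prediction sorting algorithm of Section 2.1 with the re-estimated coefficients; the returned mean and standard deviation correspond to Equations (9) and (10) below.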

4.2. Testing the bootstrap methodology

A simple simulation study was conducted in order to evaluate the bootstrap procedure. Model 4 was chosen for this purpose because of its medium complexity.

Data were first simulated to get an estimate of the true variance in x. In total, 10 000 simulations were carried out. The data were generated according to the underlying statistical model (Equation (8)) using $\sigma = 8$ and the coefficients defined by model 4. Each simulated data set represents the possible outcome of collecting a new data set. The model was fitted to each data set, and the sorting was carried out for all the estimated models. The mean and standard deviation of the optimal x in each group are listed in Table II.

For the bootstrap simulations, 100 different data sets were

first generated in the same way as described above. The

regression coefficients were estimated for all data sets. Then

Figure 5. Averageoptimalxandstandarddeviation forallsimulations.Thestraight linesmark truevalues.

98 I. Berget and T. Næs

Copyright # 2004 John Wiley & Sons, Ltd. J. Chemometrics 2004; 18: 92–102

B ¼ 100 bootstrap samples were generated from the resi-

duals of each model, and the regression coefficients were

also estimated for all the bootstrap samples. Then the sorting

procedure was carried out for all estimated models (in total,

100 � 100 different estimates). The mean and standard

deviation over all bootstrap samples for simulation number i

were ði ¼ 1; . . . ; 100Þ

xi�j ¼ 1

B

XBb¼1

xi;�bj ð9Þ

sttdðxjÞi� ¼1

B� 1

XBi¼1

ðxi;�bj � xi�j Þ2 ð10Þ

Here xi;�bj is the optimal x value of group j obtained from the

bth bootstrap sample generated from the ith simulation

ði ¼ 1; . . . ; 100Þ. Equations (9) and (10) estimate EðxjÞ and

stdðxjÞ respectively. The average of the means and the mean

standard deviation are given in Table II together with the

results from the simulations. The standard deviation of the

standard deviations from the bootstrap method, denoted s�,

Table II. Results fromsimulationsandbootstrap

Group 1 Group 2

True value �0.539 0.477Mean of simulations �0.539 0.473Mean of bootstrap �0.536 0.483

Standard deviation simulations 0.0936 0.1875Mean standard deviation bootstrap 0.0860 0.1532Standard deviation of bootstrap 0.024 0.035standard deviations, s�

Figure 6. Average gain and standard deviation for all simulations. The straight lines mark results withtruemodels.

Properties of prediction sorting 99

Copyright # 2004 John Wiley & Sons, Ltd. J. Chemometrics 2004; 18: 92–102

is also given. Table II shows that the overall mean is close to

the true x value in each group but that the mean standard

deviation underestimates the standard deviation of x.
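The resampling scheme above can be sketched for a plain least squares model. This is a generic illustration rather than the authors' code: the function name and the toy data are ours, and in the paper the refitted coefficients would additionally be passed through the sorting procedure before the quantities in Equations (9) and (10) are computed.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_bootstrap(X, y, B=100):
    """Residual bootstrap for OLS coefficients.

    Fit y = X @ beta, resample the residuals B times with
    replacement, refit on each resampled response, and return the
    bootstrap mean and standard deviation of the coefficients
    (cf. Equations (9) and (10)).
    """
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    boot = np.empty((B, X.shape[1]))
    for b in range(B):
        y_star = fitted + rng.choice(resid, size=len(y), replace=True)
        boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)
    return boot.mean(axis=0), boot.std(axis=0, ddof=1)

# Toy data: y = 1 + 2x + noise
x = np.linspace(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
boot_mean, boot_std = residual_bootstrap(X, y, B=200)
```

Resampling residuals rather than observations keeps the design matrix fixed, which matches the designed-experiment setting here.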

5. BAKING EXAMPLE

In this section the bootstrap procedure is illustrated on a real

data set. To investigate the importance of model selection in

the sorting context, the sorting results from four different

models fitted to the same data will be compared. The data

come from a baking experiment originally presented in

Reference [4] and briefly described below.

5.1. Experimental

The objective of this example is to optimize bread loaf

volume of hearth bread, which is bread baked without a

pan, when flours of different quality are used for baking. The

process variable to be optimized is proofing time. This is the

resting time from when the breads are formed until they

are baked. Typically the volume increases as the proofing

time increases; however, if the proofing time is too long, the

breads become flat. The target for bread loaf volume was

therefore set at an intermediate level. Another option would

have been to include both form and volume in the sorting

criterion as in References [3,4].

A mixture design was applied to generate flours of differ-

ent quality. The flour blends were mixtures of two different

wheat varieties and a wheat/bran mix. The wheat varieties

differed in protein content and quality, so the three sources

of variation in this design were bran, protein content and

protein quality. Protein content and quality were, however,

confounded. Therefore only protein content was used to-

gether with bran to describe the raw material quality. The

protein content was measured in per cent. The bran fractions

were determined from the design.

For each blend the proofing time was varied at three levels

(30, 40 and 50 min). Because of practical limitations, proofing

time was nested within dough, such that all doughs were

divided into three pieces which were given different proof-

ing times. The mixtures were only blended once, but two

doughs were made separately from each blend. Therefore

there is also a split-plot structure in the experiment. In the

original experiment, doughs were made both with and

without an additive called DATEM. Here we only consider

breads baked with DATEM. In total, there were 60 observa-

tions for the data set to be used in this paper.

5.2. Fitted models

An important part of developing empirical models is to

choose the best model. Many strategies and criteria for

variable selection exist, but the various strategies and criteria

may not always give the same results. Therefore it can

sometimes be difficult to determine which model is best.

Moreover, in some situations the established methods for

variable selection may not be adequate, as in the presented

split-plot example where regular variable selection based on

t-tests is more difficult to justify than in fully randomized

experiments.

Clearly the model choice may have an impact on both the

splitting and the optimal process settings. Four different

models fitted to the data from this experiment are presented

below. Since variable selection as such is not the main focus

here, we have used a simple ad hoc procedure for model

selection based on the R2, the adjusted R2 and residual plots

for the different variables.

Let v be the predicted volume, x the proofing time, z1 the

protein content and z2 the bran fraction of each flour blend.

The predictor variables x, z1 and z2 are centred and scaled.

The selected models were as follows.

A. Linear model:

$$v = 362.57 + 25.44x + 21.28z_1 - 30.70z_2 \qquad (11)$$

B. Model with only significant terms:

$$v = 378.15 + 25.44x + 21.83z_1 - 21.54z_2 - 5.07z_1 z_2 - 15.85z_2^2 \qquad (12)$$

C. Full second-order model:

$$v = 385.83 + 25.44x + 21.19z_1 - 22.73z_2 + 4.08xz_1 - 6.99xz_2 + 2.61z_1 z_2 - 5.02x^2 - 3.68z_1^2 - 16.19z_2^2 \qquad (13)$$

D. Full second-order model with $z_2^3$ added:

$$v = 390.79 + 25.44x + 20.86z_1 - 32.73z_2 + 4.08xz_1 - 6.99xz_2 + 2.36z_1 z_2 - 5.02x^2 - 3.93z_1^2 - 25.12z_2^2 + 7.16z_2^3 \qquad (14)$$

The first model (A) is a linear model. The second model (B)

is obtained by fitting a full second-order model, but only the

significant terms are kept in the model. Significance is

determined by regular t-tests, which are not exact in this

case because of the split-plot structure. The third model (C)

is the full second-order model, whereas the fourth model (D)

is the full second-order model with the third order effect of

bran (z2) added. This last term was added because there

were systematic errors in the residuals for the full second-

order model. In Reference [4] the most complex model,

model D, was used. The regression diagnostics for each

model are listed in Table III. Note that, with the exception

of the linear model which has a poorer fit than the rest, all

models have approximately the same performance with

respect to the cross-validated RMSEP.
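As an illustration of how such a fitted model is used inside the sorting procedure, the full second-order model C (Equation (13)) can be coded directly as a prediction function. The function name is ours, and the inputs are assumed to be centred and scaled as in the paper.

```python
def volume_model_c(x, z1, z2):
    """Predicted bread loaf volume from the full second-order
    model, Equation (13): proofing time x, protein content z1 and
    bran fraction z2 (all centred and scaled)."""
    return (385.83 + 25.44 * x + 21.19 * z1 - 22.73 * z2
            + 4.08 * x * z1 - 6.99 * x * z2 + 2.61 * z1 * z2
            - 5.02 * x**2 - 3.68 * z1**2 - 16.19 * z2**2)

# At the centre of the design the prediction equals the intercept:
centre_volume = volume_model_c(0.0, 0.0, 0.0)  # 385.83
```

The sorting procedure evaluates such a function repeatedly while optimizing x within each candidate group.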

5.3. Comparison of models

For each of the four models, B = 1000 bootstrap samples

with N ¼ 60 observations in each sample were generated

and the regression coefficients were estimated for each

sample. Then the sorting procedure was carried out for all

estimated models. The data to be sorted were selected as a

uniform grid of points generated in the region spanned by z1

Table III. R², adjusted R² and calibrated and validated RMSEP (root mean square error of prediction) for each model (A–D)

               A        B        C        D
R²             0.768    0.893    0.910    0.920
Adj. R²        0.755    0.883    0.893    0.904
RMSEP (cal)    20.61    14.01    12.05    12.05
RMSEP (val)    22.38    17.02    16.99    16.40


and z2 (see Figure 7). Hence the sorting is carried out on a

different data set than the data from the experimental de-

sign. By using a uniform grid of points, it is assumed that the

incoming flours will be evenly distributed within the experi-

mental region. The generated points simulate flours of

different quality varying in protein content/quality and

bran content. They were sorted into two groups and the

target was set equal to T = 370 ml. The proofing time was

restricted to be within the experimental region determined

by the interval [30,50] min.
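A grid of this kind can be generated as below. This is a schematic sketch: we use a hypothetical rectangular region in the scaled (z1, z2) coordinates, whereas the paper's grid of N = 219 points fills the actual, non-rectangular experimental region.

```python
import numpy as np

# Uniform grid over an assumed rectangular (z1, z2) region in the
# centred/scaled coordinates; each row is one simulated flour.
z1_vals = np.linspace(-1.0, 1.0, 15)
z2_vals = np.linspace(-1.0, 1.0, 15)
Z1, Z2 = np.meshgrid(z1_vals, z2_vals)
grid = np.column_stack([Z1.ravel(), Z2.ravel()])  # shape (225, 2)
```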

In approximately 1% of the bootstrap replicates of each

model the sorting procedure did not converge. Otherwise

the sorting algorithm converged in less than 90 iterations

(Table IV). In 3–5% of the bootstrap replicates of the two

most complex models (C and D) the algorithm ended up

with all membership values equal to 0.5 in both groups.

These cases were kept out of the further analyses. A sum-

mary of the results is given in Table IV.

The membership values varied between the bootstrap

replicates and between the four models investigated. Based

on the membership values from all bootstrap simulations,

the grid points could be characterized as belonging to

group 1, group 2 or the splitting region. The splitting region

consists of all points which have a member of the other

group as one of their nearest neighbours in one or more of

the bootstrap replicates. In Figure 7, all the grid points are

shown using different ‘colouring’ for the two groups and a

different symbol for points within the splitting region. The

flours in group 1 have a low bran fraction and a high protein

content, whereas group 2 consists of points representing

flours with a low protein content (left) or a high bran content

(upper points).

The separation obtained with model A was most different

from the others. For the other three models the splitting

regions are similar and overlapping. Based on the results in

Section 3, we believe that the variability in the splitting

regions has small practical consequences.

For a more thorough investigation of the consequences of

the model choice, the bread loaf volume was predicted for all

points in the grid with the partitioning and optimal process

Table IV. Convergence properties: minimum and maximum number of iterations before convergence. Average gain obtained by sorting (%) and optimal x without sorting and for the two groups with sorting. Standard deviations are given in parentheses

Model   Convergence   Gain obtained    Optimal x         Optimal x      Optimal x
        properties    by sorting (%)   without sorting   group 1        group 2
A       26–77         69.75 (1.09)     42.60 (0.91)      35.64 (1.26)   47.68 (1.27)
B       24–70         57.30 (3.98)     39.69 (0.68)      34.71 (0.86)   45.10 (0.87)
C       14–95         62.99 (3.72)     37.84 (1.07)      33.12 (1.58)   46.50 (1.15)
D       9–88          64.49 (3.64)     37.12 (1.17)      32.28 (1.49)   47.79 (1.26)

Figure 7. The grid points. Full circles mark points allocated to group 1 for all bootstrap replicates, whereas open circles mark points allocated to group 2. The dots mark points that are on the border between the two groups for one or more bootstrap replicates. For each model the region with dots is referred to as the splitting region.


settings obtained in each bootstrap replicate for all four

models. Then ŷ at each point was averaged over all bootstrap

simulations, and the mean difference between models k and l

was calculated as

$$\text{mean difference} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\bar{\hat{y}}_{ik} - \bar{\hat{y}}_{il}\right)^2} \qquad (15)$$

where N = 219 is the number of points in the grid and $\bar{\hat{y}}_{ik}$ and $\bar{\hat{y}}_{il}$ are the mean predicted values of point i from models k and l respectively. The calculated values are listed in Table V.
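Equation (15) amounts to a root mean squared difference between the two vectors of bootstrap-averaged predictions; a minimal implementation (function name ours):

```python
import numpy as np

def mean_difference(ybar_k, ybar_l):
    """Root mean squared difference between the mean predicted
    volumes of models k and l over the grid points (Equation (15))."""
    ybar_k = np.asarray(ybar_k, dtype=float)
    ybar_l = np.asarray(ybar_l, dtype=float)
    return float(np.sqrt(np.mean((ybar_k - ybar_l) ** 2)))
```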

In practice, model C or D would have been chosen,

because these models have the largest adjusted R2 and the

smallest RMSEP. The mean difference between these two models is 5.84 (Table V). This means that if the 'wrong' model were

chosen, the average error for the predicted values would be

approximately 5.84. For a comparison, the validated RMSEP

values for these models are 16.99 and 16.40. This indicates

that the error of prediction is clearly larger than the potential

error of choosing the wrong model. These two models also

had very similar mean values and standard deviations for

gain and optimal x (Table IV).

The most complex model is perhaps less robust than the

others, because some bootstrap replicates gave solutions that

did not lead to a splitting. Otherwise the robustness of the

membership values and process variables does not seem to

be too much affected by model choice. These results show

that, for models that have approximately the same perfor-

mance with respect to commonly used regression diagnos-

tics, it may not matter too much which model is chosen.

6. CONCLUSIONS

Simulations of constructed models and bootstrapping of

existing data have been applied to investigate robustness

properties of the prediction sorting methodology. The results

show that, when normally distributed error is present in the

data used to estimate the prediction model, both member-

ship values and process variables will be uncertain. How-

ever, the simulations show that the classification error due to

variability in membership values is low. If the effect of the

process settings is large compared with that of the raw

material variables, the variability in the optimal x may be

larger than the difference between the optimal x in the two

groups. The variability of the gain depends on the model

used for prediction, but there is always a positive gain by

sorting, and in some situations the gain is hardly affected at

all. The variability in the gain was largest for models where

the raw materials (z) had a large effect relative to the process variables (x).

In real problems, bootstrapping may be applied to assess

the uncertainty of the optimal process variables. Simulation

results with a constructed model indicate that the bootstrap

has reasonable properties but underestimates the standard

deviation of the optimal process settings. Bootstrap results

with several models fitted to the same data show that, for

models that have reasonably good performance with respect

to the usual regression diagnostics, the possible error of

choosing the ‘wrong’ model will be small compared with

the prediction error.

Acknowledgements

The authors would like to thank Dr Øyvind Langsrud

and Dr Gunvor Dingstad for helpful comments on the

manuscript.

REFERENCES

1. Næs T, Mevik BH. The flexibility of fuzzy clustering illustrated by examples. J. Chemometrics 1999; 13: 435–444.

2. Berget I, Næs T. Optimal sorting of raw materials based on predicted end product quality. Qual. Eng. 2002; 14: 459–478.

3. Berget I, Næs T. Sorting of raw materials with focus on multiple end product properties. J. Chemometrics 2002; 16: 263–273.

4. Berget I, Aamodt A, Færgestad EM, Næs T. Optimal sorting of raw materials for use in different products. Chemometrics Intell. Lab. Syst. 2003; 67: 79–93.

5. Bezdek JC. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum: New York, 1981.

6. Rousseeuw PJ, Kaufman L, Trauwaert E. Fuzzy clustering using scatter matrices. Comput. Statist. Data Anal. 1996; 23: 135–151.

7. Krishnapuram R, Keller JM. The possibilistic C-means algorithm: insights and recommendations. IEEE Trans. Fuzzy Syst. 1996; 4: 385–396.

8. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall: New York, 1993.

9. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis (3rd edn). Prentice-Hall: Englewood Cliffs, NJ, 1982.

10. Box GEP, Hunter JS. A confidence region for the solution of a set of simultaneous equations with an application to experimental design. Biometrika 1953; 41: 190–199.

11. Mevik BH, Færgestad EM, Ellekjær MR, Næs T. Using raw material measurements in robust process optimization. Chemometrics Intell. Lab. Syst. 2001; 55: 133–145.

12. Owen A. Empirical likelihood ratio confidence regions. Ann. Statist. 1990; 18: 90–120.

13. Hall P. On the bootstrap and likelihood-based confidence regions. Biometrika 1987; 74: 481–493.

14. Yeh AB, Singh K. Balanced confidence regions based on Tukey's depth and the bootstrap. J. R. Statist. Soc. B 1997; 59: 639–652.

Table V. Square root of the mean squared difference between predicted volumes with the different models when sorted and with optimized proofing times for each group

         A (linear)   B (significant terms)   C (full 2nd order)   D (C + z2³)
A        0.00         11.86                   13.78                15.16
B        11.86        0.00                    9.01                 11.03
C        13.78        9.01                    0.00                 5.84
D        15.16        11.03                   5.84                 0.00
