


Variable Selection Techniques Implemented in Procedures of the SAS Software

Olaf Gefeller1 and Rainer Muche2

1Department of Medical Statistics, University of Göttingen, Humboldtallee 32,
W-3400 Göttingen, Germany

2Clinical Documentation, University of Ulm, Schwabstraße 13,
W-7900 Ulm, Germany

ABSTRACT

Different variable selection techniques are often employed in practical data analysis as part of the statistical modelling process to reduce the number of variables to be included in a 'final' analysis of the relationship under study. For example, in regression-type applications of these techniques the goal is to obtain a final model fitting the observed data appropriately, which should (i) lead to stable estimates of the model parameters, (ii) predict accurately future values of the dependent variable based on the knowledge of all explanatory variables in the model, and (iii) allow a comprehensible interpretation of the relationship under investigation. The paper briefly describes several popular variable selection techniques in different statistical contexts and examines their implementation in procedures of the SAS software. A complete coverage of all opportunities to use these techniques within the SAS software, in particular in the SAS/STAT, SAS/ETS, SAS/OR, and SAS/QC components, is given. Selection criteria used by the different procedures, statistical details of the involved algorithms and syntactic specifications are compared. Finally, a critical discussion of the present status of implementation of variable selection techniques in the SAS software from a statistical point of view is provided.

1. INTRODUCTION

In a variety of different contexts empirical studies are often devoted to the analysis of the relationship between some response variable and a multitude of explanatory variables potentially influencing the dependent variable. Mostly, information on numerous explanatory variables is gathered, and during the phase of data analysis these are 'statistically screened' to find the most important ones for an appropriate description of the relationship. In this situation variable selection techniques are routinely employed as part of statistical modelling of the observed data to reduce the number of explanatory variables and to find the best-fitting subsets of variables.

The process of variable selection can be viewed either as part of exploratory data analysis or as part of the inferential process providing the basis for decision making. In the first case the aim is to seek out the most important explanatory variables as components of the final model providing an adequate description of the relationship under study. No additional statistical conclusions based on the final model are drawn in this situation. In the second case variable selection procedures are commonly used to yield a final model as a basis for further statistical inference. Estimation and prediction are conducted in the recommended model as if this subset of explanatory variables had been chosen a priori (without reference to the data). Both cases have to be distinguished in the discussion of advantages and disadvantages of variable selection techniques, as the specific framework for the application of these procedures leads to different statistical problems. In the first case the analysis is of a purely descriptive nature and hence need not take account of the multitude of statistical significance tests conducted, while the second approach needs to pay attention to the inflation of the type I error of the whole selection procedure.

A problem frequently encountered in practical data analysis arises from multicollinearity, meaning a nearly exact linear relation between potential explanatory variables. It may result in parameter estimates with high variance which consequently are unreliable and can be far away from the true value. Some variable selection procedures fail to recognize particular variables or even important factor combinations in the presence of multicollinearity. Variables that seem important when analyzed alone may appear non-influential in the presence of others, and conversely, the effect of some variables is only observed in the presence of others. Therefore, parameter estimates obtained in each model considered during the selection process may not be reliable, either in magnitude or in direction. Usually, in multicollinear data, there is no unique order of importance of single variables, and hence variable selection or elimination based on considering only one variable at a time may be misleading.

In general, variable selection techniques as part of statistical modelling are applied to obtain an appropriate model fitting the data at hand which should (i) lead to stable parameter estimates, (ii) predict accurately future values, and (iii) allow a comprehensible interpretation since only a few important explanatory variables are selected. The choice of a particular variable selection algorithm to be employed in practice usually depends not only on statistical properties but - more importantly - on the availability of the corresponding software. Since SAS is the leading software package for statistical analyses, it is important to investigate which types of variable selection techniques are implemented in different components of the SAS software, as this will have a major impact on the current practice of variable selection in applications. This paper serves as a guide to the different realizations of variable selection techniques in various procedures.

2. STATISTICAL BACKGROUND

2.1. Stepwise Methods

A group of popular and commonly employed approaches to select variables has been termed stepwise methods; these are described in detail in most textbooks on regression methods (e.g. Draper and Smith, 1981). Three main techniques can be identified: forward selection, backward elimination, and stepwise regression. These procedures add and/or delete variables according to some special criterion, for example, a statistical test of the estimated model parameter corresponding to the variable under study.

The forward selection procedure starts with an 'empty' regression equation and successively adds one variable at a time until all explanatory variables are selected or until a stopping rule is satisfied. The criterion for variable selection relates to the value of the test statistic for the single parameter in the regression model including all variables already entered and the variable under consideration. At any step the variable with the largest value of the test statistic is added to the current regression equation if its value exceeds a prespecified critical bound depending on the chosen significance level. Otherwise, the procedure stops, and the current regression model is referred to as the final model.

In contrast, the backward elimination procedure begins with the full model including all explanatory variables, which are then eliminated one at a time. A variable is excluded from the regression equation at any step if its test statistic, derived from a regression model consisting of all non-eliminated variables, has the smallest value and does not exceed a specified critical bound. If there is no further variable fulfilling the elimination criterion, the current model is called the final model.

Stepwise regression is a procedure combining both techniques. It is a forward-orientated approach incorporating in part the backward idea. The procedure starts like forward selection, including one variable at a time, but after each selection step an additional elimination step is inserted. Thus, it is possible to exclude previously added variables if they are less influential in the current model. The procedure yields the final model when no further inclusions or exclusions of explanatory variables are necessary and possible.
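
In SAS, all three stepwise methods can be requested through the SELECTION= option of the MODEL statement; a minimal sketch using PROC REG (the dataset name mydata and the variables y, x1-x5 are hypothetical):

```sas
/* Forward selection: enter variables while the p-value is below SLENTRY */
proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=forward slentry=0.05;
run;

/* Backward elimination: remove variables while the p-value exceeds SLSTAY */
proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=backward slstay=0.10;
run;

/* Stepwise regression: forward steps with interspersed elimination steps */
proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=stepwise slentry=0.05 slstay=0.10;
run;
```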

The stepwise methods attempt to obtain a final model by employing similar statistical techniques, but they approach the problem from different directions. Although they select variables by applying similar criteria, they need not yield identical final models. Illustrative examples have been provided in the literature showing that forward selection and backward elimination can lead to drastically different results (McGee et al., 1984). It is even possible that the first variable entered in forward selection is the first variable deleted in backward elimination (Gunst and Mason, 1980). Difficulties may arise in model building when a variable selected at the beginning turns out to be unnecessary or when an effect of a variable is only observed in the presence of others. Obviously, under such circumstances stepwise methods cannot lead to equal and reliable final models (Hocking, 1976). In addition, important models may be overlooked by all stepwise techniques because of the restriction of considering only one explanatory variable at a time (Mantel, 1970).

Since variable selection and elimination are subject to the prespecified significance level, this leads to another criticism of stepwise methods. Strategies that use a significance level for entering variables ignore the number of multiple comparisons actually made (Harrell et al., 1984). Employing statistical tests with a prespecified significance level protects at each selection or elimination step against erroneously including an in fact non-influential variable, but there is no protection against the overall error of including at least one variable in the final model which actually has no effect on the response variable (see e.g. Aitkin, 1971).

2.2. Other Methods

Although stepwise methods represent the most popular group of variable selection procedures (at least in regression-type applications), the repertory of computational tools attempting to find the important variables for a final analysis of the observed data is by no means limited to this group. A variety of other variable selection procedures have been proposed in the literature; for an overview of regression-based methods like, for example, best subset selection or all possible regressions see Draper & Smith (1981).

For medical applications Harrell et al. (1984) have recommended incomplete principal components regression to improve the predictive quality of the final model. Principal components regression poses special restrictions on the parameters, leading to a reduction in the number of parameters to be estimated.

Another variable selection technique, the CART method (classification and regression trees), originally introduced by Breiman et al. (1984), has recently gained popularity in data analysis. Implementation of CART leads to the identification of hierarchically ordered homogeneous subgroups defined by important explanatory variables.

The tree structure testing (TST) strategy (Commenges et al., 1989) is a further variable selection algorithm based on significance tests on groups of explanatory variables ordered in a tree structure. The procedure resembles Fisher's least significant difference test introduced in the context of the analysis of variance (Fisher, 1935). Its principal advantage lies in the opportunity to control the type I error of the whole selection procedure, i.e. the probability of selecting at least one variable which actually has no influence on the response. Many multiple comparison procedures can be applied to the tree of hypotheses to construct such a strategy holding the multiple significance level; for details see Gefeller & Kron (1992).

In epidemiologic applications another procedure for variable selection, termed the change-in-estimate method, is extensively used for the purpose of reducing the number of variables to be included in the final model (Greenland, 1989). This method is based on the subjective evaluation of changes in the parameter estimate of the explanatory variable of primary interest from model to model. Thus, this technique is only applicable in special situations where the interest lies in analyzing the relationship between the response and one independent variable controlled for the confounding influence of other variables to be identified by the change-in-estimate method.

3. IMPLEMENTATION IN PROCEDURES OF THE SAS SOFTWARE

This section contains a brief overview of all realizations of variable selection techniques in procedures of the SAS software. Although we screened all SAS components which could, at least theoretically, contain procedures incorporating some form of variable selection (SAS/STAT, SAS/ETS, SAS/OR, SAS/QC), we found such procedures only in the SAS/STAT component (SAS Institute Inc., 1990). Table 1 presents a list of all these procedures and further indicates which type of selection technique is offered by each procedure. Afterwards the different SAS procedures are briefly explained with respect to their capability of performing automatic variable selection.

Table 1: Procedures of the SAS/STAT software offering some form of variable selection

                     Type of variable selection
Name of
SAS procedure    stepwise methods   best subset selection   other techniques
PROC REG                x                    x                     x
PROC LOGISTIC           x                    x                     -
PROC PHREG              x                    x                     -
PROC STEPDISC           x                    -                     -
PROC VARCLUS            -                    -                     x


PROC REG: a general procedure for linear regression modelling. Several different techniques for automatic variable selection are implemented in PROC REG. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, the selection criterion relates to the p-value of the F-statistic for an explanatory variable, which reflects the variable's contribution to the linear regression model. The user can specify bounds for the p-value (SLENTRY, SLSTAY) to control the variable selection process. Best subset selection methods are implemented employing R2, adjusted R2 and Mallows' Cp as measures to judge the model fit, leading to the selection of the best subset of explanatory variables of a given size. Two additional selection procedures (MAXR, MINR) are included which imitate a best subset selection based on R2; however, these procedures are computationally faster at the expense that they may overlook the best subset.
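
A minimal sketch of the best subset options in PROC REG (dataset and variable names are hypothetical):

```sas
/* Report the 3 best subsets of each size, judged by Mallows' Cp
   (SELECTION=RSQUARE or SELECTION=ADJRSQ work analogously) */
proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=cp best=3;
run;

/* Faster approximation that may overlook the best subset */
proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=maxr;
run;
```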

PROC LOGISTIC: a procedure for nonlinear regression modelling of categorical data utilizing the logit transformation of the response probabilities. Four different techniques for automatic variable selection are implemented in PROC LOGISTIC. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, the selection criterion relates to the p-value of the adjusted Wald chi-square statistic for an explanatory variable, which reflects the variable's contribution to the logistic regression model. The user can specify bounds for the p-value (SLENTRY, SLSTAY) to control the variable selection process. In SAS/STAT version 6.07, best subset selection has been implemented based on the likelihood score statistic as a measure to judge the model fit, leading to the selection of the best subset of explanatory variables of a given size. This selection method uses the branch and bound algorithm proposed by Furnival and Wilson (1974) to speed up the computational process.
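
The analogous options in PROC LOGISTIC might look as follows (the binary response y and covariates x1-x5 are hypothetical):

```sas
/* Stepwise selection based on the adjusted Wald chi-square statistic */
proc logistic data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=stepwise slentry=0.05 slstay=0.10;
run;

/* Best subset selection via the likelihood score statistic
   (available from SAS/STAT version 6.07) */
proc logistic data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=score best=2;
run;
```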

PROC PHREG: a procedure for analyzing censored failure times according to the semiparametric proportional hazards regression model proposed by Cox (1972). This procedure has been included in the SAS/STAT software in version 6.07. Four different techniques for automatic variable selection are implemented in PROC PHREG. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, the selection criterion relates to the p-value of the adjusted Wald chi-square statistic for an explanatory variable, which reflects the variable's contribution to the proportional hazards regression model. The user can specify bounds for the p-value (SLENTRY, SLSTAY) to control the variable selection process. Best subset selection is implemented based on the likelihood score statistic as a measure to judge the model fit, leading to the selection of the best subset of explanatory variables of a given size. This selection method uses the branch and bound algorithm proposed by Furnival and Wilson (1974) to speed up the computational process.
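
For PROC PHREG the syntax is analogous; a sketch with a hypothetical survival time, censoring indicator (0 = censored), and covariates:

```sas
/* Backward elimination in the Cox proportional hazards model */
proc phreg data=mydata;
   model time*status(0) = x1 x2 x3 x4 x5 / selection=backward slstay=0.10;
run;

/* Best subsets of each size based on the likelihood score statistic */
proc phreg data=mydata;
   model time*status(0) = x1 x2 x3 x4 x5 / selection=score best=3;
run;
```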

PROC STEPDISC: a procedure specifically designed to perform stepwise methods in the context of discriminant analysis. The procedure selects variables to produce a discrimination model that can be useful for discriminating between several classes. All stepwise methods (forward selection, backward elimination, stepwise regression) are covered. In this context, two selection criteria are offered. One relates to the p-value of the F-statistic for an explanatory variable in an ANCOVA model taking all variables already chosen as covariates and the specific variable under consideration as the dependent variable in the model. The other criterion uses the squared partial correlation coefficient for predicting the variable under study. The user can specify bounds for the p-value (SLENTRY, SLSTAY) and for the squared partial correlation coefficient (PR2ENTRY, PR2STAY) to control the variable selection process.
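
A sketch of PROC STEPDISC; the class variable group and the predictors are hypothetical:

```sas
/* Stepwise discriminant analysis: variables enter while the p-value of
   the F-statistic is below SLENTRY and leave when it exceeds SLSTAY */
proc stepdisc data=mydata method=stepwise slentry=0.15 slstay=0.15;
   class group;
   var x1 x2 x3 x4 x5;
run;
```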

PROC VARCLUS: a procedure performing cluster analysis on sets of variables. The clusters are chosen to maximize the variation accounted for by either the first principal component or the centroid component of each cluster. Thus, the procedure can be used to reduce the number of variables in an analysis, employing selection criteria concerning either the first principal components or the centroid components of the clusters. The user can specify the percentage of variation that must be explained by the cluster component, or the largest permissible value of the second eigenvalue in the cluster components, to control the variable selection process. However, it is evident that variable selection in cluster analysis is based on a completely different statistical background compared to regression-based applications.
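
A sketch of PROC VARCLUS, after which one representative variable per cluster can be retained (dataset and variable names are hypothetical):

```sas
/* Split clusters until each cluster component explains at least 80%
   of the variation of its variables; alternatively, MAXEIGEN=1 bounds
   the second eigenvalue within each cluster */
proc varclus data=mydata percent=80;
   var x1 x2 x3 x4 x5 x6 x7 x8;
run;
```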

4. DISCUSSION

The practice of statistical modelling of observed data depends critically on the availability of easy-to-comprehend software covering the necessary computational tasks. In particular, the routine application of variable selection methods, which usually involve rather complex computational operations, is only feasible if appropriate software assists the user during this phase of data analysis. This paper has documented the current situation of implementation of variable selection techniques in the SAS software, the major international software package for statistical data analysis. The synopsis of all implemented selection algorithms has revealed that the current focus of the SAS software lies in stepwise methods and best subset selection. Other important techniques like, for example, the CART method or the TST strategy, have been completely neglected, although these procedures are popular tools in a variety of applications. Future developments in the SAS software should address this deficit and close the gap between users' demand for more implemented variable selection techniques and the current SAS software reality.

Constructing an appropriate variable selection algorithm applicable to all situations of variable selection in data analysis constitutes a challenging statistical problem for which an accepted standard solution is not apparent. The different strategies exhibit positive as well as negative properties depending on the demands raised under specific circumstances. Consequently, no variable selection technique can be recommended for general use in all applications. The specific requirements of the application have to be considered, and the decision to use a particular procedure has to be based on these considerations. The multiple comparison problem in variable selection algorithms, i.e. the proper statistical control of the overall error of the selection procedure, has to be taken into account whenever the final model forms the basis of further statistical inference. The common practice of ignoring the impact of variable selection procedures on the statistical error rates needs correction. Finally, it should be recognized in data analysis that all variable selection techniques must be used extremely carefully, especially in the situation of multicollinearity frequently encountered in practical applications, as none of the methods can be guaranteed to yield satisfactory results.

REFERENCES

Aitkin, M.A. (1971). Statistical theory (behavioural science application). Ann. Rev. Psychol. 22, 225-250.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and regression trees. Wadsworth, Belmont, CA.

Commenges, D., Dartigues, J.F., Peytour, P., Puymirat, E., Henry, P., Gagnon, M. (1989). A strategy for analysing multiple risk factors with application to cervical pain syndrome. Meth. Inform. Medicine 28, 14-19.

Cox, D.R. (1972). Regression models and life tables (with discussion). J. Royal Stat. Soc. B 34, 187-220.

Draper, N.R., Smith, H. (1981). Applied regression analysis (second edition). John Wiley, New York.

Fisher, R.A. (1935). The design of experiments. Oliver & Boyd Ltd., Edinburgh.

Furnival, G.M., Wilson, R.W. (1974). Regression by leaps and bounds. Technometrics 16, 499-511.

Gefeller, O., Kron, M. (1992). Controlling the multiple level of significance in variable-selecting algorithms: an improved version of the TST strategy. In MEDICOMP '92 - Application of computational and cybernetic methods in medicine and biology, eds. T. Asztalos, J. Eller and I. Gyori, pp. 101-108. SZOTE, Szeged, Hungary.

Greenland, S. (1989). Modelling and variable selection in epidemiologic analysis. Am. J. Public Health 79, 340-349.

Gunst, R.F., Mason, R.L. (1980). Regression analysis and its application. Marcel Dekker, New York.

Harrell, F.E., Lee, K.L., Califf, R.M., Pryor, D.B., Rosati, R.A. (1984). Regression modelling strategies for improved prognostic prediction. Statistics in Medicine 3, 143-152.

Hocking, R.R. (1976). The analysis and selection of variables in linear regression. Technometrics 18, 425-438.

Mantel, N. (1970). Why stepdown procedures in variable selection. Technometrics 12, 621-625.

McGee, D., Reed, D., Yang, K. (1984). The results of logistic analyses when the variables are highly correlated: An empirical example using diet and CHD incidence. J. Chron. Dis. 37, 713-719.

SAS Institute Inc. (1990). SAS/STAT User's Guide, 4th Edition. SAS Institute Inc., Cary, NC.

Address for correspondence:

Dr. Olaf Gefeller
Abteilung Medizinische Statistik
Georg-August-Universität Göttingen
Humboldtallee 32
W-3400 Göttingen
Germany

SAS, SAS/STAT, SAS/ETS, SAS/OR, and SAS/QC are registered trademarks of SAS Institute Inc., Cary, NC, USA.
