stage one: define the research problem draft only · the spss sample problem to demonstrate these...

The SPSS Sample Problem

To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS Professional Statistics 7.5, pages 37 - 64. The description of the problem can be found on page 39.

The data for this problem is: Prostate.Sav.

Stage One: Define the Research Problem

In this stage, the following issues are addressed:

- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables

Relationship to be analyzed

The goal of this analysis is to determine the relationship between the dependent variable NODALINV (whether or not the cancer has spread to the lymph nodes), and the independent variables of AGE (age of the subject), ACID (a laboratory test value that is elevated when the tumor has spread to certain areas), STAGE (whether or not the disease has reached an advanced stage), GRADE (aggressiveness of the tumor), and XRAY (positive or negative xray result).

Specifying the dependent and independent variables

The dependent variable is NODALINV 'Cancer spread to lymph nodes', a dichotomous variable.

The independent variables are:

- AGE 'Age of the subject'
- ACID 'Laboratory test score'
- XRAY 'Positive X-ray result'
- STAGE 'Disease reached advanced stage'
- GRADE 'Aggressive tumor'

Method for including independent variables

Since we are interested in the relationship between the dependent variable and all of the independent variables, we will use direct entry of the independent variables.

Stage 2: Develop the Analysis Plan: Sample Size Issues

In this stage, the following issues are addressed:

- Missing data analysis
- Minimum sample size requirement: 15-20 cases per independent variable

Missing data analysis

There is no missing data in this problem.

Minimum sample size requirement: 15-20 cases per independent variable

The data set has 53 cases and 5 independent variables for a ratio of about 10.6 to 1, short of the requirement that we have 15-20 cases per independent variable. We should look for opportunities to validate our findings against other samples before generalizing our results.
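The ratio check is plain arithmetic. As a minimal sketch in Python (the variable names are illustrative and not part of SPSS), the shortfall against the lower bound of the 15-20 guideline can be made explicit:

```python
# Cases-per-predictor check, using the counts given in the text.
n_cases = 53
n_predictors = 5
minimum_ratio = 15  # lower bound of the 15-20 cases guideline

ratio = n_cases / n_predictors
cases_needed = minimum_ratio * n_predictors

print(f"cases per independent variable: {ratio:.1f}")
print(f"cases needed for a {minimum_ratio}-to-1 ratio: {cases_needed}")
print("meets guideline:", ratio >= minimum_ratio)
```

Under the 15-to-1 lower bound, five predictors would call for 75 cases, which is why validation against other samples is recommended here.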

Stage 2: Develop the Analysis Plan: Measurement Issues

In this stage, the following issues are addressed:

- Incorporating nonmetric data with dummy variables
- Representing Curvilinear Effects with Polynomials
- Representing Interaction or Moderator Effects

Incorporating Nonmetric Data with Dummy Variables

All of the nonmetric variables have been recoded into dichotomous dummy-coded variables.

Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis.

Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.

Stage 3: Evaluate Underlying Assumptions

In this stage, the following issues are addressed:

- Nonmetric dependent variable with two groups
- Metric or dummy-coded independent variables

Nonmetric dependent variable having two groups

The dependent variable NODALINV 'Cancer spread to lymph nodes' is a dichotomous variable.

Metric or dummy-coded independent variables

AGE 'Age of the subject' and ACID 'Laboratory test score' are metric variables.

XRAY 'Positive X-ray result', STAGE 'Disease reached advanced stage', and GRADE 'Aggressive tumor' are nonmetric dichotomous variables.

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation

In this stage, the following issues are addressed:

- Compute logistic regression model

Compute the logistic regression

The steps to obtain a logistic regression analysis are detailed on the following screens.

Requesting a Logistic Regression

First, choose the 'Regression > Binary Logistic...' command from the 'Analyze' menu.

Specifying the Dependent Variable

First, move the dependent variable NODALINV 'Cancer spread to lymph nodes' to the 'Dependent:' text box.

Specifying the Independent Variables

First, move the independent variables to the 'Covariates:' list box.

- AGE 'Age of the subject'
- ACID 'Laboratory test score'
- XRAY 'Positive Xray result'
- STAGE 'Disease reached advanced stage'
- GRADE 'Aggressive tumor'

Specify the method for entering variables

First, accept the 'Enter' default entry method from the 'Method:' popup menu.

Specifying Options to Include in the Output

First, click on the 'Options...' button to open the Logistic Regression: Options dialog box.

Second, in the 'Statistics and Plots' panel, mark the 'Classification plots' checkbox, the 'Hosmer-Lemeshow goodness-of-fit' checkbox, and the 'Casewise listing of residuals' option button with the 'Outliers outside 2 std. dev.' option.

Third, mark the option button 'At each step' on the 'Display' panel to display output every time a variable is added.

Fourth, check "Iteration history" to obtain the starting model.

Fifth, accept the other defaults and click on the Continue button.

Specifying the New Variables to Save

First, click on the 'Save...' button to open the 'Logistic Regression: Save New Variables' dialog box.

Second, mark the 'Cook's' checkbox on the 'Influence' panel.

Third, click on the 'Continue' button.

Complete the Logistic Regression Request

First, click on the OK button to complete the logistic regression request.

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit

In this stage, the following issues are addressed:

- Significance test of the model log likelihood (Change in -2LL)
- Measures Analogous to R2: Cox and Snell R2 and Nagelkerke R2
- Hosmer-Lemeshow Goodness-of-fit
- Classification matrices as a measure of model accuracy
- Check for Numerical Problems
- Presence of outliers

Initial statistics before independent variables are included

The Initial Log Likelihood Function (-2 Log Likelihood or -2LL) is a statistical measure like total sums of squares in regression. If our independent variables have a relationship to the dependent variable, we will improve our ability to predict the dependent variable accurately, and the -2LL value will decrease. The initial -2LL value is 70.252 on step 0, before any variables have been added to the model.
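For a model that contains only a constant, the -2LL depends only on how the cases divide between the two outcome groups. The sketch below assumes group sizes of 33 and 20. These counts are not stated in the text; they are inferred here from the 53 cases and the 62.3% majority group mentioned in the classification discussion, so treat them as an assumption. With those counts the result should come out very close to the 70.252 that SPSS reports at step 0.

```python
import math

# Assumed group sizes (inferred, not stated in the text): 33 + 20 = 53 cases.
n_majority = 33
n_minority = 20

def null_minus_2ll(count_a, count_b):
    """-2 log likelihood of a constant-only model for a two-group outcome."""
    total = count_a + count_b
    log_likelihood = (count_a * math.log(count_a / total)
                      + count_b * math.log(count_b / total))
    return -2.0 * log_likelihood

initial_minus_2ll = null_minus_2ll(n_majority, n_minority)
print(f"initial -2LL: {initial_minus_2ll:.3f}")
```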

Significance test of the model log likelihood

The difference between the initial -2LL and the -2LL of the model that includes the independent variables is the model chi-square value (22.126 = 70.252 - 48.126), which is tested for statistical significance. This test is analogous to the F-test for R2 or change in R2 value in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.

In this problem the model Chi-Square value of 22.126 has a significance of 0.000, less than 0.05, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
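The significance test can be checked by hand. The following is a minimal sketch, not the SPSS routine: it uses only the Python standard library and a closed-form chi-square tail formula that is valid for integer degrees of freedom. The 5 degrees of freedom are an assumption (one per independent variable entered); `chi2_sf` is a helper name introduced here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!
        term = 1.0
        total = 1.0
        for j in range(1, df // 2):
            term *= (x / 2.0) / j
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite correction sum
    chi = math.sqrt(x)
    total = 0.0
    term = chi  # first term of the sum: chi / 1
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)  # next term: chi^(2r+1) / (1*3*...*(2r+1))
    normal_tail = math.erfc(chi / math.sqrt(2.0))
    return normal_tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total

initial_minus_2ll = 70.252  # step 0 value from the text
final_minus_2ll = 48.126    # value after entering the independent variables
df = 5                      # assumed: one per independent variable

model_chi_square = initial_minus_2ll - final_minus_2ll
p_value = chi2_sf(model_chi_square, df)
print(f"model chi-square = {model_chi_square:.3f}, p = {p_value:.4f}")
```

The p-value is well below 0.05, in line with the conclusion that the set of independent variables is significantly related to the dependent variable.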

Measures Analogous to R2

The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R2 measures in multiple regression.

The Cox and Snell R2 measure operates like R2, with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that had the range from 0 to 1. We will rely upon Nagelkerke's measure as indicating the strength of the relationship.

If we applied our interpretive criteria to the Nagelkerke R2 of 0.465, we would characterize the relationship as strong.
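Both measures can be derived from the two -2LL values and the sample size. The sketch below uses the standard textbook definitions (Cox and Snell R2 as one minus the n-th root of the squared likelihood ratio, and Nagelkerke R2 as that value divided by its maximum attainable value); it is offered as an illustration of the relationship between the two measures, not as a reproduction of SPSS internals.

```python
import math

n = 53                      # number of cases, from the text
initial_minus_2ll = 70.252  # constant-only model
final_minus_2ll = 48.126    # model with the independent variables

# Cox and Snell R2: 1 - exp(-(model chi-square) / n)
cox_snell = 1.0 - math.exp(-(initial_minus_2ll - final_minus_2ll) / n)

# Largest value Cox and Snell R2 can take for this data set
max_cox_snell = 1.0 - math.exp(-initial_minus_2ll / n)

# Nagelkerke R2 rescales Cox and Snell so the maximum is 1
nagelkerke = cox_snell / max_cox_snell

print(f"Cox and Snell R2 = {cox_snell:.3f}")
print(f"maximum possible = {max_cox_snell:.3f}")
print(f"Nagelkerke R2    = {nagelkerke:.3f}")
```

The rescaling shows why the Cox and Snell value is always the smaller of the two: its ceiling for this data set is well below 1.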

Correspondence of Actual and Predicted Values of the Dependent Variable

The final measure of model fit is the Hosmer and Lemeshow goodness-of-fit statistic, which measures the correspondence between the actual and predicted values of the dependent variable. In this case, better model fit is indicated by a smaller difference in the observed and predicted classification. A good model fit is indicated by a nonsignificant chi-square value.

The goodness-of-fit measure has a value of 5.954 which has the desirable outcome of nonsignificance.

The Classification Matrices as a Measure of Model Accuracy

The classification matrices in logistic regression serve the same function as the classification matrices in discriminant analysis, i.e. evaluating the accuracy of the model.

If the predicted and actual group memberships are the same, i.e. 1 and 1 or 0 and 0, then the prediction is accurate for that case. If predicted group membership and actual group membership are different, the model "misses" for that case. The overall percentage of accurate predictions (77.4% in this case) is the measure of a model that I rely on most heavily for this analysis as well as for discriminant analysis because it has a meaning that is readily communicated, i.e. the percentage of cases for which our model predicts accurately.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and the maximum by chance accuracy rate, if appropriate.

The proportional by chance accuracy rate is equal to 0.530 (0.623^2 + 0.377^2). A 25% increase over the proportional by chance accuracy rate would equal 0.663. Our model accuracy rate of 77.4% meets this criterion.

Since one of our groups contains 62.3% of the cases, we might also apply the maximum by chance criterion. A 25% increase over the largest group would equal 0.778. Our model accuracy rate of 77.4% almost meets this criterion.
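The two by-chance criteria can be recomputed as follows. This is a sketch under one assumption: the group sizes of 33 and 20 are not given in the text and are inferred from 53 cases with a 62.3% majority group. Working from counts rather than the rounded proportions is what yields 0.778 for the maximum by chance criterion.

```python
# Assumed group sizes (inferred from 53 cases and a 62.3% majority group).
group_sizes = [33, 20]
model_accuracy = 0.774  # overall accuracy rate reported in the text
improvement = 1.25      # the 25% improvement criterion

n = sum(group_sizes)
proportions = [size / n for size in group_sizes]

# Proportional by chance: sum of squared group proportions
proportional_chance = sum(p ** 2 for p in proportions)
# Maximum by chance: proportion of cases in the largest group
maximum_chance = max(proportions)

proportional_criterion = improvement * proportional_chance
maximum_criterion = improvement * maximum_chance

print(f"proportional by chance: {proportional_chance:.3f}, "
      f"criterion: {proportional_criterion:.3f}, "
      f"met: {model_accuracy >= proportional_criterion}")
print(f"maximum by chance:      {maximum_chance:.3f}, "
      f"criterion: {maximum_criterion:.3f}, "
      f"met: {model_accuracy >= maximum_criterion}")
```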

SPSS provides a visual image of the classification accuracy in the stacked histogram as shown below. To the extent to which the cases in one group cluster on the left and the other group clusters on the right, the predictive accuracy of the model will be higher.

Check for Numerical Problems

There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages: multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and "complete separation" whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables.

All of these problems produce large standard errors (over 2) for the variables included in the analysis and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.

The standard errors and B coefficients are not excessively large, so there is no evidence of a numeric problem with this analysis.

Presence of outliers

There are two outputs to alert us to outliers that we might consider excluding from the analysis: listing of residuals and saving Cook's distance scores to the data set.

SPSS provides a casewise list of residuals that identify cases whose residual is above or below a certain number of standard deviation units. Like multiple regression there are a variety of ways to compute the residual. In logistic regression, the residual is the difference between the observed probability of the dependent variable event and the predicted probability based on the model. The standardized residual is the residual divided by an estimate of its standard deviation. The deviance is calculated by taking the square root of -2 x the log of the predicted probability for the observed group and attaching a negative sign if the event did not occur for that case. Large values for deviance indicate that the model does not fit the case well. The studentized residual for a case is the change in the model deviance if the case is excluded. Discrepancies between the deviance and the studentized residual may identify unusual cases. (See the SPSS chapter on Logistic Regression Analysis for additional details, pages 57-61).
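The residual definitions in the paragraph above can be sketched for a single case. The function name and the example cases are hypothetical, and the standard deviation estimate is assumed to be the usual binomial one, the square root of p(1 - p). The studentized residual is left out because it requires refitting the model without the case.

```python
import math

def logistic_residuals(observed, predicted_prob):
    """Return (raw, standardized, deviance) residuals for one case.

    observed: 1 if the event occurred for the case, otherwise 0.
    predicted_prob: model probability of the event, strictly between 0 and 1.
    """
    raw = observed - predicted_prob
    # Assumed estimate of the residual's standard deviation: sqrt(p * (1 - p))
    standardized = raw / math.sqrt(predicted_prob * (1.0 - predicted_prob))
    # Predicted probability of the group the case actually belongs to
    if observed == 1:
        prob_of_observed_group = predicted_prob
    else:
        prob_of_observed_group = 1.0 - predicted_prob
    deviance = math.sqrt(-2.0 * math.log(prob_of_observed_group))
    if observed == 0:
        deviance = -deviance  # negative sign when the event did not occur
    return raw, standardized, deviance

# Hypothetical cases: (observed outcome, predicted probability of the event)
for observed, prob in [(1, 0.90), (0, 0.90), (1, 0.10)]:
    raw, standardized, deviance = logistic_residuals(observed, prob)
    print(f"y={observed} p={prob:.2f} raw={raw:+.2f} "
          f"standardized={standardized:+.2f} deviance={deviance:+.2f}")
```

Cases whose observed outcome contradicts a confident prediction (the second and third hypothetical cases) produce the large residuals that the casewise listing is designed to surface.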

In the output for our problem, SPSS listed three cases that may be considered outliers, with studentized residuals greater than 2:

SPSS has an option to compute Cook's distance as a measure of influential cases and add the score to the data editor. I am not aware of a precise formula for determining what cutoff value should be used, so we will rely on the more traditional method for interpreting Cook's distance, which is to identify cases that either have a score of 1.0 or higher, or cases which have a Cook's distance substantially different from the others. The prescribed method for detecting unusually large Cook's distance scores is to create a scatterplot of Cook's distance scores versus case id.

Request the Scatterplot

First, select the 'Scatter...' command from the Graphs menu.

Second, highlight the thumbnail sketch of the 'Simple' scatterplot.

Third, click on the Define button to specify the variables for the scatterplot.

Specifying the Variables for the Scatterplot

First, move the variable COO_1 to the text box for the 'Y Axis:' variable.

Second, move the case variable to the 'X Axis' text box.

Third, click on the OK button to complete the request.

The Scatterplot of Cook's Distances

On the plot of Cook's distances, we see a case that exceeds the 1.0 rule of thumb for influential cases. Scanning the data in the data editor, we find that the case with the large Cook's distance is case 24. If we study case 24 in the data editor, we will find that this case had the highest score for the acid variable, but no nodal involvement. Comparing this case to the two cases with the next highest acid score, case 25 with a score of 136 and case 53 with a score of 126, we see that both of these cases had nodal involvement, suggesting that a high acid score is associated with nodal involvement. We can consider case 24 as a candidate for exclusion from the analysis.
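The same screening can be done without a plot by filtering the saved Cook's distance scores against the 1.0 rule of thumb. The values below are hypothetical placeholders keyed by made-up case ids; they are not the COO_1 values from this analysis.

```python
# Hypothetical Cook's distance scores keyed by case id (not actual SPSS output).
cooks_distance = {1: 0.08, 2: 0.15, 3: 0.04, 4: 1.30, 5: 0.21, 6: 0.05}
cutoff = 1.0  # rule of thumb for influential cases

# Cases at or above the cutoff
flagged = sorted(case for case, score in cooks_distance.items() if score >= cutoff)

# Rank cases from largest to smallest score to see how far the top case stands apart
ranked = sorted(cooks_distance.items(), key=lambda item: item[1], reverse=True)
(top_case, top_score), (_, next_score) = ranked[0], ranked[1]

print("cases at or above the cutoff:", flagged)
print(f"largest score: case {top_case} at {top_score:.2f}; "
      f"next largest: {next_score:.2f}")
```

Comparing the largest score with the next largest is a rough stand-in for the visual judgment of whether one case is "substantially different from the others".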

Stage 5: Interpret the Results

In this section, we address the following issues:

- Identifying the statistically significant predictor variables
- Direction of relationship and contribution to dependent variable

Identifying the statistically significant predictor variables

The coefficients are found in the column labeled B. The test that a coefficient is not zero, i.e. that the variable changes the odds of the dependent variable event, is made with the Wald statistic instead of the t-test that was used for the individual B coefficients in the multiple regression equation.

Similar to the output for a regression equation, we examine the probabilities of the test statistic in the column labeled "Sig," where we identify that the variable STAGE 'Disease reached advanced stage' and the variable XRAY 'Positive X-ray result' have a statistically significant relationship with the dependent variable.

Direction of relationship and contribution to dependent variable

The signs of both of the statistically significant independent variables are positive, indicating a direct relationship with the dependent variable. Our interpretation of these variables is that positive (yes or 1) values to both questions XRAY 'Positive Xray result' and STAGE 'Disease reached advanced stage' are associated with the positive (yes or 1) category of the dependent variable NODALINV 'Cancer spread to lymph nodes'.

Interpretation of the independent variables is aided by the "Exp (B)" column which contains the odds ratio for each independent variable. Thus, we would say that for persons with a value of 1 on STAGE 'Disease reached advanced stage', the odds of having a score of 1 on the dependent variable NODALINV 'Cancer spread to lymph nodes' are 4.77 times as large. Similarly, for persons whose score is 1 on the independent variable XRAY 'Positive X-ray result', the odds of lymph node involvement are 7.73 times as large.
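Because Exp(B) multiplies the odds rather than the probability, it can help to translate an odds ratio into a change in probability. The sketch below recovers the B coefficients from the two Exp(B) values quoted above and applies each odds ratio to a hypothetical baseline probability of 0.20, which is an illustrative number and not a value from the SPSS output.

```python
import math

# Odds ratios (Exp(B)) quoted in the text
odds_ratios = {"STAGE": 4.77, "XRAY": 7.73}

# B is the natural log of Exp(B)
coefficients = {name: math.log(value) for name, value in odds_ratios.items()}

def apply_odds_ratio(baseline_prob, odds_ratio):
    """Convert a probability to odds, multiply by the odds ratio, convert back."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

baseline = 0.20  # hypothetical probability when the predictor equals 0
for name, odds_ratio in odds_ratios.items():
    new_prob = apply_odds_ratio(baseline, odds_ratio)
    print(f"{name}: B = {coefficients[name]:.3f}, Exp(B) = {odds_ratio}, "
          f"probability {baseline:.2f} -> {new_prob:.3f}")
```

With this baseline, an odds ratio of 4.77 raises the probability from 0.20 to roughly 0.54, which is less than a 4.77-fold increase in probability.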

Stage 6: Validate The Model

When we have a small sample in the full data set as we do in this problem, a split half validation analysis is almost guaranteed to fail because we will have little power to detect statistical differences in analyses of the validation samples. In this circumstance, our alternative is to conduct validation analyses with random samples that comprise the majority of the sample. We will demonstrate this procedure in the following steps:

- Computing the First Validation Analysis
- Computing the Second Validation Analysis
- The Output for the Validation Analysis

Computing the First Validation Analysis

We set the random number seed and modify our selection variable so that it selects about 75-80% of the sample.

Set the Starting Point for Random Number Generation

First, select the 'Random Number Seed...' command from the 'Transform' menu.

Second, click on the 'Set seed to:' option to access the text box for the seed number.

Third, accept the default '2000000' in the 'Set seed to:' text box.

Fourth, click on the OK button to complete this action.

Compute the Variable to Select a Large Proportion of the Data Set

First, select the 'Compute...' command from the Transform menu.

Second, create a new variable named 'split1' that has the values 1 and 0 to divide the sample into two parts. Type the name 'split1' into the 'Target Variable:' text box.

Third, type the formula 'uniform(1) > 0.15' in the 'Numeric Expression:' text box. The uniform function will generate a random number between 0.0 and 1.0 for each case. If the generated random number is greater than 0.15, the numeric expression will result in a 1, since the numeric expression is true. We will include cases with a split value of 1 in the validation analysis.

Fourth, we click on the OK button to compute the split variable.
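The logic of the split variable can be sketched outside SPSS. The function below is a hypothetical Python analogue of 'uniform(1) > 0.15' with a fixed seed. Python's random number generator is not the SPSS generator, so the same seed will not select the same cases that SPSS selects; only the mechanism is the same.

```python
import random

def make_split(n_cases, seed, threshold=0.15):
    """Return a list of 1/0 flags, one per case, mimicking 'uniform(1) > threshold'."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    # rng.random() draws a number in [0.0, 1.0); the comparison yields True/False
    return [1 if rng.random() > threshold else 0 for _ in range(n_cases)]

split1 = make_split(n_cases=53, seed=2000000)

n_selected = sum(split1)
print(f"cases with split1 = 1: {n_selected} of {len(split1)} "
      f"({n_selected / len(split1):.0%})")
```

Because each case is included independently, the selected share varies from one seed to the next, which is why two different seeds give learning samples of different sizes.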

Specify the Cases to Include in the First Validation Analysis

First, select 'Logistic Regression' from the 'Dialog Recall' drop down menu.

Specify the Value of the Selection Variable for the First Validation Analysis

First, click on the 'Select>>' button to expose the 'Selection Variable:' text box.

Second, highlight the 'split1' variable and click on the move button to put it into the 'Selection Variable:' text box.

Third, after 'split1=?' appears in the 'Selection Variable:' text box, click on the 'Value...' button to specify which cases to include in the screening sample.

Fourth, type a '1' in the 'Value for Selection Variable:' text box.

Fifth, click on the 'Continue' button to complete setting the value.

Computing the Second Validation Analysis

We reset the random number seed to another value and modify our selection variable so that it selects about 75-80% of the sample.

Set the Starting Point for Random Number Generation

First, select the 'Random Number Seed...' command from the 'Transform' menu.

Second, click on the 'Set seed to:' option to access the text box for the seed number.

Third, type '2000001' in the 'Set seed to:' text box.

Fourth, click on the OK button to complete this action.

Compute the Variable to Select a Large Proportion of the Data Set

First, select the 'Compute...' command from the Transform menu.

Second, create a new variable named 'split2' that has the values 1 and 0 to divide the sample into two parts. Type the name 'split2' into the 'Target Variable:' text box.

Third, type the formula 'uniform(1) > 0.15' in the 'Numeric Expression:' text box. The uniform function will generate a random number between 0.0 and 1.0 for each case. If the generated random number is greater than 0.15, the numeric expression will result in a 1, since the numeric expression is true. We will include cases with a split value of 1 in the validation analysis.

Fourth, we click on the OK button to compute the split variable.

Specify the Cases to Include in the Second Validation Analysis

First, select 'Logistic Regression' from the 'Dialog Recall' drop down menu.

Specify the Value of the Selection Variable for the Second Validation Analysis

First, highlight the 'split1=1' text in the 'Selection Variable:' text box and click on the move button to return the variable to the list of variables.

Second, highlight the 'split2' variable and click on the move button to put it into the 'Selection Variable:' text box.

Third, after 'split2=?' appears in the 'Selection Variable:' text box, click on the 'Value...' button to specify which cases to include in the screening sample.

Fourth, type a '1' in the 'Value for Selection Variable:' text box.

Fifth, click on the 'Continue' button to complete setting the value.

Generalizability of the Logistic Regression Model

We can summarize the results of the validation analyses in the following table.

| | Full Model | Split1 = 1 | Split2 = 1 |
| --- | --- | --- | --- |
| Model Chi-Square | 22.126, p=.0005 | 16.275, p=.0061 | 16.609, p=.0053 |
| Nagelkerke R2 | .465 | .452 | .424 |
| Accuracy Rate for Learning Sample | 77.36% | 75.00% | 75.56% |
| Accuracy Rate for Validation Sample | | 76.92% | 87.50% |
| Significant Coefficients (p < 0.05) | STAGE 'Disease reached advanced stage' and the variable XRAY 'Positive Xray result' | STAGE 'Disease reached advanced stage' and the variable (0.0531) | STAGE 'Disease reached advanced stage' and the variable |

As we can see in the table, the results for each analysis are approximately the same, except that the variable STAGE 'Disease reached advanced stage' was not quite significant in the first validation analysis.

Based on the validation analyses, I would conclude that our results are generalizable.
