
SW388R7 Data Analysis & Computers II

Slide 1

Logistic Regression – Hierarchical Entry of Variables

Sample Problem

Steps in Solving Problems



Slide 2

Level of Measurement - question

The first question requires us to examine the level of measurement requirements for binary logistic regression.

Binary logistic regression requires that the dependent variable be dichotomous and the independent variables be metric or dichotomous.



Slide 3

Level of Measurement – evidence and answer

True with caution is the correct answer, since we satisfy the level of measurement requirements, but include ordinal level variables in the analysis.



Slide 4

Sample Size - question

The second question asks about the sample size requirements for binary logistic regression.

To answer this question, we will run a baseline logistic regression to obtain some basic data about the problem and solution. The phrase “hierarchical entry” dictates the method for including variables in the model.



Slide 5

Request hierarchical logistic regression

Select the Regression | Binary Logistic… command from the Analyze menu.



Slide 6

Selecting the dependent variable

First, highlight the dependent variable grass in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Dependent text box.



Slide 7

Selecting the control independent variables

First, move the control independent variable, sex, listed in the problem to the Covariates list box. This will be the only variable in Block 1.

Second, make sure that Enter is selected in the Method drop down menu. This tells SPSS that all of the variables in Block 1 will be included at the same time.



Slide 8

Selecting the block for the predictors

Next, click on the Next button to add the second block that will contain the predictors.



Slide 9

Adding the predictor independent variables

First, move the predictors to the Covariates list box.

Block 2 of 2 tells us that we are entering variables in the second block.



Slide 10

Specifying the method for including variables

In our hierarchical regression, we will specify that all of the variables in Block 2 be entered simultaneously when the block is entered.



Slide 11

Including the option for listing outliers

SPSS will include a table of outliers in the output if we include the option to produce the table.



Slide 12

Set the option for listing outliers

First, mark the checkbox for Casewise listing of residuals, accepting the default of outliers outside 2 standard deviations.

Second, click on the At last step option to display the table of outliers only at the end of the analysis.



Slide 13

Requesting statistics needed for identifying outliers

SPSS will calculate the values for studentized residuals and save them to the data set so that we can remove the outliers easily.

Click on the Save… button to request the statistics that we want to save.



Slide 14

Saving statistics needed for removing outliers

First, mark the checkbox for Studentized residuals in the Residuals panel.

Second, click on the Continue button to complete the specifications.



Slide 15

Completing the logistic regression request

Click on the OK button to request the output for the logistic regression.

The logistic procedure supports the selection of subsets of cases, automatic recoding of nominal variables, saving other diagnostic statistics like standardized residuals, and options for additional statistics. However, none of these are needed for this analysis.



Slide 16

Case Processing Summary

Unweighted Cases(a)                          N    Percent
Selected Cases     Included in Analysis    163      60.4
                   Missing Cases           107      39.6
                   Total                   270     100.0
Unselected Cases                             0        .0
Total                                      270     100.0

a. If weight is in effect, see classification table for the total number of cases.

Sample size – evidence and answer

The minimum ratio of valid cases to independent variables for logistic regression is 10 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 163 valid cases and 3 independent variables. The ratio of cases to independent variables is 54.33 to 1, which satisfies the minimum requirement. In addition, the ratio of 54.33 to 1 satisfies the preferred ratio of 20 to 1.

The question which precipitated computing the logistic regression in SPSS was the question about sample size. We can now answer that question.
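The ratio check described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `sample_size_answer` and the answer labels are assumptions, with the labels borrowed from the sample size decision chart later in these slides.

```python
# Counts taken from the Case Processing Summary and the problem statement.
valid_cases = 163
independent_variables = 3  # sex, polviews, happy

MINIMUM_RATIO = 10.0    # minimum cases per independent variable
PREFERRED_RATIO = 20.0  # preferred cases per independent variable


def sample_size_answer(cases: int, predictors: int) -> str:
    """Classify the cases-to-variables ratio against the two thresholds."""
    ratio = cases / predictors
    if ratio < MINIMUM_RATIO:
        return "inappropriate application of a statistic"
    if ratio < PREFERRED_RATIO:
        return "true with caution"
    return "true"


ratio = valid_cases / independent_variables
print(f"ratio of cases to independent variables: {ratio:.2f} to 1")
print(sample_size_answer(valid_cases, independent_variables))
```

With 163 cases and 3 independent variables the ratio clears both thresholds, so no caution about sample size is needed.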



Slide 17

Outliers - question

Outliers are defined as cases that have a studentized residual greater than 2.0 in absolute value, that is, beyond +/-2.0.



Slide 18

Outliers – evidence and answer

False is the correct answer for the statement that there are no outliers.

Using the criterion of studentized residuals beyond +/- 2.0, SPSS identified three outliers: case number 29, case number 92, and case number 173.

Note that the cases are identified by the information in the footnote, and not by the list of standardized residuals (zresid) in the table.



Slide 19

Model Selected for Interpretation - question

Since we have found outliers, we need to determine whether we will interpret the model that includes all cases or the model that excludes outliers.



Slide 20

Accuracy rate for baseline model

The accuracy rate for the model used to detect outliers (70.6%) is used for the baseline accuracy rate.

We will compare this to the accuracy rate for the model excluding outliers.

In hierarchical logistic regression, we interpret the output for Block 2, when both the controls and the predictors have been entered into the analysis.



Slide 21

Removing the outliers from the analysis - 1

Our next step is to run the revised logistic regression model that omits outliers. Our first step in this process is to tell SPSS to exclude the outliers from the analysis.

We accomplish this by telling SPSS to include in the analysis all of the cases that are not outliers.

First, select the Select Cases… command from the Data menu.



Slide 22


First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.

Second, click on the If… button to specify the criteria for inclusion in the analysis.



Slide 23


To eliminate the outliers, we request the cases that are not outliers be selected into the analysis.

The formula specifies that we should include cases if the absolute value of the studentized residual (sre_1) is less than or equal to 2.00.

The abs() or absolute value function tells SPSS to ignore the sign of the value.

After typing in the formula, click on the Continue button to close the dialog box.
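For readers who think in code, the selection rule can be sketched in Python. This is not SPSS syntax, and the residual values below are made up purely for illustration; only the rule itself (keep a case when the absolute value of its studentized residual is at most 2.0) comes from the slide.

```python
def keep_non_outliers(studentized_residuals, cutoff=2.0):
    """Return the (case_number, residual) pairs with |residual| <= cutoff."""
    return [(case, r) for case, r in studentized_residuals if abs(r) <= cutoff]


# Hypothetical residuals for illustration only; these are NOT from the data set.
residuals = [(1, 0.41), (2, -2.35), (3, 1.98), (4, 2.07), (5, -0.12)]

kept = keep_non_outliers(residuals)
print([case for case, _ in kept])
```

Taking the absolute value first is what lets a single comparison catch both large positive and large negative residuals.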



Slide 24


To complete the request, we click on the OK button.



Slide 25

Revised logistic regression omitting outliers - 1

To run the logistic regression eliminating the outliers, select the Logistic Regression command from the menu that drops down when you click on the Dialog Recall button.



Slide 26


When we wanted to detect outliers, we asked SPSS to save the studentized residuals to the data editor.

Since we no longer need the studentized residuals, we will omit saving them from this analysis.

Click on the Save button to open the dialog box.



Slide 27


Clear the checkbox for Studentized Residuals so that SPSS does not save a new set of them in the data editor when it runs the new regression.

Click on the Continue button to close the dialog box.



Slide 28


Click on the OK button to obtain the output for the revised model.

The other specifications for the logistic regression are the same as previously marked.



Slide 29

Accuracy rate for revised model

Prior to the removal of outliers, the accuracy rate of the logistic regression model was 70.6%. After removing outliers, the accuracy rate of the logistic regression model was 71.3%.

Since the logistic regression omitting outliers was less than two percent more accurate in classifying cases than the logistic regression with all cases, the logistic regression model with all cases is interpreted.

False is the correct answer to the statement that we will interpret the model that excludes outliers. We will interpret the model that includes all cases.

In hierarchical logistic regression, we interpret the output for Block 2, when both the controls and the predictors have been entered into the analysis.
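The comparison on this slide amounts to a simple rule, sketched here in Python. The function name and return labels are illustrative assumptions; the 2 percentage point threshold and the two accuracy rates come from the slide.

```python
def model_to_interpret(baseline_accuracy, revised_accuracy, threshold=2.0):
    """Pick the model to interpret from two accuracy rates given in percent.

    The model without outliers is preferred only when it improves
    accuracy by at least `threshold` percentage points.
    """
    gain = revised_accuracy - baseline_accuracy
    if gain >= threshold:
        return "model excluding outliers"
    return "baseline model with all cases"


# Accuracy rates reported for this problem.
print(model_to_interpret(70.6, 71.3))
```

Here the gain is about 0.7 percentage points, well short of the threshold, so the baseline model is retained.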



Slide 30

Restore all cases and run the baseline model again

Since we will interpret the model including the outliers, we need to add the excluded cases back into the analysis.

Choose the Select Cases… command from the Data menu.



Slide 31

Select all cases

First, mark the option button for All cases.

Second, click on the OK button to close the dialog box.



Slide 32

Re-run baseline model - 1

To re-run the baseline logistic regression including the outliers, select the Logistic Regression command from the menu that drops down when you click on the Dialog Recall button.



Slide 33

Re-run baseline model - 2

We want to run the same logistic regression analysis we have previously run. All we need to do is click on the OK button.



Slide 34

Multicollinearity and Numerical Problems - question

Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables.

Analyses that indicate numerical problems should not be interpreted.



Slide 35

Multicollinearity and Numerical Problems – evidence and answer

The standard errors for the variables included in the analysis were: "liberal or conservative political views" (.133), "general happiness" (.362) and "sex" (.356).

None of the independent variables in this analysis had a standard error larger than 2.0.

True is the correct answer.
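The check is mechanical enough to express as a short Python sketch. The dictionary keys are the SPSS variable names and the helper name is an assumption; the standard errors and the 2.0 limit come from the slides.

```python
# Standard errors as reported on this slide.
standard_errors = {"polviews": 0.133, "happy": 0.362, "sex": 0.356}

SE_LIMIT = 2.0  # a standard error above this signals a numerical problem


def variables_with_numerical_problems(errors, limit=SE_LIMIT):
    """Return the names of variables whose standard error exceeds the limit."""
    return [name for name, se in errors.items() if se > limit]


flagged = variables_with_numerical_problems(standard_errors)
print(flagged if flagged else "no numerical problems indicated")
```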



Slide 36

Overall Relationship - question

The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the block chi-square at Block 2, after the predictor independent variables have been added to the analysis.



Slide 37

Overall Relationship – evidence and answer

True is the correct answer.

In a hierarchical logistic regression, the presence of a relationship between the dependent variable and combination of independent variables entered after the control variables have been included is based on the statistical significance of the block chi-square for the second block of variables in which the predictor independent variables are included.

In this analysis, the probability of the block chi-square (20.308) was p<0.001, less than or equal to the level of significance of 0.05. The null hypothesis that there is no difference between the model with only a constant and the control variables versus the model with the predictor independent variables was rejected. The contribution of the predictor independent variables to the relationship with the dependent variable was supported.
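As a rough check on the reported probability, the upper-tail area of a chi-square distribution can be computed directly. The sketch below assumes the block test has 2 degrees of freedom, one for each predictor entered in Block 2 (polviews and happy); the slide itself does not state the degrees of freedom. For 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2).

```python
import math


def chi_square_p_value_df2(chi_square: float) -> float:
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-chi_square / 2.0)


block_chi_square = 20.308  # block chi-square reported for Block 2
p_value = chi_square_p_value_df2(block_chi_square)

print(f"p = {p_value:.6f}")
print("significant at 0.05" if p_value <= 0.05 else "not significant at 0.05")
```

Under that assumption the probability is far below 0.001, in line with the value SPSS reports.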



Slide 38

Individual Relationships – Political Views - question

To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio.



Slide 39

Variables in the Equation

                          B      S.E.     Wald    df    Sig.    Exp(B)
Step 1(a)   SEX          .017    .356     .002     1    .961     1.018
            POLVIEWS    -.352    .133    7.029     1    .008      .704
            HAPPY      -1.253    .362   12.003     1    .001      .286
            Constant    3.484   1.126    9.577     1    .002    32.597

a. Variable(s) entered on step 1: POLVIEWS, HAPPY.

Individual Relationships – Political Views – evidence and answer

The probability of the Wald statistic for the variable "liberal or conservative political views" [polviews] was p=0.008, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for "liberal or conservative political views" [polviews] was equal to zero was rejected.
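The Wald test can be reproduced approximately from the table with a short Python sketch. It relies on two standard facts: the Wald statistic is the squared ratio of B to its standard error, and its probability comes from a chi-square distribution with 1 degree of freedom. The function names are illustrative, and because the table shows rounded values, the recomputed statistic differs slightly from the one SPSS prints.

```python
import math


def wald_statistic(b: float, se: float) -> float:
    """Wald chi-square: the squared ratio of a coefficient to its standard error."""
    return (b / se) ** 2


def wald_p_value(wald: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(wald / 2.0))


# Values for polviews from the Variables in the Equation table.
approx_wald = wald_statistic(-0.352, 0.133)  # close to the reported 7.029
p_polviews = wald_p_value(7.029)

print(f"Wald recomputed from rounded B and S.E.: {approx_wald:.3f}")
print(f"p for reported Wald of 7.029: {p_polviews:.3f}")
```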

"Liberal or conservative political views" [polviews] is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were more conservative.



Slide 40





Individual Relationships – Political Views – evidence and answer

The value of Exp(B) was 0.704 which implies a decrease in the odds of 29.6% (0.704 - 1.0 = -0.296).

The correct interpretation of the relationship is that 'survey respondents who were more conservative were 29.6% less likely to have been more supportive that the use of marijuana should be made legal.'

True with caution is the correct answer.

Caution in interpreting the relationship should be exercised because the ordinal level variable "liberal or conservative political views" [polviews] was treated as metric.
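The conversion from Exp(B) to a percentage change in the odds is simple arithmetic, sketched here in Python. The function name is an assumption; the coefficient and odds ratio values are those reported in the Variables in the Equation table.

```python
import math


def percent_change_in_odds(exp_b: float) -> float:
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (exp_b - 1.0) * 100.0


# Exp(B) is e raised to the power B; rounding in the table causes a small gap.
print(f"exp(-0.352) = {math.exp(-0.352):.3f}  (table reports .704)")

# Percentage change using the Exp(B) value shown in the table.
print(f"polviews: {percent_change_in_odds(0.704):.1f}% change in the odds")
```

A negative result is a decrease in the odds. A value of Exp(B) above 1.0 would give a positive percentage, that is, an increase.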



Slide 41

Individual Relationships – General Happiness - question

To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio.



Slide 42





Individual Relationships – General Happiness – evidence and answer

The probability of the Wald statistic for the variable "general happiness" [happy] was p=0.001, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for "general happiness" [happy] was equal to zero was rejected.

"General happiness" [happy] is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were happier overall.



Slide 43





Individual Relationships – General Happiness – evidence and answer

The value of Exp(B) was 0.286 which implies a decrease in the odds of 71.4% (0.286 - 1.0 = -0.714).

The correct interpretation of the relationship is that 'survey respondents who were happier overall were 71.4% less likely to have been more supportive that the use of marijuana should be made legal.'


Caution in interpreting the relationship should be exercised because the ordinal level variable "general happiness" [happy] was treated as metric.



Slide 44

Classification Accuracy - question

The independent variables could be characterized as useful predictors distinguishing survey respondents who have been more supportive that the use of marijuana should be made legal from survey respondents who have been less supportive that the use of marijuana should be made legal if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.



Slide 45

Classification Accuracy – computing by chance accuracy rate

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0. The proportion in the Not Legal group was 0.644, making the proportion in the Legal group 0.356 (1.0 – 0.644).

The proportions of cases in each group are then squared and summed (0.644² + 0.356² = 0.541).

The proportional by chance accuracy criterion is 25% higher, or 67.7% (1.25 x 54.1% = 67.7%).
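The by chance computation can be verified with a few lines of Python. The function name is illustrative; the group proportions (0.644 and 0.356), the 1.25 multiplier, and the 70.6% accuracy rate are the values used in this problem.

```python
def proportional_by_chance_accuracy(group_proportions):
    """Sum of the squared group proportions."""
    return sum(p * p for p in group_proportions)


# Proportions of cases in the Not Legal and Legal groups at Step 0.
chance_rate = proportional_by_chance_accuracy([0.644, 0.356])

# The model must beat the by chance rate by 25% to be called useful.
criterion = 1.25 * chance_rate

model_accuracy = 0.706  # classification accuracy reported by SPSS

print(f"by chance accuracy rate: {chance_rate:.3f}")
print(f"criterion (1.25 x rate): {criterion:.3f}")
print("criterion satisfied" if model_accuracy >= criterion else "criterion not satisfied")
```

The criterion rounds to 67.7% when the unrounded by chance rate is carried through the multiplication.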



Slide 46

Classification Accuracy – evidence and answer

The classification accuracy rate computed by SPSS was 70.6%, which was greater than or equal to the proportional by chance accuracy criterion of 67.7% (1.25 x 54.1% = 67.7%).

The criterion for classification accuracy is satisfied.

True is the correct answer to the question.



Slide 47

Validation - question

For a hierarchical logistic regression, the 75%-25% cross-validation must verify the overall contribution of the independent variables entered after the control variables have been included.

In addition, the pattern of significance for the individual relationships between the dependent variable and the predictors for the training sample should be the same as the pattern for the full data set.

And finally, the classification accuracy rate for the validation sample must be within 2% of the accuracy rate for the training sample.



Slide 48

Validation analysis: set the random number seed

To set the random number seed, select the Random Number Seed… command from the Transform menu.



Slide 49

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box.

Note that SPSS does not provide you with any feedback about the change.



Slide 50

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, select the Compute… command from the Transform menu.



Slide 51

The formula for the split variable

First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split is shown in the text box.

The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75.

If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false.

Third, click on the OK button to complete the dialog box.
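The logic of the split variable can be mimicked in Python for readers who want to see it outside SPSS. This is a sketch only: Python's random number generator is not the SPSS generator, so the same seed will not reproduce the SPSS split, and the seed 12345 below is a placeholder rather than the seed stated in the problem.

```python
import random


def make_split(n_cases: int, seed: int, training_share: float = 0.75):
    """Return a list of 1s (training sample) and 0s (validation sample)."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    return [1 if rng.random() <= training_share else 0 for _ in range(n_cases)]


# 270 cases in the data set; the seed value is hypothetical.
split = make_split(270, seed=12345)

training_cases = sum(split)
print(f"training cases: {training_cases}, validation cases: {len(split) - training_cases}")
```

Because each case is assigned independently, the training sample will be close to, but rarely exactly, 75% of the cases.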



Slide 52

Running the logistic regression again with the training sample

We repeat the logistic regression analysis for the training sample.

Select the Regression | Binary Logistic… command from the Analyze menu.



Slide 53

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.

Second, click on the right arrow button to move the split variable to the Selection Variable text box.



Slide 54

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.

Click on the Rule… button to enter a value for split.



Slide 55

Completing the value selection

First, type the value for the training sample, 1, into the Value text box.

Second, click on the Continue button to complete the value entry.



Slide 56

Requesting output for the validation sample

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

Click on the OK button to request the output.



Slide 57

Validation – evidence and answer: Overall relationship

The significance of the overall relationship between the individual independent variables and the dependent variable supports the interpretation of the model using the full data set.

For a hierarchical logistic regression, the cross-validation must verify the contribution of the independent variables entered after the control variables have been included. This is based on the statistical significance of the block chi-square for the second block of variables. In the cross-validation analysis, the relationship between the independent variables and the dependent variable taking into account the effect of the control variables was statistically significant. The probability for the block chi-square (23.287) testing the block of independent variables was p<0.001.



Slide 58

Validation – evidence and answer: Individual relationship – Political Views

The relationship between "liberal or conservative political views" [polviews] and "support for legalization of marijuana" [grass] was statistically significant for the model using the full data set (p=0.008).

Similarly, the relationship in the cross-validation analysis was statistically significant. In the cross-validation analysis, the probability for the test of relationship between "liberal or conservative political views" [polviews] and "support for legalization of marijuana" [grass] was p=0.004, which was less than or equal to the level of significance of 0.05 and statistically significant.



Slide 59

Validation – evidence and answer: Individual relationship – General Happiness

The pattern of significance for the individual relationships between the dependent variable and the independent variables was the same for the analysis using the full data set and the 75% training sample.

The relationship between "general happiness" [happy] and "support for legalization of marijuana" [grass] was statistically significant for the model using the full data set (p=0.001).

Similarly, the relationship in the cross-validation analysis was statistically significant. In the cross-validation analysis, the probability for the test of relationship between "general happiness" [happy] and "support for legalization of marijuana" [grass] was p<0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.



Slide 60

Validation – evidence and answer: Classification accuracy

The classification accuracy rate for the model using the training sample was 66.9%, compared to 66.7% for the validation sample. The shrinkage in classification accuracy for the validation analysis is the difference between the accuracy for the training sample (66.9%) and the accuracy for the validation sample (66.7%), which equals 0.2% in this analysis. The shrinkage was within the 2% criterion for minimal shrinkage, small enough to support a conclusion that the logistic regression model based on this analysis would be effective in predicting scores for cases other than those included in the calculation of the regression analysis.

The validation analysis supports the generalizability of the analysis.

The answer to the question is true.
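The shrinkage check can be written out as a small Python sketch. The function names are assumptions, and treating "within 2%" as an absolute difference of at most 2 percentage points is one reading of the criterion; the two accuracy rates are those reported above.

```python
def shrinkage(training_accuracy: float, validation_accuracy: float) -> float:
    """Difference between training and validation accuracy, in percentage points."""
    return training_accuracy - validation_accuracy


def shrinkage_is_minimal(training_accuracy, validation_accuracy, limit=2.0):
    """True when the two accuracy rates are within `limit` points of each other."""
    return abs(shrinkage(training_accuracy, validation_accuracy)) <= limit


# Accuracy rates reported for the training and validation samples.
observed = shrinkage(66.9, 66.7)

print(f"shrinkage: {observed:.1f} percentage points")
print("minimal shrinkage" if shrinkage_is_minimal(66.9, 66.7) else "excessive shrinkage")
```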



Slide 61

Summary of Findings - question

The final question is a summary of the findings of the analysis: overall relationship, individual relationships, and usefulness of the model.

Cautions are added, if needed, for sample size and level of measurement issues.



Slide 62

Summary of Findings – evidence and answer




Slide 63

Hierarchical binary logistic regression: level of measurement

Question: Variables included in the analysis satisfy the level of measurement requirements?

Dependent dichotomous? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: Ordinal independent variable included in analysis?
    No: True
    Yes: True with caution



Slide 64

Hierarchical binary logistic regression: sample size

Question: Number of variables and cases satisfy sample size requirements?

Run baseline logistic regression, using hierarchical method for including variables identified in the research question.

Record classification accuracy for evaluation of the effect of removing outliers.

Ratio of cases to independent variables at least 10 to 1?
  No: Inappropriate application of a statistic
  Yes: Ratio of cases to independent variables at least 20 to 1?
    No: True with caution
    Yes: True



Slide 65

Hierarchical binary logistic regression: detecting outliers

Question: Outliers were not detected in the analysis?

Outliers for the solution identified by studentized residuals > ±2.0?
  Yes: False
  No: True



Slide 66

Hierarchical binary logistic regression: selecting model for interpretation

Question: Interpret baseline model or model excluding outliers?

Outliers for the solution identified by studentized residuals > ±2.0?
  No: Pick baseline logistic regression for interpretation
  Yes: Run revised logistic regression excluding outliers, using method for including variables identified in research question.
    Classification accuracy omitting outliers better than baseline by 2% or more?
      No: Pick baseline logistic regression for interpretation (False)
      Yes: Pick logistic regression that omits outliers for interpretation (True)



Slide 67

Hierarchical binary logistic regression: multicollinearity or numerical problems

Question: no evidence of multicollinearity or numerical problems?

Standard errors of coefficients indicate presence of numerical problems (s.e. > 2.0)?
  Yes: False
  No: True

If numerical problem found, halt analysis until problem is resolved.



Slide 68

Hierarchical binary logistic regression: overall relationship

Question: overall relationship between independent variables and dependent variable?

Relationship confirmed by significance of block chi-square for predictors at step 2?
  No: False
  Yes: Caution for ordinal variable or sample size not meeting preferred requirements?
    Yes: True with caution
    No: True



Slide 69

Hierarchical binary logistic regression: relationships between IV's and DV

Question: Interpretation of relationship between independent variable and dependent variable groups?

Individual relationship confirmed by significance of Wald statistic?
  No: False
  Yes: Direction and size of odds ratio interpreted correctly?
    No: False
    Yes: Caution for ordinal variable or sample size not meeting preferred requirements?
      Yes: True with caution
      No: True



Slide 70

Hierarchical binary logistic regression: classification accuracy

Question: Classification accuracy sufficient to be characterized as a useful model?

Overall accuracy rate is 25% > than proportional by chance accuracy rate?
  No: False
  Yes: True



Slide 71

Hierarchical binary logistic regression: validation - 1

Question: Validation analysis supports generalizability of model?

Compute 75-25 split variable.

Re-run logistic regression, using method for including variables identified in the research question.

Block chi-square for predictors at Block 2 <= level of significance?
  No: False
  Yes: continue with the remaining validation checks



Slide 72

Hierarchical binary logistic regression: validation - 2

Significance of predictors in training sample matches pattern for model using full data set?
  No: False
  Yes: Shrinkage in classification accuracy for holdout sample < 2%?
    No: False
    Yes: True



Slide 73

Hierarchical binary logistic regression: summary of findings - 1

Question: Summary of findings correctly stated, including cautions?

Overall relationship correctly stated?
  No: False
  Yes: Individual relationship with IV and DV correctly stated?
    No: False
    Yes: Classification accuracy supports useful model?
      No: False
      Yes: continue with the checks for cautions



Slide 74

Hierarchical binary logistic regression: summary of findings - 2

One or more IV's are ordinal level variables?
  Yes: True with caution
  No: Satisfies preferred ratio of cases to IV's of 20 to 1?
    No: True with caution
    Yes: True