Hierarchical Binary Logistic Regression

In hierarchical binary logistic regression, we are testing a hypothesis or research question that some predictor independent variables improve our ability to predict membership in the modeled category of the dependent variable, after taking into account the relationship between some control independent variables and the dependent variable.

In multiple regression, we evaluated this question by looking at R² change, the increase in R² associated with adding the predictors to the regression analysis.

The analog to R² change in logistic regression is the Block Chi-square, which is the increase in Model Chi-square associated with the inclusion of the predictors.

In standard binary logistic regression, we interpreted the SPSS output that compared Block 0, a model with no independent variables, to Block 1, the model that included the independent variables.

In hierarchical binary logistic regression, the control variables are added in SPSS in Block 1, the predictor variables are added in Block 2, and the interpretation of the overall relationship is based on the change in the relationship from Block 1 to Block 2.

Output for Hierarchical Binary Logistic Regression after control variables are added

In this example, the control variables do not have a statistically significant relationship to the dependent variable, but they can still serve their purpose as controls.

After the controls are added, the measure of error, -2 Log Likelihood, is 195.412.

This output is for the sample problem worked below.

Output for Hierarchical Binary Logistic Regression after predictor variables are added

After the predictors are added, the measure of error, -2 Log Likelihood, is 168.542.

The hierarchical relationship is based on the reduction in error associated with the inclusion of the predictor variables.

Model Chi-square is the cumulative reduction in -2 log likelihood for the controls and the predictors.

The difference between the -2 log likelihood at Block 1 (195.412) and the -2 log likelihood at Block 2 (168.542) is Block Chi-square (26.870) which is significant at p < .001.
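The subtraction above can be checked with a short Python sketch. It is only an illustration, not SPSS's own routine: the helper name chi2_sf is made up here, and the 5 degrees of freedom are taken from the block chi-square reported later in this problem.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite sum of x^(j - 1/2) / (1*3*...*(2j-1))
    total = 0.0
    term = math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total)

neg2ll_block1 = 195.412   # -2 log likelihood after the controls
neg2ll_block2 = 168.542   # -2 log likelihood after the predictors
block_chi_square = neg2ll_block1 - neg2ll_block2
df_block = 5              # assumed: number of terms added in Block 2

p_value = chi2_sf(block_chi_square, df_block)
print(f"Block chi-square = {block_chi_square:.3f}, p = {p_value:.6f}")
```

The printed probability is far below .001, which agrees with the significance reported for Block 2.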

The Problem in Blackboard

The problem statement tells us:
• the variables included in the analysis
• whether each variable should be treated as metric or non-metric
• the type of dummy coding and reference category for non-metric variables
• the alpha for both the statistical relationships and for diagnostic tests

The Statement about Level of Measurement

The first statement in the problem asks about level of measurement. Hierarchical binary logistic regression requires that the dependent variable be dichotomous, the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls non-metric variables “categorical.”

SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided it is useful to use either the first or last category as the reference category.

Marking the Statement about Level of Measurement

Mark the check box as a correct statement because:

• The dependent variable "should marijuana be made legal" [grass] is dichotomous level, satisfying the requirement for the dependent variable.

• The independent variable "age" [age] is interval level, satisfying the requirement for independent variables.

• The independent variable "sex" [sex] is dichotomous level, satisfying the requirement for independent variables.

• The independent variable "strength of religious affiliation" [reliten] is ordinal level, which the problem instructs us to dummy-code as a non-metric variable.

• The independent variable "general happiness" [happy] is ordinal level, which the problem instructs us to dummy-code as a non-metric variable.

The Statement about Outliers

While we do not need to be concerned about normality, linearity, and homogeneity of variance, we need to determine whether or not outliers are substantially reducing the classification accuracy of the model.

To test for outliers, we run the binary logistic regression in SPSS and check for outliers. Next, we exclude the outliers and run the logistic regression a second time. We then compare the accuracy rates of the models with and without the outliers. If the accuracy rate of the model without outliers is higher by 2% or more than the accuracy rate of the model with outliers, we interpret the model excluding outliers.

Running the hierarchical binary logistic regression

Select the Regression | Binary Logistic… command from the Analyze menu.

Selecting the dependent variable

First, highlight the dependent variable grass in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Selecting the control independent variables

First, move the control independent variables stated in the problem (age and sex) to the Covariates list box.

Second, click on the Next button to start a new block and add the predictor independent variables.

Selecting the predictor independent variables

First, move the predictor independent variables stated in the problem (reliten and happy) to the Covariates list box.

Second, click on the Categorical button to specify which variables should be dummy coded.

Note that the block is now labeled at 2 of 2.

Declare the categorical variables - 1

Move the variables sex, reliten, and happy to the Categorical Covariates list box.

SPSS assigns its default method for dummy-coding, Indicator coding, to each variable, placing the name of the coding scheme in parentheses after each variable name.

Declare the categorical variables - 2

We accept the default of using the Indicator method for dummy-coding variables.

We will also accept the default of using the last category as the reference category for each variable.

Click on the Continue button to close the dialog box.

Specifying the method for including variables

Since the problem calls for a hierarchical binary logistic regression, we accept the default Enter method for including variables in both blocks.

Adding the values for outliers to the data set - 1

Click on the Save… button to request the statistics that we want to save.

SPSS will calculate the values for standardized residuals and save them to the data set so that we can check for outliers and remove the outliers easily if we need to run a model excluding outliers.

Adding the values for outliers to the data set - 2

First, mark the checkbox for Standardized residuals in the Residuals panel.

Second, click on the Continue button to complete the specifications.

Requesting the output

While optional statistical output is available, we do not need to request any optional statistics.

Click on the OK button to request the output.

Detecting the presence of outliers - 1

SPSS created a new variable, ZRE_1, which contains the standardized residual. If SPSS finds that the data set already contains a ZRE_1 variable, it will create ZRE_2.

I find it easier to delete the ZRE_1 variable after each analysis rather than have multiple ZRE_ variables in the data set, requiring that I remember which one goes with which analysis.


To detect outliers, we will sort the ZRE_1 column twice:
• first, in ascending order to identify outliers with a standardized residual of -2.58 or less.
• second, in descending order to identify outliers with a standardized residual of +2.58 or greater.

Click the right mouse button on the column header and select Sort Ascending from the pop-up menu.


After scrolling down past the cases with missing data (. in the ZRE_1 column), we see that we have one outlier that has a standardized residual of -2.58 or less.


To check for outliers with large positive standardized residuals, click the right mouse button on the column header and select Sort Descending from the pop-up menu.


After scrolling up to the top of the data set, we see that there are no outliers that have standardized residuals of +2.58 or more.

Since we found outliers, we will run the model excluding them and compare accuracy rates to determine which one we will interpret.

Had there been no outliers, we would move on to the issue of sample size.

Running the model excluding outliers - 1

We will use a Select Cases command to exclude the outliers from the analysis.


First, in the Select Cases dialog box, mark the option button If condition is satisfied.

Second, click on the If button to specify the condition.


To eliminate the outliers, we request that the cases that are not outliers be selected into the analysis.

The formula specifies that we should include cases if the absolute value of the standardized residual (ZRE_1) is less than 2.58.

The abs() or absolute value function tells SPSS to ignore the sign of the value.

After typing in the formula, click on the Continue button to close the dialog box.


SPSS displays the condition we entered on the Select Cases dialog box.

Click on the OK button to close the dialog box.


SPSS indicates which cases are excluded by drawing a slash across the case number.

Scrolling down in the data, we see that the outliers and cases with missing values are excluded.
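The selection rule can be illustrated outside SPSS with a small Python sketch. The function name and the list of residuals are hypothetical; only the -2.78 value and the 2.58 cutoff come from this problem, and None stands in for a case with missing data.

```python
def select_non_outliers(residuals, cutoff=2.58):
    """Return one boolean per case: True when the case stays in the analysis.

    A case is kept only if its standardized residual is present and its
    absolute value is below the cutoff, so missing residuals are excluded too.
    """
    return [r is not None and abs(r) < cutoff for r in residuals]

# Hypothetical residuals; -2.78 is the outlier found in this problem.
zre_1 = [0.41, -2.78, None, 1.95, -0.63]
selected = select_non_outliers(zre_1)
print(selected)  # [True, False, False, True, True]
```

Taking the absolute value lets a single condition catch both large positive and large negative residuals.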


To run the logistic regression excluding outliers, select Logistic Regression from the Dialog Recall menu.


Click on the Save button to open the dialog box.

The only change we will make is to clear the check box for saving standardized residuals.


First, clear the check box for Standardized residuals.

Second, click on the Continue button to close the dialog box.


Finally, click on the OK button to request the output.

Accuracy rate of the baseline model including all cases

Navigate to the Classification Table for the logistic regression with all cases. To distinguish the two models, I often refer to the first one as the baseline model.

The accuracy rate for the model with all cases is 71.3%.

Accuracy rate of the revised model excluding outliers

Navigate to the Classification Table for the logistic regression excluding outliers. To distinguish the two models, I often refer to the second one as the revised model.

The accuracy rate for the model excluding outliers is 71.1%.

Marking the statement for excluding outliers

In the initial logistic regression model, 1 case had a standardized residual of +2.58 or greater or -2.58 or lower:

- Case 20001058 had a standardized residual of -2.78

The classification accuracy of the model that excluded outliers (71.14%) was not greater by 2% or more than the classification accuracy for the model that included all cases (71.33%). The model including all cases should be interpreted.

The check box is not marked because removing outliers did not increase the accuracy of the model. All of the remaining statements will be evaluated based on the output for the model that includes all cases.
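The decision rule for choosing between the two models can be written as a short Python sketch. The function name is hypothetical, and the 2% margin is read as two percentage points, which is an interpretation of the rule rather than something SPSS reports.

```python
def model_to_interpret(baseline_accuracy, revised_accuracy, margin=2.0):
    """Pick the model to interpret; accuracies are percentages (0-100)."""
    if revised_accuracy >= baseline_accuracy + margin:
        return "revised"    # excluding outliers helped enough
    return "baseline"       # keep the model with all cases

choice = model_to_interpret(baseline_accuracy=71.33, revised_accuracy=71.14)
print(choice)  # baseline
```

Here the revised model is slightly less accurate than the baseline, so the baseline model is the one interpreted.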

The statement about multicollinearity and other numerical problems

Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation', whereby the two groups in the dependent variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

Checking for multicollinearity

The standard errors for the variables included in the analysis were:
• "age" [age]: .01
• survey respondents who said that overall they were not too happy: .92
• survey respondents who said that overall they were pretty happy: .47
• survey respondents who said they had no religious affiliation: .53
• survey respondents who said they had a somewhat strong religious affiliation: .70
• survey respondents who said they had a not very strong religious affiliation: .47
• survey respondents who were male: .39

Marking the statement about multicollinearity and other numerical problems

Since none of the independent variables in this analysis had a standard error larger than 2.0, we mark the check box to indicate there was no evidence of multicollinearity.
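As a rough illustration, the same check can be done in Python by scanning the reported standard errors for any value above 2.0. The dictionary keys are shorthand labels invented for this sketch; the values are the ones listed above.

```python
SE_CUTOFF = 2.0  # standard errors above this suggest numerical problems

standard_errors = {
    "age": 0.01,
    "happy: not too happy": 0.92,
    "happy: pretty happy": 0.47,
    "reliten: none": 0.53,
    "reliten: somewhat strong": 0.70,
    "reliten: not very strong": 0.47,
    "sex: male": 0.39,
}

# Collect any terms whose standard error exceeds the cutoff.
flagged = [name for name, se in standard_errors.items() if se > SE_CUTOFF]
largest = max(standard_errors.values())
print(f"largest S.E. = {largest}, flagged terms = {flagged}")
```

An empty list of flagged terms corresponds to marking the check box for no multicollinearity.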

The statement about sample size

Hosmer and Lemeshow, who wrote the widely used text on logistic regression, suggest that the sample size should be 10 cases for every independent variable.

The output for sample size

The 150 cases available for the analysis satisfied the sample size of 70 (10 cases per independent variable) recommended by Hosmer and Lemeshow for logistic regression.

We find the number of cases included in the analysis in the Case Processing Summary.

Marking the statement for sample size

Since we satisfy the sample size requirement, we mark the check box.
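The figure of 70 follows from counting each dummy-coded term as an independent variable. The sketch below assumes the category counts implied by the dummy variables in this problem (two for sex, four for reliten, three for happy); the variable names in the code are illustrative.

```python
CASES_PER_VARIABLE = 10  # guideline attributed to Hosmer and Lemeshow

metric_variables = ["age"]
# Number of categories for each non-metric variable (assumed from the
# dummy-coded terms that appear in the output for this problem).
categories = {"sex": 2, "reliten": 4, "happy": 3}

# Each non-metric variable contributes (categories - 1) dummy-coded terms.
n_terms = len(metric_variables) + sum(k - 1 for k in categories.values())
required_cases = n_terms * CASES_PER_VARIABLE
available_cases = 150

print(n_terms, required_cases, available_cases >= required_cases)  # 7 70 True
```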

The hierarchical relationship between the dependent and independent variables

In a hierarchical logistic regression, the presence of a relationship between the dependent variable and combination of independent variables entered after the control variables have been taken into account is based on the statistical significance of the block chi-square for the second block of variables in which the predictor independent variables are included.

The output for the hierarchical relationship

In this analysis, the probability of the block chi-square was less than or equal to the alpha of 0.05 (χ²(5, N = 150) = 26.87, p < .001). The null hypothesis that there is no difference between the model with only the control variables versus the model with the predictor independent variables was rejected.

The existence of the hierarchical relationship between the predictor independent variables and the dependent variable was supported.

Marking the statement for hierarchical relationship

Since the hierarchical relationship was statistically significant, we mark the check box.

The statement about the relationship between age and legalization of marijuana

Having satisfied the criteria for the hierarchical relationship, we examine the findings for individual relationships with the dependent variable. If the overall relationship were not significant, we would not interpret the individual relationships.

The first statement concerns the relationship between age and legalization of marijuana.

Output for the relationship between age and legalization of marijuana

The probability of the Wald statistic for the control independent variable "age" [age] (χ²(1, N = 150) = 1.83, p = .176) was greater than the level of significance of .05. The null hypothesis that the b coefficient for "age" [age] was equal to zero was not rejected. "Age" [age] does not have an impact on the odds that survey respondents supported the legalization of marijuana. The analysis does not support the relationship that 'For each unit increase in "age", survey respondents were 1.7% less likely to support the legalization of marijuana'.

Marking the statement for relationship between age and legalization of marijuana

Since the relationship was not statistically significant, we do not mark the check box for the statement.
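The probability attached to a Wald statistic with 1 degree of freedom can be reproduced with the complementary error function. This is a minimal sketch for checking the reported values; the function name is hypothetical, and small differences can arise because the Wald statistics in the output are rounded.

```python
import math

def wald_p_value(wald):
    """Upper-tail probability of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(wald / 2))

alpha = 0.05
p_age = wald_p_value(1.83)   # Wald statistic reported for age
print(f"p = {p_age:.3f}, significant = {p_age <= alpha}")
```

For age the probability comes out near .176, above the alpha of .05, so the null hypothesis is not rejected.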

Statement for relationship between general happiness and legalization of marijuana

The next statement concerns the relationship between the dummy-coded variable for general happiness and legalization of marijuana.

Output for relationship between general happiness and legalization of marijuana

The probability of the Wald statistic for the predictor independent variable survey respondents who said that overall they were not too happy (χ²(1, N = 150) = 13.96, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that overall they were not too happy was equal to zero was rejected. The value of Exp(B) for the variable survey respondents who said that overall they were not too happy was 31.642, which implies the odds were multiplied by approximately 31.6. The statement that 'Survey respondents who said that overall they were not too happy were approximately 31.6 times more likely to support the legalization of marijuana compared to those who said that overall they were very happy' is correct.

Marking the relationship between general happiness and legalization of marijuana

Since the relationship was statistically significant, and the statement that survey respondents who said that overall they were not too happy were approximately 31.6 times more likely to support the legalization of marijuana compared to those who said that overall they were very happy is correct, the statement is marked.

Statement for relationship between general happiness and legalization of marijuana

The next statement concerns the relationship between the dummy-coded variable for general happiness and legalization of marijuana.

Output for relationship between general happiness and legalization of marijuana

The probability of the Wald statistic for the predictor independent variable survey respondents who said that overall they were pretty happy (χ²(1, N = 150) = 3.42, p = .064) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that overall they were pretty happy was equal to zero was not rejected. The variable for survey respondents who said that overall they were pretty happy does not have an impact on the odds that survey respondents supported the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who said that overall they were pretty happy were approximately two and a quarter times more likely to support the legalization of marijuana compared to those who said that overall they were very happy'.

Marking the relationship between general happiness and legalization of marijuana

Since the relationship was not statistically significant, the check box is not marked.


Statement for relationship between religious affiliation and legalization of marijuana

The next statement concerns the relationship between the dummy-coded variable for religious affiliation and legalization of marijuana.

Output for relationship between religious affiliation and legalization of marijuana

The probability of the Wald statistic for the predictor independent variable survey respondents who said they had no religious affiliation (χ²(1, N = 150) = 4.39, p = .036) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said they had no religious affiliation was equal to zero was rejected. The value of Exp(B) for the variable survey respondents who said they had no religious affiliation was 3.035, which implies the odds increased by approximately three times. The statement that 'Survey respondents who said they had no religious affiliation were approximately three times more likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation' is correct.

Marking the relationship between religious affiliation and legalization of marijuana

Since the relationship was statistically significant, and the statement that survey respondents who said they had no religious affiliation were approximately three times more likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation is correct, the statement is marked.


The next statement concerns the relationship between the dummy-coded variable for a somewhat strong religious affiliation and legalization of marijuana.

Output for the relationship between religious affiliation and legalization of marijuana

The probability of the Wald statistic for the predictor independent variable survey respondents who said they had a somewhat strong religious affiliation (χ²(1, N = 150) = .67, p = .414) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said they had a somewhat strong religious affiliation was equal to zero was not rejected. The variable for survey respondents who said they had a somewhat strong religious affiliation does not have an impact on the odds that survey respondents support the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who said they had a somewhat strong religious affiliation were 43.7% less likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation'.


The next statement concerns the relationship between the dummy-coded variable for a not very strong religious affiliation and legalization of marijuana.

Output for the relationship between religious affiliation and legalization of marijuana

The probability of the Wald statistic for the predictor independent variable survey respondents who said they had a not very strong religious affiliation (χ²(1, N = 150) = .24, p = .626) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said they had a not very strong religious affiliation was equal to zero was not rejected. The variable for survey respondents who said they had a not very strong religious affiliation does not have an impact on the odds that survey respondents support the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who said they had a not very strong religious affiliation were 25.8% more likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation'.


Since the relationship was not statistically significant, the check box is not marked.

The statement for the relationship between sex and legalization of marijuana

The next statement concerns the relationship between sex and legalization of marijuana.

Output for the relationship between sex and legalization of marijuana

The probability of the Wald statistic for the control independent variable survey respondents who were male (χ²(1, N = 150) = .13, p = .719) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who were male was equal to zero was not rejected. The variable for survey respondents who were male does not have an impact on the odds that survey respondents support the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who were male were 13.1% less likely to support the legalization of marijuana compared to those who were female'.

Marking the statement for the relationship between sex and legalization of marijuana

Since the relationship was not statistically significant, the check box is not marked.
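The wording used for the individual relationships follows from the size of Exp(B): values well above 1 are usually stated as a multiple of the odds, and values near or below 1 as a percentage change. The Python sketch below shows one way to phrase this. The function name and the threshold of 2 for switching between the two phrasings are conventions chosen for this illustration, and the 0.5 value is hypothetical.

```python
def describe_odds_ratio(exp_b):
    """Turn Exp(B) into a short statement about the change in the odds."""
    if exp_b >= 2:
        return f"odds multiplied by about {exp_b:.1f}"
    change = (exp_b - 1) * 100          # percentage change in the odds
    direction = "increase" if change >= 0 else "decrease"
    return f"odds {direction} by about {abs(change):.1f}%"

print(describe_odds_ratio(31.642))  # not too happy vs. very happy
print(describe_odds_ratio(3.035))   # no affiliation vs. strong affiliation
print(describe_odds_ratio(0.5))     # hypothetical value below 1
```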

Statement about the usefulness of the model based on classification accuracy

The final statement concerns the usefulness of the logistic regression model. The independent variables could be characterized as useful predictors distinguishing survey respondents who supported the legalization of marijuana from survey respondents who did not if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.

Computing proportional by-chance accuracy rate

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (.633² + .367² = .536).

The proportion in the largest group is 63.3% or .633. The proportion in the other group is 1.0 – 0.633 = .367.

At Block 0 with no independent variables in the model, all of the cases are predicted to be members of the modal group, 1=Legal in this example.

Output for the usefulness of the model based on classification accuracy

To be characterized as a useful model, the accuracy rate should be 25% higher than the by chance accuracy rate.

The by chance accuracy criteria is computed by multiplying the by chance accuracy rate of .536 by 1.25, or 1.25 x .536 = .669 (66.9%).

The classification accuracy rate computed by SPSS was 71.3% which was greater than or equal to the proportional by chance accuracy criteria of 66.9% (1.25 x 53.6% = 66.9%). The criteria for classification accuracy is satisfied.

Marking the statement for usefulness of the model

Since the criteria for classification accuracy was satisfied, the check box is marked.
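The by chance computation can be reproduced with a few lines of Python. The group counts of 95 and 55 are an assumption inferred from 63.3% of 150 cases, since only the proportions are given here; the function name is hypothetical.

```python
def proportional_by_chance(group_counts):
    """Sum of squared group proportions from the Step 0 classification table."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)

group_counts = [95, 55]        # assumed counts: 63.3% and 36.7% of 150 cases
chance = proportional_by_chance(group_counts)
criterion = 1.25 * chance      # accuracy must be 25% higher than chance
model_accuracy = 0.713         # classification accuracy reported by SPSS

useful = model_accuracy >= criterion
print(f"by chance = {chance:.3f}, criterion = {criterion:.3f}, useful = {useful}")
```

Working from the unrounded proportions gives .536 for the by chance rate and .669 for the criterion, matching the figures above.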

Hierarchical Binary Logistic Regression: Level of Measurement

Level of measurement ok?
• Yes: Mark check box for level of measurement.
• No: Ordinal level variable treated as metric?
  • Yes: Consider limitation in discussion of findings. Mark check box for level of measurement.
  • No: Do not mark check box for level of measurement. Mark: Inappropriate application of the statistic. Stop.

Standard Binary Logistic Regression: Exclude Outliers

Run Baseline Binary Logistic Regression, Including All Cases, Requesting Standardized Residuals.

Run Revised Binary Logistic Regression, Excluding Outliers (absolute value of standardized residuals >= 2.58).

Accuracy rate for revised model >= accuracy rate for baseline model + 2%?
• Yes: Interpret revised model. Mark check box for excluding outliers.
• No: Interpret baseline model. Do not mark check box for excluding outliers.

Hierarchical Binary Logistic Regression: Multicollinearity and Sample Size

Multicollinearity/Numerical Problems (S. E. > 2.0)?
• Yes: Do not mark check box for no multicollinearity. Stop.
• No: Mark check box for no multicollinearity.

Adequate Sample Size (Number of IV’s x 10)?
• Yes: Mark check box for sample size.
• No: Do not mark check box for sample size. Consider limitation in discussion of findings.

Hierarchical Binary Logistic Regression: Hierarchical Relationship

Probability of Block Chi-square for Block 2 ≤ α?
• Yes: Mark check box for hierarchical relationship.
• No: Do not mark check box for hierarchical relationship. Stop.

The biggest distinction between hierarchical and standard models is our focus on the contribution of the predictors in addition to the controls.

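Hierarchical Binary Logistic Regression: Individual Relationships

Individual relationship (Wald Sig ≤ α)?
• Yes: Go on to the question about interpretation.
• No: Do not mark check box for individual relationship.

Correct interpretation of direction and strength of relationship?
• Yes: Mark check box for individual relationship.
• No: Do not mark check box for individual relationship.

Additional individual relationships to interpret?
• Yes: Repeat the questions above for the next relationship.
• No: No further individual relationships to evaluate.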

Hierarchical Binary Logistic Regression: Classification Accuracy

Classification accuracy > 1.25 x by chance accuracy rate?
• Yes: Mark check box for classification accuracy.
• No: Do not mark check box for classification accuracy.
