Transcript
Page 1: Slide 1 Standard Binary Logistic Regression. Slide 2 Logistic regression  Logistic regression is used to analyze relationships between a dichotomous

Slide 1

Standard Binary Logistic Regression

Slide 2

Logistic regression

Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or non-metric independent variables. (SPSS now supports Multinomial Logistic Regression that can be used with more than two groups, but our focus now is on binary logistic regression for two groups.)

Logistic regression combines the independent variables to estimate the probability that a particular event will occur, i.e. a subject will be a member of one of the groups defined by the dichotomous dependent variable. In SPSS, the model is always constructed to predict the group with higher numeric code. If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event.

Predicting the “No” event creates some awkward wording in our problems. Our only option for changing this is to recode the variable.

If the probability for group membership in the modeled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modeled group. If the probability is below the cut point, the subject is predicted to be a member of the other group.

For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category.
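The cut point rule above can be sketched in a few lines of Python. This is a minimal illustration, not SPSS output: the constant, coefficient, and score below are hypothetical values, and treating a probability exactly equal to the cut point as the other group is an assumption, since the slide only covers "above" and "below".

```python
import math

def modeled_probability(constant, coefficients, values):
    """Probability of the modeled event from the logistic equation."""
    logit = constant + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-logit))

def predicted_group(probability, cut_point=0.50):
    """'modeled' when the probability is above the cut point, else 'other'.

    A probability exactly at the cut point is treated as 'other' (assumption).
    """
    return "modeled" if probability > cut_point else "other"

# Hypothetical constant, coefficient, and score, for illustration only.
p = modeled_probability(constant=-1.0, coefficients=[0.5], values=[3.0])
print(round(p, 3), predicted_group(p))
```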

Slide 3

Level of measurement requirements

Logistic regression analysis requires that the dependent variable be dichotomous.

Logistic regression analysis requires that the independent variables be metric or non-metric. The logistic regression procedure will dummy-code non-metric variables for us. For logistic regression, we will use indicator dummy-coding, rather than deviation dummy-coding since I think it makes more sense to compare the odds for two groups rather than compare the odds for one group to the average odds for all groups.

If an independent variable is ordinal, we can either treat it as non-metric and dummy-code it or we can treat it as interval, in which case we will attach the usual caution.

Dichotomous independent variables do not have to be dummy-coded, but in our problems we will have SPSS dummy-code them because then we do not need to worry about the original codes for the variable as we can always interpret

Slide 4

Dummy-coding in SPSS - 1

When we want SPSS to dummy-code a variable, we enter the specifications in the Define Categorical Variables dialog box. Here we are dummy-coding sex, using the defaults of indicator coding with the last category as the reference category.

In the table of coefficients, the dummy-coded variable is referred to by its original name plus the value for the Parameter coding in the Categorical Variables Codings table.

SPSS shows you its coding scheme in the table of Categorical Variables Codings in the output. Since we chose the last category as reference, FEMALE is coded 0.

Slide 5

Dummy-coding in SPSS - 2

Variables in the Equation

                       B        S.E.    Wald      df    Sig.
Step 1(a)  sex(1)      -1.590   .361    19.427    1     .000
           Constant    -.235    .229    1.047     1     .306

a. Variable(s) entered on step 1: sex.

Here we are dummy-coding sex, using the default indicator coding but with the first category as the reference category. Note that you must click on the Change button after selecting the First option button.

In the table of coefficients, the dummy-coded variable is still referred to by its original name plus the value for the Parameter coding in the Categorical Variables Codings table, but in this case it stands for females.

SPSS shows you its coding scheme in the table of Categorical Variables Codings in the output. Since we chose the FIRST category as reference, MALE is coded 0.
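The effect of the reference category choice can be sketched as follows. This is only an illustration of indicator coding, not SPSS's own routine; the function name is hypothetical, and the original codes 1 = MALE and 2 = FEMALE are an assumption consistent with the two slides above.

```python
def indicator_code(value, categories, reference="last"):
    """Return the 0/1 indicator columns for one value of a categorical variable.

    The reference category gets 0 in every column; each other category
    gets a 1 in its own column.
    """
    ordered = sorted(categories)
    ref = ordered[-1] if reference == "last" else ordered[0]
    non_reference = [c for c in ordered if c != ref]
    return [1 if value == c else 0 for c in non_reference]

# Assumed original codes: 1 = MALE, 2 = FEMALE.
MALE, FEMALE = 1, 2
print(indicator_code(FEMALE, [MALE, FEMALE], reference="last"))   # FEMALE is the reference
print(indicator_code(MALE, [MALE, FEMALE], reference="first"))    # MALE is the reference
```

With the last category as reference, sex(1) stands for males; with the first category as reference, sex(1) stands for females.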

Slide 6

Assumptions

Logistic regression does not make any assumptions of normality, linearity, and homogeneity of variance for the independent variables.

When the variables satisfy the assumptions of normality, linearity, and homogeneity of variance, discriminant analysis has historically been cited as the more effective statistical procedure for evaluating relationships with a non-metric dependent variable. However, logistic regression is being used more and more frequently because it can be interpreted similarly to other general linear model problems.

When the variables do not satisfy the assumptions of normality, linearity, and homogeneity of variance, logistic regression is the statistic of choice since it does not make these assumptions.

Multicollinearity is a problem for logistic regression with the same consequences as multiple regression, i.e. we are likely to misinterpret the contribution of independent variables when they are collinear. SPSS does not compute tolerance values for logistic regression, so we will detect it through the examination of standard errors. We will not interpret models when evidence of multicollinearity is found.

Evidence of multicollinearity is detected as a numerical problem in the attempted solution.

Slide 7

Numerical problems

The maximum likelihood method used to calculate logistic regression is an iterative fitting process that attempts to cycle through repetitions to find an answer.

Sometimes, the method will break down and not be able to converge or find an answer.

Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions. These implausible results can be produced by multicollinearity, categories of predictors having no cases or zero cells, and complete separation whereby the two groups are perfectly separated by the scores on one or more independent variables.

The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables (this does not apply to the constant).
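The standard error check can be written as a small screening function. This is a sketch under the 2.0 rule stated above; the function name and the second set of standard errors are hypothetical, while the first set comes from the Variables in the Equation table for sex.

```python
def numerical_problem_terms(standard_errors, threshold=2.0, constant_name="Constant"):
    """Names of independent-variable terms whose S.E. exceeds the threshold.

    The constant is skipped, since the rule does not apply to it.
    """
    return [name for name, se in standard_errors.items()
            if name != constant_name and se > threshold]

# Standard errors from the sex example: no numerical problem is flagged.
print(numerical_problem_terms({"sex(1)": 0.361, "Constant": 0.229}))

# Hypothetical standard errors showing a flagged term.
print(numerical_problem_terms({"x1": 0.40, "x2(1)": 4512.3, "Constant": 3.1}))
```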

Slide 8

Sample size requirements

The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for Logistic Regression.

If we do not meet the sample size requirement, it is suggested that this be mentioned as a limitation to our analysis. If the relationships between predictors and the dependent variable are strong, we may still attain statistical significance with smaller samples.

Slide 9

Methods for including variables

SPSS supports the three methods for including variables in the regression equation:

• the standard or simultaneous method, in which all independents are included at the same time

• the hierarchical method, in which control variables are entered in the analysis before the predictors whose effects we are primarily concerned with

• the stepwise method (forward conditional or forward LR in SPSS), in which variables are selected in the order in which they maximize the statistically significant contribution to the model

For all methods, the contribution to the model is measured by the model chi-square, a statistical measure of the fit between the dependent and independent variables, like R².

Slide 10

Computational method

Multiple regression uses the least-squares method to find the coefficients for the independent variables in the regression equation, i.e. it computes coefficients that minimize the squared residuals summed over all cases.

Logistic regression uses maximum-likelihood estimation to compute the coefficients for the logistic regression equation. This method attempts to find coefficients that match the breakdown of cases on the dependent variable.

The overall measure of how well the model fits is given by the -2 log likelihood value, which is similar to the residual or error sum of squares value for multiple regression. A model that fits the data well will have a small -2 log likelihood value. A perfect model would have a -2 log likelihood value of zero.

Maximum-likelihood estimation is an iterative procedure that works successively to get closer and closer to the correct answer. When SPSS reports the "iterations," it is telling us how many cycles it took to get the answer.

Slide 11

Overall test of relationship

Errors in a logistic regression model are measured in terms of “-2 log likelihood” values, which are analogous to the “total sum of squares”. When an independent variable has a relationship to the dependent variable, the measure of error decreases. Since the log likelihood is a negative number, “-2 log likelihood” (abbreviated as -2LL) is a positive number, and an improvement in the relationship is indicated by a smaller number, e.g. if -2LL were 200, a -2LL of 100 would represent an improvement.

The overall test of relationship among the independent variables and groups defined by the dependent variable is based on the reduction in the -2 log likelihood value between a model which does not contain any independent variables and the model that contains the independent variables.

This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square.

The significance test for the model chi-square is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables.

In a hierarchical logistic regression, the significance test for the addition of the predictor variables is based on the block chi-square in the omnibus tests of model coefficients.

Slide 12

Overall test of relationship in SPSS output

Though the iteration history is not usually an output of interest, it does show us how the model chi-square value is derived.

The original -2 Log Likelihood value is 213.891.

At the end of this step, the -2 Log Likelihood value is 192.726.

213.891 – 192.726 = 21.165, the value for Model Chi-square in the table of Omnibus Tests of Model Coefficients.
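The subtraction above, and the significance test that follows from it, can be sketched in Python. The two -2 Log Likelihood values are the ones on this slide; the p-value helper is a hypothetical closed-form routine for integer degrees of freedom, and the degrees of freedom (the number of independent-variable terms added to the model) are not shown on this slide, so they must be supplied by the reader.

```python
import math

def model_chi_square(initial_neg2ll, final_neg2ll):
    """Reduction in -2LL from the model with no independent variables."""
    return initial_neg2ll - final_neg2ll

def chi_square_p_value(chi_square, df):
    """Upper-tail chi-square probability for a positive integer df."""
    y = chi_square / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        total = sum(y ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-y) * total
    # Odd df: complementary error function plus a finite correction sum.
    total = sum(y ** (j - 0.5) / math.gamma(j + 0.5)
                for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total

chi_sq = model_chi_square(213.891, 192.726)
print(round(chi_sq, 3))
```

Calling chi_square_p_value(chi_sq, df) with the df reported in the Omnibus Tests of Model Coefficients table should reproduce the significance value SPSS prints for the model.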

Slide 13

Relationship of Individual Independent Variables and Dependent Variable

There is a test of significance for the relationship between an individual independent variable and the dependent variable: a significance test of the Wald statistic.

The individual coefficients represent the change in the log odds of being a member of the modeled category. Because they are expressed in log units, they are not directly interpretable. However, if the b coefficient is used as the power to which the base of the natural logarithm (2.71828) is raised, the result represents the change in the odds of the modeled event associated with a one-unit change in the independent variable.

If a coefficient is positive, its transformed log value will be greater than one, meaning that the modeled event is more likely to occur. If a coefficient is negative, its transformed log value will be less than one, and the odds of the event occurring decrease. A coefficient of zero (0) has a transformed log value of 1.0, meaning that this coefficient does not change the odds of the event one way or the other.

The interpretive statement for individual relationships, provided they are statistically significant, incorporates the odds ratio or Exp(B) in SPSS output.
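The transformation from B to Exp(B) is a single exponentiation, sketched below. The three coefficients in the loop are hypothetical values chosen only to show the positive, zero, and negative cases; the function name is likewise an illustration.

```python
import math

def odds_ratio(b):
    """Exp(B): e raised to the power of the coefficient."""
    return math.exp(b)

# Hypothetical coefficients: positive, zero, and negative.
for b in (0.5, 0.0, -0.5):
    print(b, round(odds_ratio(b), 3))
```

A positive coefficient gives a value above 1.0, zero gives exactly 1.0, and a negative coefficient gives a value below 1.0.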

Slide 14

Interpreting individual relationships - 1

Exp(B) can be interpreted as a percentage change by subtracting 1.0 from the Exp(B) value.

In this example, Exp(B) – 1.0 = .204 – 1.0 = -.796

We can state this finding as females (sex(1) value in this example) were 79.6% less likely to …

Note: in this example, sex was coded so that males was the reference category.

Slide 15

Interpreting individual relationships - 2

Exp(B) can be interpreted as a multiplier when percentage change is confusing.

We can state this finding as males (sex(1) value in this example) were 4.9 or approximately 5 times more likely to …

In this example, Exp(B) – 1.0 = 4.902 – 1.0 = 3.902, or 390.2% more likely.

Note: in this example, sex was coded so that females was the reference category.
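The two readings of Exp(B), as a percentage change and as a multiplier, can be checked with a short sketch. The values .204 and 4.902 are the ones from these two slides; the function name is hypothetical. The last line illustrates why the two examples mirror each other: switching the reference category of a dichotomous predictor inverts the odds ratio.

```python
def percent_change_in_odds(exp_b):
    """Percentage change in the odds implied by Exp(B)."""
    return (exp_b - 1.0) * 100.0

# Males as the reference category: Exp(B) for females.
print(round(percent_change_in_odds(0.204), 1))

# Females as the reference category: Exp(B) for males.
print(round(percent_change_in_odds(4.902), 1))

# Reversing the reference category inverts the odds ratio.
print(round(1.0 / 0.204, 3))
```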

Slide 16

Strength of logistic regression relationship

While logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model.

A more useful measure to assess the utility of a logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.

Slide 17

Evaluating usefulness for logistic models

The benchmark that we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.

Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.

The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by squaring the proportion of cases in each group and summing the results.

Slide 18

Comparing accuracy rates

To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for logistic regression.)

Classification Table(a)

                                                        Predicted
                                                        EXPECT U.S. IN WORLD WAR IN 10 YEARS
Observed                                                YES      NO       Percentage Correct
Step 1   EXPECT U.S. IN WORLD WAR IN 10 YEARS   YES     20       34       37.0
                                                NO      10       72       87.8
         Overall Percentage                                               67.6

a. The cut value is .500

SPSS reports the overall accuracy rate in the Classification Table. The overall accuracy rate computed by SPSS was 67.6% in this example.
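The percentages in the Classification Table can be recomputed from its four cell counts. The sketch below uses the counts from the Step 1 table; the function name and the dictionary layout are hypothetical conveniences, not anything SPSS produces.

```python
def classification_accuracy(table):
    """Per-group and overall percent correct.

    `table` maps each observed group to a dict of predicted-group counts.
    """
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(row[observed] for observed, row in table.items())
    per_group = {observed: 100.0 * row[observed] / sum(row.values())
                 for observed, row in table.items()}
    return per_group, 100.0 * correct / total

# Cell counts from the Step 1 Classification Table.
step1 = {"YES": {"YES": 20, "NO": 34}, "NO": {"YES": 10, "NO": 72}}
per_group, overall = classification_accuracy(step1)
print({group: round(pct, 1) for group, pct in per_group.items()}, round(overall, 1))
```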

Slide 19

Computing by chance accuracy

The number of cases in each group is found in the Classification Table at Step 0 (before any independent variables are included). The proportion of cases in the largest group is equal to the overall percentage (60.3%).

Classification Table(a,b)

                                                        Predicted
                                                        EXPECT U.S. IN WORLD WAR IN 10 YEARS
Observed                                                YES      NO       Percentage Correct
Step 0   EXPECT U.S. IN WORLD WAR IN 10 YEARS   YES     0        54       .0
                                                NO      0        82       100.0
         Overall Percentage                                               60.3

a. Constant is included in the model.

b. The cut value is .500

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (0.397² + 0.603² = 0.521).

The proportional by chance accuracy criterion is 65.1% (1.25 x 52.1% = 65.1%).

Since the accuracy rate in this example, 67.6%, is greater than the 65.1% by chance accuracy criterion, this model would be characterized as useful.
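The by chance accuracy calculation can be reproduced from the two group counts in the Step 0 table (54 and 82). This is a minimal sketch of the rule described above; the function names are hypothetical, and the 1.25 multiplier is the 25% improvement benchmark used in these slides.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of the squared proportions of cases in each group."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)

def usefulness(model_accuracy, group_counts, improvement=1.25):
    """Return (is_useful, criterion), with accuracy given as a proportion."""
    criterion = improvement * proportional_by_chance_accuracy(group_counts)
    return model_accuracy > criterion, criterion

by_chance = proportional_by_chance_accuracy([54, 82])
useful, criterion = usefulness(0.676, [54, 82])
print(round(by_chance, 3), round(criterion, 3), useful)
```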

Slide 20

Outliers

Logistic regression models the relationship between a set of independent variables and the probability that a case is a member of one of the categories of the dependent variable. (In SPSS, the modeled category is the one with the higher numeric code.) If the probability is greater than 0.50, the case is classified in the modeled category. If the probability is less than 0.50, the case is classified in the other category.

The actual probability of the modeled event for any case is either 1.0 or 0.0, i.e. a case is in the modeled category or it is not.

The residual is the difference between the actual probability and the predicted probability for a case. If the predicted probability for a case that actually belonged to the modeled category was 0.80, the residual would be 1.00 – 0.80 = 0.20.

The residual can be standardized by dividing it by an estimate of its standard deviation. Since the dependent variable is dichotomous or binary, the standard deviation for proportions is used.
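The residual and its standardized form can be sketched as follows. The sketch assumes that "the standard deviation for proportions" means the square root of p(1 - p) for the predicted probability p; the function name is hypothetical.

```python
import math

def standardized_residual(actual, predicted_probability):
    """Residual divided by the binomial standard deviation sqrt(p * (1 - p)).

    `actual` is 1.0 if the case is in the modeled category, otherwise 0.0.
    """
    residual = actual - predicted_probability
    return residual / math.sqrt(predicted_probability * (1.0 - predicted_probability))

# The example above: a case in the modeled category predicted at 0.80.
print(round(standardized_residual(1.0, 0.80), 3))

# A case in the modeled category predicted at only 0.10 has a large residual.
print(round(standardized_residual(1.0, 0.10), 3))
```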

Slide 21

Strategy for Outliers

Our strategy for evaluating the impact of outliers on our logistic regression model will parallel what we have done for multiple regression and discriminant analysis:

• First, we run a baseline model including all cases.

• Second, we run a model excluding outliers whose standardized residual is greater than 2.58 or less than -2.58 (z-score for p = .01).

• If the model excluding outliers has a classification accuracy rate that is 2% or more higher than the accuracy rate of the baseline model, we will interpret the revised model. If the revised model without outliers is less than 2% more accurate, we will interpret the baseline model.
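The decision rule for choosing between the two models can be written down directly. This is a sketch that reads "2% or more higher" as a difference of two percentage points, which is an assumption; the function name and the accuracy rates in the example are hypothetical.

```python
def model_to_interpret(baseline_accuracy, revised_accuracy, threshold=2.0):
    """Pick the model to interpret; accuracy rates are in percent.

    The difference is rounded to avoid floating-point noise at the boundary.
    """
    gain = round(revised_accuracy - baseline_accuracy, 6)
    return "revised" if gain >= threshold else "baseline"

# Hypothetical accuracy rates, for illustration only.
print(model_to_interpret(70.0, 73.5))
print(model_to_interpret(70.0, 71.0))
```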

Slide 22

The Problem in Blackboard

The problem statement tells us:

• the variables included in the analysis

• whether each variable should be treated as metric or non-metric

• the type of dummy coding and reference category for non-metric variables

• the alpha for both the statistical relationships and for diagnostic tests

Slide 23

The Statement about Level of Measurement

The first statement in the problem asks about level of measurement. Standard binary logistic regression requires that the dependent variable be dichotomous, the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls non-metric variables “categorical.”

SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided it is useful to use either the first or last category as the reference category.

Slide 24

Marking the Statement about Level of Measurement

Mark the check box as a correct statement because:

• The dependent variable "computer use" [compuse] is dichotomous level, satisfying the requirement for the dependent variable.

• The independent variables "highest year of school completed" [educ] and "socioeconomic index" [sei] are interval level, satisfying the requirement for independent variables.

• The independent variable "sex" [sex] is dichotomous level, satisfying the requirement for independent variables.

• The independent variable "condition of health" [health] is ordinal level which the problem instructs us to dummy-code as a non-metric variable.

Slide 25

The Statement about Outliers

While we do not need to be concerned about normality, linearity, and homogeneity of variance, we need to determine whether or not outliers were substantially reducing the classification accuracy of the model.

To test for outliers, we run the binary logistic regression in SPSS and check for outliers. Next, we exclude the outliers and run the logistic regression a second time. We then compare the accuracy rates of the models with and without the outliers. If the model without outliers is 2% or more accurate than the model with outliers, we interpret the model excluding outliers.

Slide 26

Running the standard binary logistic regression

Select the Regression | Binary Logistic… command from the Analyze menu.

Slide 27

Selecting the dependent variable

First, highlight the dependent variable compuse in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Slide 28

Selecting the independent variables

Move the independent variables stated in the problem to the Covariates list box.

Slide 29

Declare the categorical variables - 1

To tell SPSS that two of the variables are non-metric and need to be dummy-coded, click on the Categorical button.

Slide 30

Declare the categorical variables - 2

Move the variables sex and health to the Categorical Covariates list box.

SPSS assigns its default method for dummy-coding, Indicator coding, to each variable, placing the name of the coding scheme in parentheses after each variable name.

Slide 31

Declare the categorical variables - 3

We could change the dummy-coding to a different scheme by choosing another method from the drop-down menu, and clicking on the Change button.

However, we will use indicator dummy-coding for our logistic regression problems, so that we are comparing the difference in odds between two specific groups, rather than comparing one group to the average odds for all other groups. I think “average odds” complicates the interpretation.

Slide 32

Declare the categorical variables - 4

We will also accept the default of using the last valid category as the reference category for each variable (we do not use higher numbered missing values as a reference category).

Note that sex is a dichotomous variable, and does not require dummy-coding. I prefer to dummy-code it anyhow so that my interpretation is consistently based on the difference between categories coded 0 and 1. I do not need to alter my interpretation if two different numbers were used for the original coding.

Click on the Continue button to close the dialog box.

Slide 33

Specifying the method for including variables

SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

SPSS also supports the specification of "Blocks" of variables for testing hierarchical models.

Since the problem calls for a standard binary logistic regression, we accept the default Enter method for including variables.

Slide 34

Adding outliers to the data set - 1

Click on the Save… button to request the statistics that we want to save.

SPSS will calculate the values for standardized residuals and save them to the data set so that we can check for outliers and remove the outliers easily if we need to run a model excluding outliers.

Slide 35

Adding outliers to the data set - 2

First, mark the checkbox for Standardized residuals in the Residuals panel.

Second, click on the Continue button to complete the specifications.

Slide 36

Requesting the output

While optional statistical output is available, we do not need to request any optional statistics.

Click on the OK button to request the output.

Slide 37

Detecting the presence of outliers - 1

SPSS created a new variable, ZRE_1, which contains the standardized residual. If SPSS finds that the data set already contains a ZRE_1 variable, it will create ZRE_2.

I find it easier to delete the ZRE_1 variable after each analysis rather than have multiple ZRE_ variables in the data set, requiring that I remember which one goes with which analysis.

Slide 38

Detecting the presence of outliers - 2

Click the right mouse button on the column header and select Sort Ascending from the pop-up menu.

To detect outliers, we will sort the ZRE_1 column twice:

• first, in ascending order to identify outliers with a standardized residual of -2.58 or less.

• second, in descending order to identify outliers with a standardized residual of +2.58 or greater.

Slide 39

Detecting the presence of outliers - 3

After scrolling down past the cases with missing data (. in the ZRE_1 column), we see that we have five outliers that have standardized residuals of -2.58 or less.

Slide 40

Detecting the presence of outliers - 4

To check for outliers with large positive standardized residuals, click the right mouse button on the column header and select Sort Descending from the pop-up menu.

Slide 41

Detecting the presence of outliers - 5

After scrolling up to the top of the data set, we see that there are no outliers that have standardized residuals of +2.58 or more.

Since we found outliers, we will run the model excluding them and compare accuracy rates to determine which one we will interpret.

Had there been no outliers, we would move on to the issue of sample size.

Slide 42

Running the model excluding outliers - 1

We will use a Select Cases command to exclude the outliers from the analysis.

Slide 43

Running the model excluding outliers - 2

First, in the Select Cases dialog box, mark the option button If condition is satisfied.

Second, click on the If button to specify the condition.

Slide 44

Running the model excluding outliers - 3

The formula specifies that we should include cases if the absolute value of the standardized residual (ZRE_1) is less than 2.58.

The abs() or absolute value function tells SPSS to ignore the sign of the value.

After typing in the formula, click on the Continue button to close the dialog box.

To eliminate the outliers, we request the cases that are not outliers be selected into the analysis.
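The selection condition can be mimicked outside SPSS with a simple filter. The sketch below assumes the condition is "absolute value of ZRE_1 less than 2.58" and that cases with a missing ZRE_1 are dropped as well; the case records are hypothetical, and None stands in for SPSS's missing value.

```python
def select_non_outliers(cases, cutoff=2.58):
    """Keep cases whose standardized residual is present and within the cutoff."""
    return [case for case in cases
            if case["ZRE_1"] is not None and abs(case["ZRE_1"]) < cutoff]

# Hypothetical cases; None marks a case with missing data.
cases = [
    {"id": 1, "ZRE_1": 0.42},
    {"id": 2, "ZRE_1": -3.10},
    {"id": 3, "ZRE_1": None},
    {"id": 4, "ZRE_1": 1.97},
]
print([case["id"] for case in select_non_outliers(cases)])
```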

Slide 45

Running the model excluding outliers - 4

SPSS displays the condition we entered on the Select Cases dialog box.

Click on the OK button to close the dialog box.

Slide 46

Running the model excluding outliers - 5

SPSS indicates which cases are excluded by drawing a slash across the case number.

Scrolling down in the data, we see that the outliers and cases with missing values are excluded.

Slide 47

Running the model excluding outliers - 6

To run the logistic regression excluding outliers, select Logistic Regression from the Dialog Recall menu.

Slide 48

Running the model excluding outliers - 7

Click on the Save button to open the dialog box.

The only change we will make is to clear the check box for saving standardized residuals.

Slide 49

Running the model excluding outliers - 8

First, clear the check box for Standardized residuals.

Second, click on the Continue button to close the dialog box.

Slide 50

Running the model excluding outliers - 9

Finally, click on the OK button to request the output.

Slide 51

Accuracy rate of the baseline model including all cases

Navigate to the Classification Table for the logistic regression with all cases. To distinguish the two models, I often refer to the first one as the baseline model.

The accuracy rate for the model with all cases is 75.1%.

Slide 52

Accuracy rate of the revised model excluding outliers

Navigate to the Classification Table for the logistic regression excluding outliers. To distinguish the two models, I often refer to the second one as the revised model.

The accuracy rate for the model excluding outliers is 78.0%.

Slide 53

Marking the statement for excluding outliers

In the initial logistic regression model, 5 cases had a standardized residual of +2.58 or greater or -2.58 or lower:

- Case 20000032 had a standardized residual of -3.59
- Case 20000178 had a standardized residual of -5.83
- Case 20001092 had a standardized residual of -2.90
- Case 20001544 had a standardized residual of -4.16
- Case 20002344 had a standardized residual of -3.78

Since the classification accuracy of the model that excluded outliers (78.0%) was greater by 2% or more than the classification accuracy for the model that included all cases (75.1%), we mark the check box for the statement.

All of the remaining statements will be evaluated based on the output for the model that excludes outliers.
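
The comparison between the two models can be written as a small decision rule. This is a sketch under one assumption: that "2% or more" means a gain of at least 2 percentage points in classification accuracy. The function name choose_model is hypothetical.

```python
def choose_model(baseline_accuracy, revised_accuracy, min_gain=2.0):
    """Pick the model to interpret, given accuracy rates in percent.

    Returns "revised" when excluding outliers raises accuracy by at least
    min_gain percentage points, otherwise "baseline".
    """
    gain = revised_accuracy - baseline_accuracy
    return "revised" if gain >= min_gain else "baseline"


# Accuracy rates reported in the slides: 75.1% (all cases), 78.0% (no outliers).
print(choose_model(75.1, 78.0))  # revised
```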

Slide 54

The statement about multicollinearity and other numerical problems

Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

Slide 55

Checking for multicollinearity

The standard errors for the variables included in the analysis were: the standard error for "highest year of school completed" [educ] was .11, the standard error for survey respondents who said that their health was poor was 1.44, the standard error for survey respondents who said that their health was fair was .62, the standard error for survey respondents who said that their health was good was .53, the standard error for "socioeconomic index" [sei] was .02 and the standard error for survey respondents who were male was .45.

Slide 56

Marking the statement about multicollinearity and other numerical problems

Since none of the independent variables in this analysis had a standard error larger than 2.0, we mark the check box to indicate there was no evidence of multicollinearity.
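
The standard error check amounts to scanning the reported values for anything above 2.0. The sketch below uses the standard errors listed for this analysis; the dictionary keys and the function name flag_numerical_problems are labels chosen for illustration.

```python
# Standard errors of the b coefficients as reported in the slides.
standard_errors = {
    "educ": 0.11,
    "health: poor": 1.44,
    "health: fair": 0.62,
    "health: good": 0.53,
    "sei": 0.02,
    "sex: male": 0.45,
}


def flag_numerical_problems(ses, threshold=2.0):
    """Return the names of predictors whose standard error exceeds the threshold."""
    return [name for name, se in ses.items() if se > threshold]


flagged = flag_numerical_problems(standard_errors)
print(flagged)  # []
```

An empty list corresponds to marking the check box for no evidence of multicollinearity.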

Slide 57

The statement about sample size

Hosmer and Lemeshow, who wrote the widely used text on logistic regression, suggest that the sample size should be at least 10 cases for every independent variable.

Slide 58

The output for sample size

The 164 cases available for the analysis satisfied the minimum sample size of 60 (10 cases for each of the 6 independent variables) recommended by Hosmer and Lemeshow for logistic regression.

We find the number of cases included in the analysis in the Case Processing Summary.
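
The sample size rule is simple arithmetic, sketched here with the numbers from this analysis. The count of 6 independent variables is taken to be the six terms in the model (education, three health dummy variables, socioeconomic index, and sex); the function name is hypothetical.

```python
def required_sample_size(n_predictors, cases_per_predictor=10):
    """Minimum number of cases under the 10-cases-per-predictor guideline."""
    return n_predictors * cases_per_predictor


n_predictors = 6  # educ, three health dummies, sei, sex
n_cases = 164     # from the Case Processing Summary
required = required_sample_size(n_predictors)
print(required, n_cases >= required)  # 60 True
```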

Slide 59

Marking the statement for sample size

Since we satisfy the sample size requirement, we mark the check box.

Slide 60

The overall relationship between the dependent and independent variables

The existence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the model chi-square for the model that includes all of the independent variables.

Slide 61

The output for the overall relationship

In this analysis, the test of the full model versus a model with intercept only was statistically significant, χ²(6, N = 164) = 88.44, p < .001. The null hypothesis that there is no difference between the model with only a constant and the model with independent variables was rejected.

The existence of a relationship between the independent variables and the dependent variable was supported.
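
To see why a model chi-square of 88.44 with 6 degrees of freedom gives p < .001, the upper-tail probability can be computed directly. The sketch below uses a closed form that holds only for even degrees of freedom, which happens to cover this case; it is not how SPSS computes the value, and the function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square statistic for even df.

    Uses the identity P(X > x) = exp(-x/2) * sum_{i<df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(88.44, 6)
print(p_value < 0.001)  # True
```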

Slide 62

Marking the statement for overall relationship

Since the overall relationship was statistically significant, we mark the check box.

Slide 63

The statement about the relationship between education and computer use

Having satisfied the criteria for an overall relationship, we examine the findings for individual relationships with the dependent variable. If the overall relationship were not significant, we would not interpret the individual relationships.

The first statement concerns the relationship between education and computer use.

Slide 64

Output for the relationship between education and computer use

The probability of the Wald statistic for the independent variable "highest year of school completed" [educ] (χ²(1, N = 164) = 11.49, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for "highest year of school completed" [educ] was equal to zero was rejected. The value of Exp(B) for the variable "highest year of school completed" [educ] was 1.43 which implies an increase in the odds of 43.2% (1.43 - 1.0 = .43). The statement that 'For each unit increase in "highest year of school completed", survey respondents were 43.2% more likely to use a computer' is correct.
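
The conversion from Exp(B) to a percentage change in the odds is the subtraction shown in parentheses, scaled to percent. A minimal sketch follows; the function name is hypothetical. With the rounded Exp(B) of 1.43 it gives 43.0%, so the 43.2% in the statement presumably reflects an Exp(B) carried to more decimal places.

```python
def percent_change_in_odds(exp_b):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (exp_b - 1.0) * 100.0


print(round(percent_change_in_odds(1.43), 1))  # 43.0
print(percent_change_in_odds(0.5))             # -50.0
```

Values of Exp(B) below 1.0 produce a negative result, which is read as a decrease in the odds.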

Slide 65

Marking the statement for the relationship between education and computer use

Since the relationship was statistically significant, and the odds ratio was correctly interpreted as an increase of 43.2%, we mark the check box for the statement.

Slide 66

The statement for the relationship between poor health and computer use

The next statement concerns the relationship between the dummy-coded variable for poor health and computer use.

Slide 67

Output for the relationship between poor health and computer

The probability of the Wald statistic for the independent variable survey respondents who said that their health was poor (χ²(1, N = 164) = 8.20, p = .004) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that their health was poor was equal to zero was rejected. The value of Exp(B) for the variable survey respondents who said that their health was poor was .016 which implies a decrease in the odds of 98.4% (.016 - 1.000 = -.984). The statement that 'Survey respondents who said that their health was poor were 98.4% less likely to use a computer compared to those who said that their health was excellent' is correct.

Slide 68

Marking the statement for the relationship between poor health and computer use

Since the relationship was statistically significant, and the odds ratio was correctly interpreted as a decrease of 98.4% compared to the reference group in excellent health, we mark the check box for the statement.

Slide 69

The statement for the relationship between fair health and computer use

The next statement concerns the relationship between the dummy-coded variable for fair health and computer use.

Slide 70

Output for the relationship between fair health and computer use

The probability of the Wald statistic for the independent variable survey respondents who said that their health was fair (χ²(1, N = 164) = 6.60, p = .010) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that their health was fair was equal to zero was rejected. The value of Exp(B) for the variable survey respondents who said that their health was fair was .204 which implies a decrease in the odds of 79.6% (.204 - 1.000 = -.796). The statement that 'Survey respondents who said that their health was fair were 79.6% less likely to use a computer compared to those who said that their health was excellent' is correct.
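
The p-value attached to a Wald statistic with 1 degree of freedom can be reproduced from the statistic itself. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal; the function name is hypothetical and the calculation is an illustration, not SPSS output.

```python
import math


def wald_p_value(wald_chi_square):
    """Upper-tail p-value for a Wald chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(wald_chi_square / 2.0))


ALPHA = 0.05
p = wald_p_value(6.60)  # Wald statistic reported for fair health
print(round(p, 3), p <= ALPHA)  # 0.01 True
```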

Slide 71

Marking the statement for the relationship between fair health and computer use

Since the relationship was statistically significant, and the odds ratio was correctly interpreted as a decrease of 79.6% compared to the reference group in excellent health, we mark the check box for the statement.

Slide 72

The statement for the relationship between good health and computer use

The next statement concerns the relationship between the dummy-coded variable for good health and computer use.

Slide 73

Output for the relationship between good health and computer use

The probability of the Wald statistic for the independent variable survey respondents who said that their health was good (χ²(1, N = 164) = 1.53, p = .216) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that their health was good was equal to zero was not rejected. Reporting good health does not have a statistically significant impact on the odds that survey respondents use a computer. The analysis does not support the statement that 'Survey respondents who said that their health was good were 48.4% less likely to use a computer compared to those who said that their health was excellent'.

Slide 74

Marking the statement for the relationship between good health and computer use

Since the relationship was not statistically significant, the check box is not marked.

Slide 75

The statement for relationship between socioeconomic index and computer use

The next statements concern the relationship between socioeconomic index and computer use. We are offered two alternative interpretations of the direction of the relationship. If the relationship is not statistically significant, neither will be correct.

Slide 76

Output for the relationship between socioeconomic index and computer use

The probability of the Wald statistic for the independent variable "socioeconomic index" [sei] (χ²(1, N = 164) = 16.93, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for "socioeconomic index" [sei] was equal to zero was rejected. The value of Exp(B) for the variable "socioeconomic index" [sei] was 1.070 which implies an increase in the odds of 7.0% (1.070 - 1.000 = .070). The statement that 'For each unit increase in "socioeconomic index", survey respondents were 7.0% more likely to use a computer' is correct.

Slide 77

Marking the relationship between socioeconomic index and computer use

Since the relationship was statistically significant and the odds ratio indicated an increase of 7.0%, the first statement is marked and the second is left blank.

Slide 78

The statement for the relationship between sex and computer use

The next statement concerns the relationship between sex and computer use.

Slide 79

Output for the relationship between sex and computer use

The probability of the Wald statistic for the independent variable survey respondents who were male (χ²(1, N = 164) = 2.10, p = .148) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who were male was equal to zero was not rejected. Being male does not have a statistically significant impact on the odds that survey respondents use a computer. The analysis does not support the statement that 'Survey respondents who were male were 47.8% less likely to use a computer compared to those who were female'.

Slide 80

Marking the statement for the relationship between sex and computer use

Since the relationship was not statistically significant, the check box is not marked.

Slide 81

Statement about the usefulness of the model based on classification accuracy

The final statement concerns the usefulness of the logistic regression model. The independent variables could be characterized as useful predictors distinguishing survey respondents who use a computer from survey respondents who do not use a computer if the classification accuracy rate were substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.

Slide 82

Computing proportional by-chance accuracy rate

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (.396² + .604² = .521).

The proportion in the largest group is 60.4% or .604. The proportion in the other group is 1.0 – 0.604 = .396.

At Block 0 with no independent variables in the model, all of the cases are predicted to be members of the modal group, 1=Yes in this example.
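
The proportional by chance accuracy rate is the sum of the squared group proportions. A minimal sketch with the proportions given above follows; the function name is hypothetical. With the rounded proportions the sum comes to about .5216, so the .521 reported on the slide may reflect a slightly different rounding.

```python
def proportional_by_chance_accuracy(proportions):
    """Sum of squared group proportions (the proportions should add up to 1)."""
    return sum(p ** 2 for p in proportions)


largest = 0.604        # proportion in the largest group
other = 1.0 - largest  # proportion in the other group
rate = proportional_by_chance_accuracy([other, largest])
print(round(rate, 4))  # 0.5216
```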

Slide 83

Output for the usefulness of the model based on classification accuracy

To be characterized as a useful model, the accuracy rate should be at least 25% higher than the by chance accuracy rate.

The by chance accuracy criterion is computed by multiplying the by chance accuracy rate of .521 by 1.25: 1.25 x .521 = .652 (65.2%).

The classification accuracy rate computed by SPSS was 78.0%, which was greater than or equal to the proportional by chance accuracy criterion of 65.2% (1.25 x 52.1% = 65.2%).

The criterion for classification accuracy is satisfied.
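
The usefulness check can be sketched in a few lines. The function name is_useful_model is hypothetical, and the by chance rate is recomputed here from the group proportions .396 and .604 rather than from the rounded .521, which is an assumption about how the 65.2% criterion was obtained.

```python
def is_useful_model(accuracy, by_chance_rate, margin=1.25):
    """True when the accuracy rate is at least margin times the by chance rate."""
    return accuracy >= margin * by_chance_rate


by_chance = 0.396 ** 2 + 0.604 ** 2  # proportional by chance accuracy rate
criterion = 1.25 * by_chance
print(round(criterion, 3))                # 0.652
print(is_useful_model(0.780, by_chance))  # True
```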

Slide 84

Marking the statement for usefulness of the model

Since the criterion for classification accuracy was satisfied, the check box is marked.

Slide 85

Standard Binary Logistic Regression: Level of Measurement

Level of measurement ok?

- No: Do not mark check box for level of measurement. Mark: Inappropriate application of the statistic. Stop.
- Yes: Mark check box for level of measurement.

Ordinal level variable treated as metric?

- Yes: Consider limitation in discussion of findings.
- No: Continue.

Slide 86

Standard Binary Logistic Regression: Exclude Outliers

Run Baseline Binary Logistic Regression, Including All Cases, Requesting Standardized Residuals

Run Revised Binary Logistic Regression, Excluding Outliers (standardized Residuals >= 2.58)

Accuracy rate for revised model >= accuracy rate for baseline model + 2%?

- Yes: Interpret revised model. Mark check box for excluding outliers.
- No: Interpret baseline model. Do not mark check box for excluding outliers.

Slide 87

Standard Binary Logistic Regression: Multicollinearity and Sample Size

Multicollinearity/Numerical Problems (S. E. > 2.0)?

- Yes: Do not mark check box for no multicollinearity. Stop.
- No: Mark check box for no multicollinearity.

Adequate Sample Size (Number of IV’s x 10)?

- Yes: Mark check box for sample size.
- No: Do not mark check box for sample size. Consider limitation in discussion of findings.

Slide 88

Standard Binary Logistic Regression: Overall Relationship

Probability of Model Chi-square ≤ α?

- Yes: Mark check box for overall relationship.
- No: Do not mark check box for overall relationship. Stop.

Slide 89

Standard Binary Logistic Regression: Individual Relationships

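Individual relationship (Wald Sig ≤ α)?

- No: Do not mark check box for individual relationship.
- Yes: Check the interpretation of the relationship.

Correct interpretation of direction and strength of relationship?

- Yes: Mark check box for individual relationship.
- No: Do not mark check box for individual relationship.

Additional individual relationships to interpret?

- Yes: Repeat the steps above for the next relationship.
- No: Continue.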

Slide 90

Standard Binary Logistic Regression: Classification Accuracy

Classification accuracy > 1.25 x by chance accuracy rate?

- Yes: Mark check box for classification accuracy.
- No: Do not mark check box for classification accuracy.

