multinomial logistic regression: detecting outliers and validating analysis outliers split-sample...

Post on 21-Dec-2015

TRANSCRIPT

  • Slide 1
  • Multinomial Logistic Regression: Detecting Outliers and Validating Analysis Outliers Split-sample Validation
  • Slide 2
  • Outliers Multinomial logistic regression in SPSS does not compute any diagnostic statistics. In their absence, SPSS recommends using the Logistic Regression procedure to calculate and examine diagnostic measures. A multinomial logistic regression for three groups compares group 1 to group 3 and group 2 to group 3. To test for outliers, we will run two binary logistic regressions, using case selection to compare group 1 to group 3 and group 2 to group 3. From both of these analyses we will identify a list of cases with studentized residuals greater than 2.0, and test the multinomial solution without these cases. If the accuracy rate of the model without these cases is less than 2% higher than the accuracy rate of the model with all cases, we will interpret the model that includes all cases.
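The decision rule on this slide can be sketched in a few lines of Python. This is only an illustration, not part of the SPSS procedure: the function name `model_to_interpret` and the `min_gain` parameter are hypothetical, and accuracy rates are assumed to be expressed in percent.

```python
def model_to_interpret(acc_all_cases, acc_without_outliers, min_gain=2.0):
    """Pick the model to interpret from two accuracy rates given in percent."""
    # Gain in classification accuracy from dropping the flagged outliers.
    gain = acc_without_outliers - acc_all_cases
    # The model without outliers is preferred only if it gains at least
    # min_gain percentage points; otherwise the full-data model is kept.
    if gain >= min_gain:
        return "without outliers"
    return "all cases"


# Hypothetical accuracy rates, for illustration only.
print(model_to_interpret(63.0, 64.0))  # gain of 1 point
print(model_to_interpret(63.0, 66.0))  # gain of 3 points
```

With a gain of 1 point the rule keeps the model with all cases; with a gain of 3 points it switches to the model without outliers.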
  • Slide 3
  • Example To demonstrate the process for detecting outliers, we will examine the relationship between the independent variables "age" [age], "highest year of school completed" [educ] and "confidence in banks and financial institutions" [confinan] and the dependent variable "opinion about spending on social security" [natsoc]. Opinion about spending on social security contains three categories: 1 = too little, 2 = about right, 3 = too much. With all cases, including those that might be identified as outliers, the accuracy rate was 63.7%. We note this to compare with the classification accuracy after removing outliers to determine which model we will interpret.
  • Slide 4
  • Request multinomial logistic regression for baseline model Select the Regression | Multinomial Logistic command from the Analyze menu.
  • Slide 5
  • Selecting the dependent variable First, highlight the dependent variable natsoc in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.
  • Slide 6
  • Selecting metric independent variables Move the metric independent variables, age, educ and confinan to the Covariate(s) list box. Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal.
  • Slide 7
  • Specifying statistics to include in the output While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table. Click on the Statistics button to make a request.
  • Slide 8
  • Requesting the classification table First, keep the SPSS defaults for Model and Parameters. Second, mark the checkbox for the Classification table. Third, click on the Continue button to complete the request.
  • Slide 9
  • Completing the multinomial logistic regression request Click on the OK button to request the output for the multinomial logistic regression. The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.
  • Slide 10
  • Classification accuracy for all cases With all cases, including those that might be identified as outliers, the accuracy rate was 63.7%. We will compare the classification accuracy of the model with all cases to the classification accuracy of the model excluding outliers.
  • Slide 11
  • Outliers for the comparison of groups 1 and 3 Since multinomial logistic regression does not identify outliers, we will use binary logistic regressions to identify them. Choose the Select Cases command from the Data menu to include only groups 1 and 3 in the analysis.
  • Slide 12
  • Selecting groups 1 and 3 First, mark the If condition is satisfied option button. Second, click on the IF button to specify the condition.
  • Slide 13
  • Formula for selecting groups 1 and 3 To include only groups 1 and 3 in the analysis, we enter the formula to include cases that had a value of 1 for natsoc or a value of 3 for natsoc. After completing the formula, click on the Continue button to close the dialog box.
  • Slide 14
  • Completing the selection of groups 1 and 3 To activate the selection, click on the OK button.
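The selection condition described above (natsoc equal to 1 or natsoc equal to 3) can be illustrated with a small Python sketch. The function name `select_groups` and the list-of-dictionaries layout are assumptions made for the example, and the case values are invented.

```python
def select_groups(cases, groups):
    """Keep only the cases whose natsoc value is one of the requested groups."""
    return [case for case in cases if case["natsoc"] in groups]


# Hypothetical cases, for illustration only; None stands for a missing value.
cases = [
    {"id": 1, "natsoc": 1},
    {"id": 2, "natsoc": 2},
    {"id": 3, "natsoc": 3},
    {"id": 4, "natsoc": None},
]
subset = select_groups(cases, {1, 3})
print([case["id"] for case in subset])
```

Cases in group 2 and cases with a missing natsoc value are left out, so only groups 1 and 3 enter the binary comparison.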
  • Slide 15
  • Binary logistic regression comparing groups 1 and 3 Select the Regression | Binary Logistic command from the Analyze menu.
  • Slide 16
  • Dependent and independent variables for the comparison of groups 1 and 3 First, move the dependent variable natsoc to the Dependent variable text box. Second, move the independent variables age, educ, and confinan to the Covariates list box. Third, click on the Save button to request the inclusion of studentized residuals in the data set.
  • Slide 17
  • Including studentized residuals in the comparison of groups 1 and 3 First, mark the checkbox for Studentized residuals in the Residuals panel. Second, click on the Continue button to complete the specifications.
  • Slide 18
  • Outliers for the comparison of groups 1 and 3 Click on the OK button to request the output for the logistic regression.
  • Slide 19
  • Locating the case ids for outliers for groups 1 and 3 In order to exclude outliers from the multinomial logistic regression, we must identify their case ids. Choose the Select Cases command from the Data menu to identify cases that are outliers.
  • Slide 20
  • Replace the selection criteria To replace the formula that selected cases in groups 1 and 3 for the dependent variable, click on the IF button.
  • Slide 21
  • Formula for identifying outliers Type in the formula for including outliers. Note that we are including outliers because we want to identify them. This is different from previous procedures, where we included only the cases that were not outliers in the analysis. Click on the Continue button to close the dialog box.
  • Slide 22
  • Completing the selection of outliers To activate the selection, click on the OK button.
  • Slide 23
  • Locating the outliers in the data editor We used Select Cases to specify a criterion for including cases that were outliers. Select Cases will assign a 1 (true) to the filter_$ variable if a case satisfies the criterion. To locate the cases that have a filter_$ value of 1, we can sort the data set in descending order of the values for the filter variable. Click on the column header for filter_$ and select Sort Descending from the drop-down menu.
  • Slide 24
  • The outliers in the data editor At the top of the sorted column for filter_$, we see four 1s indicating that 4 cases met the criteria for being considered an outlier.
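The flagging step can be illustrated outside SPSS with a short Python sketch. Several things here are assumptions: the function name `outlier_ids`, the dictionary layout (case id mapped to its saved studentized residual), and the use of the absolute value so that large negative residuals are flagged as well. The residual values are made up for illustration.

```python
def outlier_ids(studentized_residuals, cutoff=2.0):
    """Return the sorted ids of cases whose |studentized residual| exceeds cutoff."""
    return sorted(
        case_id
        for case_id, sre in studentized_residuals.items()
        # Cases outside the selected groups have no residual (None) and are skipped.
        if sre is not None and abs(sre) > cutoff
    )


# Hypothetical case ids and residuals, for illustration only.
residuals = {11: 0.4, 12: -2.6, 13: 2.0, 14: 3.1, 15: None}
print(outlier_ids(residuals))
```

Returning the ids directly plays the role of sorting on filter_$ in the data editor: the flagged cases are collected in one place so they can be excluded later.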
  • Slide 25
  • Outliers for the comparison of groups 2 and 3 The process for identifying outliers is repeated for the other comparison done by the multinomial logistic regression, group 2 versus group 3. Since multinomial logistic regression does not identify outliers, we will again use binary logistic regression to identify them. Choose the Select Cases command from the Data menu to include only groups 2 and 3 in the analysis.
  • Slide 26
  • Selecting groups 2 and 3 First, mark the If condition is satisfied option button. Second, click on the IF button to change the condition.
  • Slide 27
  • Formula for selecting groups 2 and 3 To include only groups 2 and 3 in the analysis, we enter the formula to include cases that had a value of 2 for natsoc or a value of 3 for natsoc. After completing the formula, click on the Continue button to close the dialog box.
  • Slide 28
  • Completing the selection of groups 2 and 3 To activate the selection, click on the OK button.
  • Slide 29
  • Binary logistic regression comparing groups 2 and 3 Select the Regression | Binary Logistic command from the Analyze menu.
  • Slide 30
  • Outliers for the comparison of groups 2 and 3 Click on the OK button to request the output for the logistic regression. The specifications for the analysis are the same as the ones we used for detecting outliers for groups 1 and 3.
  • Slide 31
  • Locating the case ids for outliers for groups 2 and 3 In order to exclude outliers from the multinomial logistic regression, we must identify their case ids. Choose the Select Cases command from the Data menu to identify cases that are outliers.
  • Slide 32
  • Replace the selection criteria To replace the formula that selected cases in groups 2 and 3 for the dependent variable, click on the IF button.
  • Slide 33
  • Formula for identifying outliers Type in the formula for including outliers. Note that we use the second version of the studentized residual, sre_2. Click on the Continue button to close the dialog box.
  • Slide 34
  • Completing the selection of outliers To activate the selection, click on the OK button.
  • Slide 35
  • Locating the outliers in the data editor We used Select Cases to specify a criterion for including cases that were outliers. Select Cases will assign a 1 (true) to the filter_$ variable if a case satisfies the criterion. To locate the cases that have a filter_$ value of 1, we can sort the data set in descending order of the values for the filter variable. Click on the column header for filter_$ and select Sort Descending from the drop-down menu.
  • Slide 36
  • The outliers in the data editor At the top of the sorted column for filter_$, we see that we have two outliers. These two outliers were among the outliers identified in the analysis of groups 1 and 3.
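Because the two binary comparisons can flag the same case, the list of cases to exclude from the multinomial model is the union of the two outlier lists. The Python sketch below shows this with made-up case ids (the slides do not list the actual ids); the function name `cases_to_exclude` is hypothetical.

```python
def cases_to_exclude(*outlier_sets):
    """Combine outlier ids from several binary comparisons without duplicates."""
    combined = set()
    for ids in outlier_sets:
        combined |= set(ids)
    return sorted(combined)


# Hypothetical ids: four outliers for groups 1 vs 3, two for groups 2 vs 3,
# with both of the latter already present in the first list.
outliers_1_vs_3 = [101, 102, 103, 104]
outliers_2_vs_3 = [102, 104]
print(cases_to_exclude(outliers_1_vs_3, outliers_2_vs_3))
```

With this overlap, the combined list still holds four cases, which matches the situation described on the slide.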
  • Slide 37
  • Group classification Each case is predicted to be a member of the group to which it has the highest probability of belonging. We can accomplish this using "IF" statements in SPSS: IF (P1 > P2 AND P1 > P3) PREDGRP = 1. IF (P2 > P1 AND P2 > P3) PREDGRP = 2. IF (P3 > P1 AND P3 > P2) PREDGRP = 3. EXECUTE.
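The same assignment rule can be written as a small Python function, shown here only as an illustration of the logic. The function name `predicted_group` is hypothetical. Like the IF statements, it uses strict comparisons, so a case whose highest probability is tied between groups is left unassigned (returned as `None`).

```python
def predicted_group(p1, p2, p3):
    """Return the group (1, 2 or 3) with the strictly highest probability."""
    if p1 > p2 and p1 > p3:
        return 1
    if p2 > p1 and p2 > p3:
        return 2
    if p3 > p1 and p3 > p2:
        return 3
    # No single highest probability: none of the IF conditions holds,
    # so the predicted group stays unassigned.
    return None


# Hypothetical probabilities, for illustration only.
print(predicted_group(0.5, 0.3, 0.2))
print(predicted_group(0.4, 0.4, 0.2))
```

The second call shows the tie case: neither group 1 nor group 2 is strictly larger than the other, so no group is assigned.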
  • Slide 79
  • Selecting the holdout sample - 1 To select the cases that we will use to compute classification accuracy for the holdout group, we will use the Select Cases command again. Our calculations predicted group membership for all cases in the data set, including the training sample. To compute the classification accuracy for the holdout sample, we will have to explicitly include only the holdout sample in the calculations.
  • Slide 80
  • Selecting the holdout sample - 2 First, mark the If condition is satisfied option button. Second, click on the IF button to specify the condition.
  • Slide 81
  • Selecting the holdout sample - 3 To include the cases in the 25% holdout sample, we enter the criterion: "split = 0". After completing the formula, click on the Continue button to close the dialog box.
  • Slide 82
  • Selecting the holdout sample - 4 To activate the selection, click on the OK button.
  • Slide 83
  • The crosstabs classification accuracy table - 1 The classification accuracy table is a table of predicted group membership versus actual group membership. SPSS can create it as a cross-tabulated table. Select the Descriptive Statistics | Crosstabs command from the Analyze menu.
  • Slide 84
  • The crosstabs classification accuracy table - 2 To mimic the appearance of classification tables in SPSS, we will put the original variable, natsoc, in the rows of the table and the predicted group variable, predgrp, in the columns. After specifying the row and column variables, we click on the Cells button to request percentages.
  • Slide 85
  • The crosstabs classification accuracy table - 3 The classification accuracy rate will be the sum of the total percentages on the main diagonal. First, to obtain these percentages, mark the check box for Total on the Percentages panel. Second, click on the Continue button to close the dialog box.
  • Slide 86
  • The crosstabs classification accuracy table - 4 To complete the request for the cross-tabulated table, click on the OK button.
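To make the structure of the cross-tabulated table concrete, the following Python sketch builds a table of total percentages from actual and predicted groups and sums the main diagonal. It is a minimal illustration, not the SPSS Crosstabs procedure: the function names and the eight data values are invented for the example.

```python
from collections import Counter


def total_percent_crosstab(actual, predicted):
    """Map each (actual, predicted) cell to its percentage of all cases."""
    counts = Counter(zip(actual, predicted))
    n = sum(counts.values())
    return {cell: 100.0 * k / n for cell, k in counts.items()}


def diagonal_accuracy(table):
    """Sum the total percentages of cells where actual equals predicted."""
    return sum(pct for (a, p), pct in table.items() if a == p)


# Hypothetical holdout cases, for illustration only.
actual = [1, 1, 2, 3, 1, 2, 1, 2]
predicted = [1, 1, 1, 1, 1, 2, 2, 2]
table = total_percent_crosstab(actual, predicted)
print(diagonal_accuracy(table))
```

Here three of the eight cases fall in cell (1, 1) and two in cell (2, 2), so the diagonal sums to 37.5% + 25.0% = 62.5%.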
  • Slide 87
  • The crosstabs classification accuracy table - 5 The classification accuracy rate will be the sum of the total percentages on the main diagonal: 51.2% + 12.2% = 63.4%. The criterion for supporting the classification accuracy of the model is an accuracy rate for the holdout sample that shows no more than 2% shrinkage from the accuracy rate for the training sample. The accuracy rate for the training sample was 66.2%. The shrinkage was 66.2% - 63.4% = 2.8%, which exceeds the 2% limit, so the accuracy rate for the holdout sample does not satisfy the requirement. The classification accuracy for the analysis of the full data set was not supported.
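The shrinkage check on this slide reduces to one subtraction and one comparison. The Python sketch below reproduces it with the two accuracy rates reported above. The function name `validation_supported` and the `max_shrinkage` parameter are hypothetical, and rounding to one decimal place is an assumption made to avoid floating-point noise.

```python
def validation_supported(train_acc, holdout_acc, max_shrinkage=2.0):
    """Return (shrinkage, supported) for accuracy rates given in percent."""
    # Round to one decimal place so that 66.2 - 63.4 compares as 2.8.
    shrinkage = round(train_acc - holdout_acc, 1)
    return shrinkage, shrinkage <= max_shrinkage


shrinkage, supported = validation_supported(66.2, 63.4)
print(shrinkage, supported)  # 2.8 False
```

A shrinkage of 2.8 percentage points is above the 2-point limit, so the check returns False, in line with the conclusion on the slide.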