Multinomial Logistic Regression: A Problem in Personnel Classification

In our first class on discriminant analysis, we studied Rulon's problem in personnel classification as an example of discriminant analysis with a three-group categorical variable.

We found that there were two statistically significant discriminant functions distinguishing the three groups (mechanics, passenger agents, and operations control). The three predictor variables (outdoor activity score, convivial score, and conservative score) all had a statistically significant relationship to membership in the dependent variable groups. The cross-validated accuracy rate for the discriminant model was 75.0%.

Our interpretation of the variables and functions concluded that:

The first discriminant function distinguishes Passenger Agents from Mechanics and Operations Control staff. The two variables that are important on the first function are convivial score and conservative score. Passenger agents had higher convivial scores and lower conservative scores than the other two groups.

Operations Control staff are distinguished from Mechanics by the second discriminant function, which contains only a single variable, the outdoor activity score. Mechanics had a higher average on the outdoor activity score than did Operations Control staff.

In sum, Passenger Agents are more outgoing (convivial) and more tolerant (less conservative) than Mechanics and Operations Control personnel. Mechanics differ from Operations Control personnel in their stronger preference for outdoor-oriented activities.


This problem is from Phillip J. Rulon, David V. Tiedeman, Maurice Tatsuoka, and Charles R. Langmuir. Multivariate Statistics for Personnel Classification. 1967.

This sample data is for "World Airlines, a company employing over 50,000 persons and operating scheduled flights. This company naturally needs many men who can be assigned to a particular set of functions. The mechanics on the line who service the equipment of World Airlines form one of the groups we shall consider. A second group are the agents who deal with the passengers of the airline. A third group are the men in operations who coordinate airline activities.

The personnel officer of World Airlines has developed an Activity Preference Inventory for the use of the airline. The first section of this inventory contains 30 pairs of activities, each pair naming an indoor activity and an outdoor activity. One item is

_____ Billiards : Golf _____

The applicant for a job in World Airlines checks the activity he prefers. The score is the number of outdoor activities marked." (page 24)

The second section of the Activity Preference Inventory "contains 35 items. One activity of each pair is a solitary activity, the other convivial. An example is

_____ Solitaire : Bridge _____

The apprentice's score is the number of convivial activities he prefers." (page 82)



The third section of the Activity Preference Inventory "contains 25 items. One activity of each pair is a liberal activity, the other a conservative activity. An example is

_____ Counseling : Advising _____

The apprentice's score is the number of conservative activities he prefers." (page 153)

The Activity Preference Inventory was administered to 244 employees in the three job classifications who were successful and satisfied with their jobs. The dependent variable, JOBCLASS 'Job Classification' included three job classifications: 1 - Passenger Agents, 2 - Mechanics, and 3 - Operations Control.

The purpose of the analysis is to develop a classification scheme based on scores on the Activity Preference Inventory to assign new employees to the different job groups. The data for this problem are in the file ActivityPreferenceInventory.Sav. We will re-analyze this problem with multinomial logistic regression.


Stage 1: Define the Research Problem

In this stage, the following issues are addressed:

• Relationship to be analyzed
• Specifying the dependent and independent variables
• Method for including independent variables


Relationship to be analyzed

We are interested in the relationship between scores on the three scales of the Activity Preference Inventory and the different job classifications.

Specifying the dependent and independent variables

The dependent variable is:
• JOBCLASS 'Job Classification'

coded as 1 = Passenger Agents, 2 = Mechanics, and 3 = Operations Control.

The independent variables are:
• OUTDOOR, 'Outdoor Activity Score'
• CONVIV, 'Convivial Score'
• CONSERV, 'Conservative Score'

Method for including independent variables

Direct entry of the independent variables is the only method for including variables available with the SPSS Multinomial Logistic Regression procedure.

Stage 2: Develop the Analysis Plan: Sample Size Issues


• Missing data analysis
• Minimum sample size requirement: 15-20 cases per independent variable


Missing data analysis

There is no missing data in this problem.

Minimum sample size requirement: 15-20 cases per independent variable

The data set contains 244 subjects and 3 independent variables. The ratio of 81 cases per independent variable (244 / 3) exceeds the minimum sample size requirement.

Stage 2: Develop the Analysis Plan: Measurement Issues


• Incorporating nonmetric data with dummy variables
• Representing curvilinear effects with polynomials
• Representing interaction or moderator effects


Incorporating Nonmetric Data with Dummy Variables

All of the variables are metric.

Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis.

Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.

Stage 3: Evaluate Underlying Assumptions


• Nonmetric dependent variable with more than two groups
• Metric or dummy-coded independent variables


Nonmetric dependent variable having more than two groups

The dependent variable JOBCLASS 'Job Classification' has three groups.

Metric or dummy-coded independent variables

All of the independent variables are metric.

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation

In this stage, the following issues are addressed as part of model estimation:

•Compute logistic regression model


Compute the logistic regression

The steps to obtain a logistic regression analysis are detailed on the following screens.

Requesting a Logistic Regression


Specifying the Dependent Variable


Specifying the Independent Variables


Specifying the Statistics to Include in the Output


Complete the Logistic Regression Request


Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit


• Significance test of the model log likelihood (Change in -2LL)
• Measures Analogous to R²: Cox and Snell R² and Nagelkerke R²
• Classification matrices as a measure of model accuracy
• Check for Numerical Problems
• Presence of outliers


Significance test of the model log likelihood

The Initial Log Likelihood Function (-2 Log Likelihood or -2LL) is a statistical measure like total sums of squares in regression. If our independent variables have a relationship to the dependent variable, we will improve our ability to predict the dependent variable accurately, and the log likelihood measure will decrease.

The initial log likelihood value (529.883) is a measure of a model with no independent variables, i.e. only the intercept as a constant. The final log likelihood value (280.240) is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value (249.643 = 529.883 - 280.240) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² value in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.

In this problem the model Chi-Square value of 249.643 has a significance < 0.0001, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
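The arithmetic behind the model chi-square can be checked outside SPSS. The following is a minimal Python sketch, not SPSS output: it takes the two -2LL values quoted above and assumes 6 degrees of freedom (3 predictors times 2 non-reference groups), which the text does not state explicitly. The helper name chi_square_sf_even_df is made up for this illustration and uses the closed-form upper tail that holds only for even degrees of freedom.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square variate (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    # For df = 2k: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


initial_neg2ll = 529.883  # intercept-only model, from the text
final_neg2ll = 280.240    # model with OUTDOOR, CONVIV, CONSERV, from the text

model_chi_square = initial_neg2ll - final_neg2ll

# Assumption: df = 3 predictors x 2 non-reference groups = 6.
df = 3 * 2
p_value = chi_square_sf_even_df(model_chi_square, df)

print(f"model chi-square = {model_chi_square:.3f}, df = {df}, p = {p_value:.3g}")
```

The p-value is far below 0.0001, which agrees with the significance reported for the model.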


Measures Analogous to R²

The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.

The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that had the range from 0 to 1. We will rely upon Nagelkerke's measure as indicating the strength of the relationship.

If we applied our interpretive criteria to the Nagelkerke R² (0.722 for the full model), we would characterize the relationship as very strong.
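As a rough illustration of how the two measures relate, the sketch below applies the standard Cox and Snell and Nagelkerke formulas to the -2LL values quoted earlier. It is an approximation under stated assumptions: it works only from the rounded figures printed in the text and treats them as full log likelihoods, so the result may differ slightly from the value SPSS prints.

```python
import math

n = 244                   # number of cases, from the text
initial_neg2ll = 529.883  # intercept-only -2LL, from the text
final_neg2ll = 280.240    # final-model -2LL, from the text

# Cox and Snell R^2 = 1 - exp(-(model chi-square) / n)
cox_snell = 1.0 - math.exp(-(initial_neg2ll - final_neg2ll) / n)

# The largest value Cox and Snell can reach for these data:
# 1 - exp(-(initial -2LL) / n), which is below 1.
cox_snell_max = 1.0 - math.exp(-initial_neg2ll / n)

# Nagelkerke rescales Cox and Snell so that the maximum becomes 1.
nagelkerke = cox_snell / cox_snell_max

print(f"Cox and Snell = {cox_snell:.3f}, maximum = {cox_snell_max:.3f}, "
      f"Nagelkerke = {nagelkerke:.3f}")
```

Under these assumptions the Nagelkerke value comes out at about 0.72, in line with the strength of relationship described above.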


The Classification Matrices as a Measure of Model Accuracy - 1

The classification matrices in logistic regression serve the same function as the classification matrices in discriminant analysis, i.e. evaluating the accuracy of the model.

If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted group membership and actual group membership are different, the model "misses" for that case. The overall percentage of accurate predictions (75.4% in this case) is the measure of a model that I rely on most heavily for this analysis as well as for discriminant analysis because it has a meaning that is readily communicated, i.e. the percentage of cases for which our model predicts accurately.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and the maximum by chance accuracy rate, if appropriate.

The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample, in this case (0.348 x 0.348) + (0.381 x 0.381) + (0.270 x 0.270) = 0.339. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 0.339 = 0.424. Our model accuracy rate of 75.4% exceeds this standard.

The Classification Matrices as a Measure of Model Accuracy - 2

The maximum chance criterion is the proportion of cases in the largest group, 38.1% in this problem. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 38.1% = 47.6%. Our model accuracy rate of 75.4% exceeds this standard.
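Both by-chance standards can be reproduced from the three group proportions used in the proportional chance calculation. This is a small Python sketch of that arithmetic; the variable names are illustrative, and the largest of the three proportions (0.381) is taken as the largest group.

```python
# Group proportions and model accuracy as quoted in the text.
proportions = [0.348, 0.381, 0.270]
model_accuracy = 0.754

# Proportional chance criterion: sum of squared group proportions.
proportional_chance = sum(p * p for p in proportions)

# Maximum chance criterion: proportion of cases in the largest group.
maximum_chance = max(proportions)

# The model is expected to be 25% better than each chance criterion.
proportional_standard = 1.25 * proportional_chance
maximum_standard = 1.25 * maximum_chance

print(f"proportional chance = {proportional_chance:.3f}, "
      f"standard = {proportional_standard:.3f}")
print(f"maximum chance = {maximum_chance:.3f}, standard = {maximum_standard:.3f}")
print("exceeds both standards:",
      model_accuracy > proportional_standard and model_accuracy > maximum_standard)
```

The model's 75.4% accuracy clears both standards by a wide margin.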


Check for Numerical Problems

There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages: multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and "complete separation," whereby the groups in the dependent variable can be perfectly separated by scores on one of the independent variables.

All of these problems produce large standard errors (over 2) for the variables included in the analysis and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.

None of the standard errors or B coefficients are excessively large, so there is no evidence of a numeric problem with this analysis.


Demonstrating Multicollinearity

To demonstrate the identification of multicollinearity, I created a duplicate variable named OT2 that was identical to OUTDOOR for all but one case. SPSS provides us with a warning on the Likelihood Ratio Test. In the table of parameter estimates, we find very large values for the standard errors of the collinear variables, OUTDOOR and OT2.


Presence of outliers


Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.

Stage 5: Interpret the Results

In this section, we address the following issues:

• Identifying the statistically significant predictor variables
• Direction of relationship and contribution to dependent variable


Identifying the statistically significant predictor variables

There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and Parameter Estimates. The Likelihood Ratio Tests indicate the contribution of the variable to the overall relationship between the dependent variable and the individual independent variables. The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable.

In addition, all three variables play a significant role in both of the logistic regression equations.


Each likelihood ratio test is a hypothesis test that the variable contributes to the reduction in error measured by the –2 log likelihood statistic. In this model, the three variables are all significant contributors to explaining differences in job classification.

Direction of relationship and contribution to dependent variable - 1

Interpretation of the independent variables is based on the sign of the coefficient in the B column and aided by the "Exp (B)" column which contains the odds ratio for each independent variable.


Direction of relationship and contribution to dependent variable - 2


We can state the relationships as follows:

Higher outdoor activity scores decrease the likelihood that the subject is a passenger agent instead of operations control and increase the likelihood that the subject is a mechanic rather than operations control. A one-unit increase in outdoor activity score is associated with a 22% decrease (0.778 – 1.0) in the odds of being classified as a passenger agent rather than operations control. A one-unit increase in outdoor activity score is associated with a 23% increase (1.233 – 1.0) in the odds of being classified as a mechanic rather than operations control.

Higher convivial scores increase the likelihood that the subject is a passenger agent or a mechanic, rather than operations control. It would probably make more sense to our audience to reverse the direction of these relationships and combine them to state that higher convivial scores decrease the likelihood that the subject is operations control, rather than a passenger agent or a mechanic. A one-unit increase in convivial score is associated with a 44% decrease in the odds of being operations control rather than a passenger agent (1/1.770 – 1.0). A one-unit increase in convivial score is associated with a 29% decrease in the odds of being operations control rather than a mechanic (1/1.408 – 1.0).

Higher conservative scores decrease the odds of being a passenger agent rather than operations control by 38% (0.624 – 1.0), and decrease the odds of being a mechanic rather than operations control by 29% (0.713 – 1.0).
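The percentages above all come from the same conversion: subtract 1 from the odds ratio, or from its reciprocal when the comparison is reversed. The Python sketch below repeats that conversion for the Exp(B) values quoted in the text; the function name is illustrative.

```python
def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


# Exp(B) values quoted in the text; operations control is the reference group.
exp_b = {
    ("Passenger Agents", "OUTDOOR"): 0.778,
    ("Mechanics", "OUTDOOR"): 1.233,
    ("Passenger Agents", "CONVIV"): 1.770,
    ("Mechanics", "CONVIV"): 1.408,
    ("Passenger Agents", "CONSERV"): 0.624,
    ("Mechanics", "CONSERV"): 0.713,
}

for (group, variable), odds_ratio in exp_b.items():
    change = percent_change_in_odds(odds_ratio)
    print(f"{variable}: {group} vs. Operations Control: {change:+.0f}%")

# Reversing the comparison (operations control vs. the other group)
# uses the reciprocal of the odds ratio.
for group in ("Passenger Agents", "Mechanics"):
    reversed_change = percent_change_in_odds(1.0 / exp_b[(group, "CONVIV")])
    print(f"CONVIV: Operations Control vs. {group}: {reversed_change:+.0f}%")
```

Taking the reciprocal is what turns the 77% increase for passenger agents into a 44% decrease for operations control; the two statements describe the same coefficient.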

Stage 6: Validate The Model

The first step in our validation analysis is to create the split variable.

* Compute the split variable for the learning and validation samples.
SET SEED 2000000.
COMPUTE split = uniform(1) > 0.50 .
EXECUTE .
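For readers who want to see the same idea outside SPSS, here is a minimal Python sketch of a seeded random half-split. It is an analogue, not a reproduction: Python's random number generator differs from the one in SPSS, so the same seed will not assign the same cases to each half. The function name make_split is made up for this illustration.

```python
import random


def make_split(n_cases, seed=2000000):
    """Return a 0/1 flag per case: 1 when a uniform(0, 1) draw exceeds 0.50."""
    rng = random.Random(seed)  # seeded so the split can be repeated
    return [1 if rng.random() > 0.50 else 0 for _ in range(n_cases)]


split = make_split(244)  # 244 cases, as in the data set

# Row indices for the two halves: split = 0 builds the first model,
# split = 1 is held out to validate it.
learning_rows = [i for i, flag in enumerate(split) if flag == 0]
validation_rows = [i for i, flag in enumerate(split) if flag == 1]

print(len(learning_rows), "learning cases,", len(validation_rows), "validation cases")
```

Because the draws are random, the two halves are only approximately equal in size.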


Creating the Multinomial Logistic Regression for the First Half of the Data

Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.

* Select the cases to include in the first validation analysis.
USE ALL.
COMPUTE filter_$=(split=0).
FILTER BY filter_$.
EXECUTE .

* Run the multinomial logistic regression for these cases.
NOMREG jobclass WITH outdoor conviv conserv
  /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /MODEL
  /INTERCEPT = INCLUDE
  /PRINT = CLASSTABLE PARAMETER SUMMARY LRT .


Entering the Logistic Regression Coefficients into SPSS

To compute the classification scores for the logistic regression equations, we need to enter the coefficients for each equation into SPSS.

Next, we enter the B coefficients into SPSS using compute commands. For the first set of coefficients, we will use the letter A, followed by a number. For the second set of coefficients, we will use the letter B, followed by a number. The complete set of compute commands is shown below the graphic.


Create the coefficients in SPSS

* Assign the coefficients from the model just run to variables.
Compute A0 = -2.4701157307273.
Compute A1 = -0.241834593548177.
Compute A2 = 0.593813821000787.
Compute A3 = -0.527590120671775.
Compute B0 = -5.7234318621925.
Compute B1 = 0.248534933393711.
Compute B2 = 0.346110750220431.
Compute B3 = -0.382239899553461.
Execute.


Entering the Logistic Regression Equations into SPSS

The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.

Compute g1 = A0 + A1 * OUTDOOR + A2 * CONVIV + A3 * CONSERV.
Compute g2 = B0 + B1 * OUTDOOR + B2 * CONVIV + B3 * CONSERV.
Compute g3 = 0.
Execute.

When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.


Converting Classification Scores into Predicted Group Membership

We convert the three scores into odds relative to the reference group using the EXP function. When we divide each exponentiated score by the sum of the three exponentiated scores, we end up with a probability of membership in each group.

* Compute the probabilities of membership in each group.
Compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
Compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
Compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
Execute.

The following IF statements compare probabilities to predict group membership.

* Translate the probabilities into predicted group membership.
If (p1 > p2 and p1 > p3) predgrp = 1.
If (p2 > p1 and p2 > p3) predgrp = 2.
If (p3 > p1 and p3 > p2) predgrp = 3.
Execute.

When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.
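The scoring steps (classification scores, probabilities, predicted group) can be sketched in Python as a cross-check on the SPSS compute statements. The coefficients are the ones assigned to A0-A3 and B0-B3 above; the function names and the three example cases are hypothetical and are not taken from the data file. One difference is worth noting: on an exact tie this sketch returns the first group with the highest probability, whereas the SPSS IF statements leave predgrp unassigned.

```python
import math

# Coefficients (intercept, OUTDOOR, CONVIV, CONSERV) for each group's equation.
# Group 3 (operations control) is the reference group, so its score is 0.
COEFFICIENTS = {
    1: (-2.4701157307273, -0.241834593548177, 0.593813821000787, -0.527590120671775),
    2: (-5.7234318621925, 0.248534933393711, 0.346110750220431, -0.382239899553461),
    3: (0.0, 0.0, 0.0, 0.0),
}


def group_probabilities(outdoor, conviv, conserv):
    """Return {group: probability of membership} for one case."""
    scores = {
        group: b0 + b1 * outdoor + b2 * conviv + b3 * conserv
        for group, (b0, b1, b2, b3) in COEFFICIENTS.items()
    }
    # Subtracting the largest score before exponentiating avoids overflow
    # and leaves the resulting probabilities unchanged.
    top = max(scores.values())
    exps = {group: math.exp(score - top) for group, score in scores.items()}
    total = sum(exps.values())
    return {group: value / total for group, value in exps.items()}


def predicted_group(outdoor, conviv, conserv):
    """Return the group code with the highest predicted probability."""
    probs = group_probabilities(outdoor, conviv, conserv)
    return max(probs, key=probs.get)


# Hypothetical cases for illustration only (OUTDOOR, CONVIV, CONSERV).
for case in [(10, 25, 10), (20, 15, 10), (10, 10, 15)]:
    print(case, "-> predicted group", predicted_group(*case))
```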


The Classification Table

To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted job classification versus actual job classification.

USE ALL.
COMPUTE filter_$=(split=1).
FILTER BY filter_$.
EXECUTE.
CROSSTABS
  /TABLES=jobclass BY predgrp
  /FORMAT= AVALUE TABLES
  /CELLS= COUNT TOTAL .

These commands produce the following table. The classification accuracy rate is computed by adding the percents for the cells where predicted group membership coincides with actual job classification: 23.1% + 34.7% + 19.8% = 77.6%.
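The accuracy calculation is simply the sum of the diagonal cells of the crosstab. The Python sketch below shows that step with the three diagonal percents quoted above. The off-diagonal cells are not given in the text, so they are omitted, and pairing each percent with a particular group code is an assumption made only so the cells can be labeled.

```python
def accuracy_from_cells(cell_percents):
    """Sum the percent-of-total cells where actual and predicted group agree."""
    return sum(
        percent
        for (actual, predicted), percent in cell_percents.items()
        if actual == predicted
    )


# Diagonal cells keyed by (actual group, predicted group).
# Assumption: the three percents are listed in group-code order 1, 2, 3.
diagonal_cells = {(1, 1): 23.1, (2, 2): 34.7, (3, 3): 19.8}

validation_accuracy = round(accuracy_from_cells(diagonal_cells), 1)
print(f"validation accuracy = {validation_accuracy}%")
```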

We enter this information in the validation table.

Computing the Second Validation Analysis

The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on cases where split = 0. The results from my calculations have been entered into the validation table below.


Generalizability of the Multinomial Logistic Regression Model

We can summarize the results of the validation analyses in the following table.


From the validation table, we see that the original model is verified by the accuracy rates for the validation analyses. The significant predictors are the same in all analyses, so the validation has been successful.

                                      Full Model            Split = 0             Split = 1
Model Chi-Square                      249.643, p < 0.0001   128.409, p < 0.0001   118.992, p < 0.0001
Nagelkerke R²                         0.722                 0.731                 0.708
Accuracy Rate for Learning Sample     75.4%                 74.8%                 76.9%
Accuracy Rate for Validation Sample                         77.6%                 73.1%
Significant Coefficients (p < 0.05)   OUTDOOR CONVIV CONSERV



Multinomial Logistic Regression Results versus Discriminant Analysis Results - 1

Both analyses found that the three predictor variables were useful and accurate in identifying differences between the three classifications of personnel. Reconciling the differences between interpretation of the individual coefficients requires more work.

In the logistic regression, we found that higher outdoor activity scores decrease the likelihood that the subject is a passenger agent instead of operations control and increase the likelihood that the subject is a mechanic rather than operations control. A higher outdoor activity score increasing the likelihood that a person is a mechanic rather than operations control is consistent with the finding in discriminant analysis that mechanics had a higher average outdoor activity score than did operations control staff.

In interpreting convivial scores, we stated that higher convivial scores decrease the likelihood that the subject is operations control, rather than a passenger agent or a mechanic. This would imply that the average convivial score was lower for operations control than the other two groups. While this was verified by the discriminant analysis, we identified the difference as passenger agents having a higher convivial score than either mechanics or operations control. From the comparison thus far, we can conclude that passenger agents had higher average convivial scores than operations control, but it takes more work to verify that passenger agents had higher convivial scores than mechanics.

We can compare the contribution of convivial scores to distinguishing passenger agents from mechanics by subtracting the B coefficient of convivial scores for mechanics from the B coefficient of convivial scores for passenger agents, 0.571 - 0.342 = 0.229. The odds ratio for passenger agents versus mechanics is EXP(0.229) = 1.257. Higher convivial scores increased the likelihood that a subject was a passenger agent rather than a mechanic, implying that passenger agents had higher average convivial scores than mechanics. The difference between the interpretation of convivial scores for the two analyses is a difference in emphasis rather than an inconsistency in the results of the analysis, even though we might not be able to reconcile the differences immediately.

Multinomial Logistic Regression Results versus Discriminant Analysis Results - 2

Parameter Estimates

Job Classification                B        Std. Error   Wald     df   Sig.   Exp(B)   95% Confidence Interval for Exp(B): Lower Bound
Passenger Agents    Intercept     -2.430   1.595        2.321    1    .128
                    OUTDOOR       -.251    .070         12.852   1    .000   .778     .678
                    CONVIV        .571     .076         56.336   1    .000   1.770    1.525
                    CONSERV       -.471    .088         28.667   1    .000   .624     .525
Mechanics           Intercept     -5.588   1.491        14.037   1    .000
                    OUTDOOR       .209     .056         13.767   1    .000   1.233    1.104
                    CONVIV        .342     .063         29.373   1    .000   1.408    1.244
                    CONSERV       -.338    .073         21.197   1    .000   .713     .618

Similarly, the logistic regression found that higher conservative scores decrease the likelihood that a subject is a passenger agent rather than operations control. This implies that the average conservative score for passenger agents was lower than the average for operations control, which was consistent with the discriminant finding that passenger agents had lower conservative scores than the other two groups. We can compare the contribution of conservative scores to distinguishing passenger agents from mechanics by subtracting the B coefficient of conservative scores for mechanics from the B coefficient of conservative scores for passenger agents, (-0.471) – (-0.338) = -0.133. The odds ratio for passenger agents versus mechanics is EXP(-0.133) = 0.875. Higher conservative scores decreased the likelihood that a subject was a passenger agent rather than a mechanic, implying that passenger agents had lower average conservative scores than mechanics. In sum, passenger agents had lower average conservative scores than both other groups, consistent with the findings of the discriminant analysis.
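The comparison between the two non-reference groups follows one rule: subtract the B coefficients and exponentiate the difference. A short Python sketch of that step, using the B values from the Parameter Estimates table, is given here; the function name is illustrative.

```python
import math

# B coefficients from the Parameter Estimates table; each equation compares
# the named group with operations control (the reference group).
B = {
    "Passenger Agents": {"OUTDOOR": -0.251, "CONVIV": 0.571, "CONSERV": -0.471},
    "Mechanics": {"OUTDOOR": 0.209, "CONVIV": 0.342, "CONSERV": -0.338},
}


def odds_ratio_between(group_a, group_b, variable):
    """Odds ratio for group_a versus group_b per one-unit increase in variable."""
    difference = B[group_a][variable] - B[group_b][variable]
    return math.exp(difference)


for variable in ("OUTDOOR", "CONVIV", "CONSERV"):
    ratio = odds_ratio_between("Passenger Agents", "Mechanics", variable)
    print(f"{variable}: passenger agents vs. mechanics odds ratio = {ratio:.3f}")
```

An odds ratio above 1 means higher scores favor passenger agents over mechanics, and a ratio below 1 means they favor mechanics. The OUTDOOR contrast, which the text does not work out, follows from the same subtraction.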
