Multinomial Logistic Regression: A Problem in Personnel Classification


Post on 23-Dec-2015


TRANSCRIPT

  • Slide 1
  • Multinomial Logistic Regression: A Problem in Personnel Classification
    In our first class on Multinomial Logistic Regression we studied Rulon's problem in personnel classification as an example of discriminant analysis with a three-group categorical variable. We found that there were two statistically significant discriminant functions distinguishing the three groups (mechanics, passenger agents, and operations control). The three predictor variables (outdoor activity score, convivial score, and conservative score) all had a statistically significant relationship to membership in the dependent variable groups. The cross-validated accuracy rate for the discriminant model was 75.0%.
    Our interpretation of the variables and functions concluded that:
      • The first discriminant function distinguishes Passenger Agents from Mechanics and Operations Control staff. The two variables that are important on the first function are convivial score and conservative score. Passenger Agents had higher convivial scores and lower conservative scores than the other two groups.
      • Operations Control staff are distinguished from Mechanics by the second discriminant function, which contains only a single variable, the outdoor activity score. Mechanics had a higher average outdoor activity score than Operations Control staff.
    In sum, Passenger Agents are more outgoing (convivial) and more tolerant (less conservative) than Mechanics and Operations Control personnel. Mechanics differ from Operations Control personnel in their stronger preference for outdoor-oriented activities.
  • Slide 2
  • Multinomial Logistic Regression: A Problem in Personnel Classification
    This problem is from Phillip J. Rulon, David V. Tiedeman, Maurice Tatsuoka, and Charles R. Langmuir, Multivariate Statistics for Personnel Classification, 1967.
    The sample data are for "World Airlines, a company employing over 50,000 persons and operating scheduled flights. This company naturally needs many men who can be assigned to a particular set of functions. The mechanics on the line who service the equipment of World Airlines form one of the groups we shall consider. A second group are the agents who deal with the passengers of the airline. A third group are the men in operations who coordinate airline activities. The personnel officer of World Airlines has developed an Activity Preference Inventory for the use of the airline. The first section of this inventory contains 30 pairs of activities, each pair naming an indoor activity and an outdoor activity. One item is _____ Billiards : Golf _____ The applicant for a job in World Airlines checks the activity he prefers. The score is the number of outdoor activities marked." (page 24)
    The second section of the Activity Preference Inventory "contains 35 items. One activity of each pair is a solitary activity, the other convivial. An example is _____ Solitaire : Bridge _____ The apprentice's score is the number of convivial activities he prefers." (page 82)
  • Slide 3
  • Multinomial Logistic Regression: A Problem in Personnel Classification
    The third section of the Activity Preference Inventory "contains 25 items. One activity of each pair is a liberal activity, the other a conservative activity. An example is _____ Counseling : Advising _____ The apprentice's score is the number of conservative activities he prefers." (page 153)
    The Activity Preference Inventory was administered to 244 employees in the three job classifications who were successful and satisfied with their jobs. The dependent variable, JOBCLASS 'Job Classification', included three job classifications: 1 - Passenger Agents, 2 - Mechanics, and 3 - Operations Control.
    The purpose of the analysis is to develop a classification scheme, based on scores on the Activity Preference Inventory, for assigning new employees to the different job groups.
    The data for this problem are in the file ActivityPreferenceInventory.Sav. We will re-analyze this problem with multinomial logistic regression.
  • Slide 4
  • Stage 1: Define the Research Problem
    In this stage, the following issues are addressed:
      • Relationship to be analyzed
      • Specifying the dependent and independent variables
      • Method for including independent variables
    Relationship to be analyzed: We are interested in the relationship between scores on the three scales of the Activity Preference Inventory and the different job classifications.
    Specifying the dependent and independent variables: The dependent variable is JOBCLASS 'Job Classification', coded as 1 = Passenger Agents, 2 = Mechanics, and 3 = Operations Control. The independent variables are OUTDOOR 'Outdoor Activity Score', CONVIV 'Convivial Score', and CONSERV 'Conservative Score'.
    Method for including independent variables: Direct entry of the independent variables is the only method for including variables available with the SPSS Multinomial Logistic Regression procedure.
  • Slide 5
  • Stage 2: Develop the Analysis Plan: Sample Size Issues
    In this stage, the following issues are addressed:
      • Missing data analysis
      • Minimum sample size requirement: 15-20 cases per independent variable
    Missing data analysis: There are no missing data in this problem.
    Minimum sample size requirement: The data set contains 244 subjects and 3 independent variables. The ratio of 81 cases per independent variable exceeds the minimum sample size requirement of 15-20 cases per independent variable.
  • Slide 6
  • Stage 2: Develop the Analysis Plan: Measurement Issues
    In this stage, the following issues are addressed:
      • Incorporating nonmetric data with dummy variables
      • Representing curvilinear effects with polynomials
      • Representing interaction or moderator effects
    Incorporating nonmetric data with dummy variables: All of the variables are metric.
    Representing curvilinear effects with polynomials: We do not have any evidence of curvilinear effects at this point in the analysis.
    Representing interaction or moderator effects: We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.
  • Slide 7
  • Stage 3: Evaluate Underlying Assumptions
    In this stage, the following issues are addressed:
      • Nonmetric dependent variable with more than two groups
      • Metric or dummy-coded independent variables
    Nonmetric dependent variable with more than two groups: The dependent variable JOBCLASS 'Job Classification' has three groups.
    Metric or dummy-coded independent variables: All of the independent variables are metric.
  • Slide 8
  • Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation
    In this stage, the following issue is addressed as part of model estimation:
      • Compute the logistic regression model
    Compute the logistic regression: The steps to obtain a logistic regression analysis are detailed on the following screens.
  • Slide 9
  • Requesting a Logistic Regression
  • Slide 10
  • Specifying the Dependent Variable
  • Slide 11
  • Specifying the Independent Variables
  • Slide 12
  • Specifying the Statistics to Include in the Output
  • Slide 13
  • Complete the Logistic Regression Request
  • Slide 14
  • Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit
    In this stage, the following issues are addressed:
      • Significance test of the model log likelihood (change in -2LL)
      • Measures analogous to R²: Cox and Snell R² and Nagelkerke R²
      • Classification matrices as a measure of model accuracy
      • Check for numerical problems
      • Presence of outliers
  • Slide 15
  • Significance test of the model log likelihood
    The initial log likelihood function (-2 Log Likelihood or -2LL) is a statistical measure like the total sum of squares in regression. If our independent variables have a relationship to the dependent variable, we will improve our ability to predict the dependent variable accurately, and the log likelihood measure will decrease.
    The initial log likelihood value (529.883) is a measure of a model with no independent variables, i.e. only the intercept as a constant. The final log likelihood value (280.240) is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value (249.643 = 529.883 - 280.240) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.
    In this problem the model chi-square value of 249.643 has a significance < 0.0001, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
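The arithmetic behind the model chi-square test can be checked outside SPSS. The Python sketch below is only an illustration: it takes the two -2LL values from the slide and assumes 6 degrees of freedom (3 predictors times 2 non-reference job groups), a value the slide itself does not state. The helper `chi_square_sf_even_df` is a name made up here; it uses the closed-form upper-tail probability that holds for even degrees of freedom.

```python
import math

def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!,
    which is exact only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

initial_neg2ll = 529.883   # intercept-only model, from the slide
final_neg2ll = 280.240     # model with all three predictors, from the slide
model_chi_square = initial_neg2ll - final_neg2ll

# Assumption: 3 predictors x (3 groups - 1) equations = 6 degrees of freedom.
df = 3 * (3 - 1)
p_value = chi_square_sf_even_df(model_chi_square, df)

print(f"model chi-square = {model_chi_square:.3f}, df = {df}")
print("significant at 0.0001:", p_value < 0.0001)
```

With a statistic this large the p-value is far below any conventional threshold, which matches the conclusion on the slide.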
  • Slide 16
  • Measures Analogous to R²
    The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.
    The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that ranges from 0 to 1. We will rely upon Nagelkerke's measure as indicating the strength of the relationship.
    If we applied our interpretive criteria to the Nagelkerke R², we would characterize the relationship as very strong.
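As a rough check on these measures, the usual textbook formulas can be applied to the -2LL values quoted on the previous slide. This Python sketch is an approximation under stated assumptions: it treats 529.883 as the full intercept-only -2LL for all 244 cases, so the result may differ slightly from SPSS output if the reported likelihood omits constant terms. The function names are illustrative, not part of any package.

```python
import math

def cox_snell_r2(model_chi_square, n):
    """Cox and Snell R^2 = 1 - exp(-chi_square / n)."""
    return 1.0 - math.exp(-model_chi_square / n)

def nagelkerke_r2(model_chi_square, initial_neg2ll, n):
    """Cox and Snell R^2 rescaled by its maximum attainable value."""
    max_cox_snell = 1.0 - math.exp(-initial_neg2ll / n)
    return cox_snell_r2(model_chi_square, n) / max_cox_snell

n_cases = 244
initial_neg2ll = 529.883              # from the slides
model_chi_square = 529.883 - 280.240  # from the slides

cs = cox_snell_r2(model_chi_square, n_cases)
nk = nagelkerke_r2(model_chi_square, initial_neg2ll, n_cases)
print(f"Cox and Snell R^2 = {cs:.3f}, Nagelkerke R^2 = {nk:.3f}")
```

The Nagelkerke value from this sketch lands close to the 0.722 reported for the full model in the validation table later in the deck.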
  • Slide 17
  • The Classification Matrices as a Measure of Model Accuracy - 1
    The classification matrices in logistic regression serve the same function as the classification matrices in discriminant analysis, i.e. evaluating the accuracy of the model.
    If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted group membership and actual group membership are different, the model "misses" for that case.
    The overall percentage of accurate predictions (75.4% in this case) is the measure of a model that I rely on most heavily for this analysis as well as for discriminant analysis, because it has a meaning that is readily communicated, i.e. the percentage of cases for which our model predicts accurately.
    To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and the maximum by chance accuracy rate, if appropriate.
    The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample, in this case (0.348 x 0.348) + (0.381 x 0.381) + (0.270 x 0.270) = 0.339. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 0.339 = 0.4237. Our model accuracy rate of 75.4% exceeds this standard.
  • Slide 18
  • The Classification Matrices as a Measure of Model Accuracy - 2
    The maximum chance criterion is the proportion of cases in the largest group, 38.1% in this problem. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 38.1% = 47.6%. Our model accuracy rate of 75.4% exceeds this standard.
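Both chance criteria follow directly from the three group proportions given on the first classification slide (0.348, 0.381, 0.270). The Python sketch below recomputes them; `required_accuracy` is a made-up helper name, and the 25% improvement margin is the rule of thumb stated on the slides. Note that the largest of the three proportions is 0.381.

```python
group_proportions = [0.348, 0.381, 0.270]  # from the slides
model_accuracy = 0.754                     # overall accuracy, from the slides

# Proportional chance criterion: sum of squared group proportions.
proportional_chance = sum(p * p for p in group_proportions)

# Maximum chance criterion: proportion of cases in the largest group.
maximum_chance = max(group_proportions)

def required_accuracy(chance_rate, improvement=0.25):
    """Accuracy the model must exceed: chance rate plus a 25% margin."""
    return (1.0 + improvement) * chance_rate

for label, chance in [("proportional", proportional_chance),
                      ("maximum", maximum_chance)]:
    standard = required_accuracy(chance)
    print(f"{label} chance = {chance:.3f}, standard = {standard:.3f}, "
          f"model exceeds it: {model_accuracy > standard}")
```

The unrounded proportional standard differs from the slide's 0.4237 in the fourth decimal place because the slide multiplies the already rounded 0.339.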
  • Slide 19
  • Check for Numerical Problems
    There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages: multicollinearity among the independent variables; zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable; and "complete separation", whereby the groups in the dependent variable can be perfectly separated by scores on one of the independent variables.
    All of these problems produce large standard errors (over 2) for the variables included in the analysis and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.
    None of the standard errors or B coefficients are excessively large, so there is no evidence of a numerical problem with this analysis.
  • Slide 20
  • Demonstrating Multicollinearity
    To demonstrate the identification of multicollinearity, I created a duplicate variable named OT2 that was identical to OUTDOOR for all but one case.
    SPSS provides us with a warning on the Likelihood Ratio Test. In the table of parameter estimates, we find very large values for the standard errors of the collinear variables, OUTDOOR and OT2.
  • Slide 21
  • Presence of outliers
    Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.
  • Slide 22
  • Stage 5: Interpret the Results
    In this section, we address the following issues:
      • Identifying the statistically significant predictor variables
      • Direction of relationship and contribution to dependent variable
  • Slide 23
  • Identifying the statistically significant predictor variables
    There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and the Parameter Estimates.
    The Likelihood Ratio Tests indicate the contribution of each variable to the overall relationship between the dependent variable and the individual independent variables. Each likelihood ratio test is a hypothesis test that the variable contributes to the reduction in error measured by the -2 log likelihood statistic. In this model, the three variables are all significant contributors to explaining differences in job classification.
    The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable. Here, all three variables also play a significant role in both of the logistic regression equations.
  • Slide 24
  • Direction of relationship and contribution to dependent variable - 1
    Interpretation of the independent variables is based on the sign of the coefficient in the B column and aided by the "Exp (B)" column, which contains the odds ratio for each independent variable.
  • Slide 25
  • Direction of relationship and contribution to dependent variable - 2
    We can state the relationships as follows:
      • Higher outdoor activity scores decrease the likelihood that the subject is a passenger agent instead of operations control and increase the likelihood that the subject is a mechanic rather than operations control. A one-unit increase in outdoor activity score is associated with a 22% decrease (0.778 - 1.0) in the odds of being classified as a passenger agent. A one-unit increase in outdoor activity score is associated with a 23% increase (1.233 - 1.0) in the odds of being classified as a mechanic.
      • Higher convivial scores increase the likelihood that the subject is a passenger agent or a mechanic, rather than operations control. It would probably make more sense to our audience to reverse the direction of these relationships and combine them to state that higher convivial scores decrease the likelihood that the subject is operations control, rather than a passenger agent or a mechanic. A one-unit increase in convivial score is associated with a 44% decrease in the likelihood of being operations control rather than a passenger agent (1/1.77 - 1.0). A one-unit increase in convivial score is associated with a 29% decrease in the likelihood of being operations control rather than a mechanic (1/1.408 - 1.0).
      • Higher conservative scores decrease the likelihood of being a passenger agent rather than operations control by 38% (0.624 - 1.0), and decrease the likelihood of being a mechanic rather than operations control by 29% (0.713 - 1.0).
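The percentage statements on this slide all come from the same two conversions: subtracting 1.0 from an odds ratio, or from its reciprocal when the comparison is reversed. A short Python sketch of that arithmetic, using the odds ratios quoted on the slide (the function names are illustrative only):

```python
def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0

def reversed_percent_change(odds_ratio):
    """Same change, stated for the reference group versus the modeled group."""
    return (1.0 / odds_ratio - 1.0) * 100.0

# Exp(B) values quoted on the slide; operations control is the reference group.
print("outdoor, passenger agent:", round(percent_change_in_odds(0.778)))
print("outdoor, mechanic:", round(percent_change_in_odds(1.233)))
print("conviv, operations vs passenger agent:", round(reversed_percent_change(1.77)))
print("conviv, operations vs mechanic:", round(reversed_percent_change(1.408)))
print("conserv, passenger agent:", round(percent_change_in_odds(0.624)))
print("conserv, mechanic:", round(percent_change_in_odds(0.713)))
```

Negative results correspond to the "decrease" statements and positive results to the "increase" statements.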
  • Slide 26
  • Stage 6: Validate the Model
    The first step in our validation analysis is to create the split variable.
      * Compute the split variable for the learning and validation samples.
      SET SEED 2000000.
      COMPUTE split = uniform(1) > 0.50.
      EXECUTE.
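For readers working outside SPSS, the same idea (a seeded uniform draw per case, cut at 0.50) can be sketched in Python. This is an analogue, not a reproduction: Python's generator will not produce the same random numbers as SPSS for seed 2000000, so the cases landing in each half will differ. `make_split` is a hypothetical helper name.

```python
import random

def make_split(n_cases, seed=2000000, cutoff=0.50):
    """Return a 0/1 split flag per case; 1 when the uniform draw exceeds cutoff."""
    rng = random.Random(seed)  # private generator so the split is reproducible
    return [1 if rng.random() > cutoff else 0 for _ in range(n_cases)]

split = make_split(244)  # 244 cases, as in the data set described above
learning = [i for i, flag in enumerate(split) if flag == 0]
validation = [i for i, flag in enumerate(split) if flag == 1]
print(len(learning), "learning cases;", len(validation), "validation cases")
```

Because the draw is random, the two halves are only approximately equal in size, just as with the SPSS split variable.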
  • Slide 27
  • Creating the Multinomial Logistic Regression for the First Half of the Data
    Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.
      * Select the cases to include in the first validation analysis.
      USE ALL.
      COMPUTE filter_$=(split=0).
      FILTER BY filter_$.
      EXECUTE.
      * Run the multinomial logistic regression for these cases.
      NOMREG jobclass WITH outdoor conviv conserv
        /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
        /MODEL
        /INTERCEPT = INCLUDE
        /PRINT = CLASSTABLE PARAMETER SUMMARY LRT.
  • Slide 28
  • Entering the Logistic Regression Coefficients into SPSS
    To compute the classification scores for the logistic regression equations, we need to enter the B coefficients for each equation into SPSS using compute commands.
    For the first set of coefficients, we will use the letter A, followed by a number. For the second set of coefficients, we will use the letter B, followed by a number.
    The complete set of compute commands is below the graphic.
  • Slide 29
  • Create the coefficients in SPSS
      * Assign the coefficients from the model just run to variables.
      Compute A0 = -2.4701157307273.
      Compute A1 = -0.241834593548177.
      Compute A2 = 0.593813821000787.
      Compute A3 = -0.527590120671775.
      Compute B0 = -5.7234318621925.
      Compute B1 = 0.248534933393711.
      Compute B2 = 0.346110750220431.
      Compute B3 = -0.382239899553461.
      Execute.
  • Slide 30
  • Entering the Logistic Regression Equations into SPSS
    The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.
      Compute g1 = A0 + A1 * OUTDOOR + A2 * CONVIV + A3 * CONSERV.
      Compute g2 = B0 + B1 * OUTDOOR + B2 * CONVIV + B3 * CONSERV.
      Compute g3 = 0.
      Execute.
    When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.
  • Slide 31
  • Converting Classification Scores into Predicted Group Membership
    We convert the three scores into odds (relative to the third group) using the EXP function. When we divide each exponentiated score by the sum of the three, we end up with a probability of membership in each group.
      * Compute the probabilities of membership in each group.
      Compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
      Compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
      Compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
      Execute.
    The following IF statements compare the probabilities to predict group membership.
      * Translate the probabilities into predicted group membership.
      If (p1 > p2 and p1 > p3) predgrp = 1.
      If (p2 > p1 and p2 > p3) predgrp = 2.
      If (p3 > p1 and p3 > p2) predgrp = 3.
      Execute.
    When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.
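The scoring steps on the last three slides (coefficients, equations, probabilities, predicted group) can be collected into one small function. The Python sketch below mirrors the SPSS compute statements using the coefficients listed above. Two details are choices made here rather than taken from the slides: the largest score is subtracted before exponentiating, which leaves the probabilities unchanged but avoids overflow, and an exact tie goes to the lower-numbered group, whereas the SPSS IF statements would leave predgrp unset. The applicant scores in the example are hypothetical, not cases from the data set.

```python
import math

# B coefficients from the split = 0 model (intercept, OUTDOOR, CONVIV, CONSERV).
A = (-2.4701157307273, -0.241834593548177, 0.593813821000787, -0.527590120671775)
B = (-5.7234318621925, 0.248534933393711, 0.346110750220431, -0.382239899553461)

def membership_probabilities(outdoor, conviv, conserv):
    """Return [p1, p2, p3] for passenger agent, mechanic, operations control."""
    g1 = A[0] + A[1] * outdoor + A[2] * conviv + A[3] * conserv
    g2 = B[0] + B[1] * outdoor + B[2] * conviv + B[3] * conserv
    g3 = 0.0  # reference group
    scores = (g1, g2, g3)
    top = max(scores)  # shift for numerical stability; ratios are unaffected
    exps = [math.exp(g - top) for g in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_group(outdoor, conviv, conserv):
    """Group number (1, 2, or 3) with the highest membership probability."""
    probs = membership_probabilities(outdoor, conviv, conserv)
    return probs.index(max(probs)) + 1

# Hypothetical applicant scores, for illustration only.
example = (20, 25, 10)
print([round(p, 3) for p in membership_probabilities(*example)])
print("predicted group:", predicted_group(*example))
```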
  • Slide 32
  • The Classification Table
    To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted versus actual job classification.
      USE ALL.
      COMPUTE filter_$=(split=1).
      FILTER BY filter_$.
      EXECUTE.
      CROSSTABS
        /TABLES=jobclass BY predgrp
        /FORMAT= AVALUE TABLES
        /CELLS= COUNT TOTAL.
    These commands produce the following table.
    The classification accuracy rate is computed by adding the percents for the cells where the predicted group coincides with the actual group: 23.1% + 34.7% + 19.8% = 77.6%. We enter this information in the validation table.
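The accuracy rate is simply the share of cases on the diagonal of the table. A minimal Python sketch of both routes to that number follows: summing the three diagonal percents quoted on the slide, and counting matches directly from actual and predicted labels. The label lists are toy values invented to exercise the function, not cases from the data set.

```python
def classification_accuracy(actual, predicted):
    """Share of cases whose predicted group equals the actual group."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

# Diagonal cells of the validation crosstab, as percents of the table total.
diagonal_percents = [23.1, 34.7, 19.8]
validation_accuracy = sum(diagonal_percents)
print(f"validation accuracy = {validation_accuracy:.1f}%")

# Hypothetical labels, only to show the function in use.
toy_actual = [1, 1, 2, 2, 3, 3]
toy_predicted = [1, 2, 2, 2, 3, 1]
print(f"toy accuracy = {classification_accuracy(toy_actual, toy_predicted):.3f}")
```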
  • Slide 33
  • Computing the Second Validation Analysis
    The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on cases where split = 0.
    The results from my calculations have been entered into the validation table below.
  • Slide 34
  • Generalizability of the Multinomial Logistic Regression Model
    We can summarize the results of the validation analyses in the following table.

                                           Full Model               Split = 0                Split = 1
    Model Chi-Square                       249.643, p < 0.0001      128.409, p < 0.0001      118.992, p < 0.0001
    Nagelkerke R²                          0.722                    0.731                    0.708
    Accuracy Rate for Learning Sample      75.4%                    74.8%                    76.9%
    Accuracy Rate for Validation Sample    n/a                      77.6%                    73.1%
    Significant Coefficients (p < 0.05)    OUTDOOR CONVIV CONSERV   OUTDOOR CONVIV CONSERV   OUTDOOR CONVIV CONSERV

    From the validation table, we see that the original model is verified by the accuracy rates for the validation analyses. The significant predictors are the same in all analyses, so the validation has been successful.
  • Slide 35
  • Multinomial Logistic Regression Results versus Discriminant Analysis Results - 1
    Both analyses found that the three predictor variables were useful and accurate in identifying differences between the three classifications of personnel. Reconciling the differences between the interpretations of the individual coefficients requires more work.
    In the logistic regression, we found that higher outdoor activity scores decrease the likelihood that the subject is a passenger agent instead of operations control and increase the likelihood that the subject is a mechanic rather than operations control. A higher outdoor activity score increasing the likelihood that a person is a mechanic rather than operations control is consistent with the finding in discriminant analysis that mechanics had a higher average outdoor activity score than did operations control staff.
    In interpreting convivial scores, we stated that higher convivial scores decrease the likelihood that the subject is operations control, rather than a passenger agent or a mechanic. This would imply that the average convivial score was lower for operations control than for the other two groups. While this was verified by the discriminant analysis, there we identified the difference as passenger agents having a higher convivial score than either mechanics or operations control. From the comparison thus far, we can conclude that passenger agents had higher average convivial scores than operations control, but it takes more work to verify that passenger agents had higher convivial scores than mechanics.
    We can compare the contribution of convivial scores to distinguishing passenger agents from mechanics by subtracting the B coefficient of convivial scores for mechanics from the B coefficient of convivial scores for passenger agents, 0.571 - 0.342 = 0.229. The odds ratio for passenger agents versus mechanics is EXP(0.229) = 1.258. Higher convivial scores increased the likelihood that a subject was a passenger agent rather than a mechanic, implying that passenger agents had higher average convivial scores than mechanics.
    The difference between the interpretation of convivial scores for the two analyses is a difference in emphasis rather than an inconsistency in the results of the analysis, even though we might not be able to reconcile the differences immediately.
  • Slide 36
  • Multinomial Logistic Regression Results versus Discriminant Analysis Results - 2
    Similarly, the logistic regression found that higher conservative scores decrease the likelihood that a subject is a passenger agent rather than operations control. This implies that the average conservative score for passenger agents was lower than the average for operations control, which is consistent with the discriminant finding that passenger agents had lower conservative scores than the other two groups.
    We can compare the contribution of conservative scores to distinguishing passenger agents from mechanics by subtracting the B coefficient of conservative scores for mechanics from the B coefficient of conservative scores for passenger agents, (-0.471) - (-0.338) = -0.133. The odds ratio for passenger agents versus mechanics is EXP(-0.133) = 0.875. Higher conservative scores decreased the likelihood that a subject was a passenger agent rather than a mechanic, implying that passenger agents had lower average conservative scores than mechanics.
    In sum, passenger agents had lower average conservative scores than both other groups, consistent with the findings of the discriminant analysis.
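The passenger agent versus mechanic comparisons on the last two slides rest on one identity: because both equations share operations control as the reference group, the difference between two B coefficients is the log odds ratio for one modeled group against the other. A brief Python sketch using the rounded coefficients quoted on the slides (the dictionary and function names are illustrative):

```python
import math

# Rounded B coefficients quoted on the slides (reference group: operations control).
b_passenger_agent = {"conviv": 0.571, "conserv": -0.471}
b_mechanic = {"conviv": 0.342, "conserv": -0.338}

def contrast_odds_ratio(b_first, b_second):
    """Odds ratio for the first group versus the second, per one-unit increase."""
    return math.exp(b_first - b_second)

for name in ("conviv", "conserv"):
    diff = b_passenger_agent[name] - b_mechanic[name]
    ratio = contrast_odds_ratio(b_passenger_agent[name], b_mechanic[name])
    print(f"{name}: B difference = {diff:.3f}, odds ratio = {ratio:.3f}")
```

Working from the rounded coefficients, the convivial odds ratio can differ from the slide's figure in the third decimal place; the direction of both effects is unchanged.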