© 2015 Hobsons Inc. | Proprietary and Confidential Page 1 of 18
Starfish Predictive Modeling
Erie Community College 10/23/15
Table of Contents
Methodology
Performance
Strongest Predictors
Observations/Comments
Comments about Data Quality
Figures
Starfish Predictive Modeling – Erie Community College
Methodology
Hobsons analyzed historical data provided by Erie Community College via the Starfish platform. The analysis includes three years of outcome data, beginning with the Fall 2011 term and ending with the Fall 2014 term. We developed a predictive model using a random sample of 20,000 semester registrations selected from that time period. The model predicts the outcome for a student who is enrolled in a given term. A positive outcome is defined as one where the student either graduates at the end of that term or begins another term of study after that term (i.e., is a persisting student). In practice, because the data is time-censored, we never know whether a student has permanently exited. A student is counted as non-persisting if there are no subsequent registrations in the database. We excluded data later than Fall 2014 to provide a one-year window in which to observe a student’s potential return. Students who had not returned (or graduated) by Fall 2015 are presumed to be non-persisting. From the sample of 20,000 semester registrations, 75.9% were characterized as returning and 24.1% as non-returning.
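As an illustration only, the outcome definition above could be expressed roughly as follows. This is a minimal sketch, not the procedure Hobsons actually ran: the term codes, the `label_persistence` function, and the shapes of its inputs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, chronologically ordered term codes; the report does not list them.
TERMS = ["2011FA", "2012SP", "2012FA", "2013SP", "2013FA",
         "2014SP", "2014FA", "2015SP", "2015FA"]
TERM_INDEX = {term: i for i, term in enumerate(TERMS)}


def label_persistence(registrations, graduations, last_modeled_term="2014FA"):
    """Label each (student, term) registration as persisting (True) or not (False).

    registrations: iterable of (student_id, term) pairs
    graduations:   set of (student_id, term) pairs where the student
                   graduated at the end of that term
    Registrations after last_modeled_term are left unlabeled because their
    one-year observation window has not elapsed.
    """
    terms_by_student = defaultdict(set)
    for student, term in registrations:
        terms_by_student[student].add(TERM_INDEX[term])

    cutoff = TERM_INDEX[last_modeled_term]
    labels = {}
    for student, term_indexes in terms_by_student.items():
        for i in term_indexes:
            if i > cutoff:
                continue  # too recent to observe a return
            graduated = (student, TERMS[i]) in graduations
            returned = any(j > i for j in term_indexes)
            labels[(student, TERMS[i])] = graduated or returned
    return labels
```

Note that a student's final observed registration is labeled non-persisting unless the student graduated that term, which mirrors the time-censoring caveat: the label means "no later registration in the database", not "permanently exited".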
We evaluated the performance of the model using 25,790 term-registration records that were not used in constructing the model.
We used a standard set of predictor variables derived from data in the Starfish database. We also incorporated any institution-provided user attributes whose values do not change over time.
We used the model obtained from the historic data to provide a predictive score for all currently enrolled students.
Performance
To determine predictive power we use the so-called c-statistic (sometimes known as the area under the ROC curve), which measures whether students with higher scores tend to persist more often than students with lower scores. The c-statistic is the probability that a randomly selected persisting student will have a higher predictive score than a randomly selected non-persisting student. A perfect predictor would have a c-statistic of 100%, while a random guesser would score 50%. The c-statistic on the validation set was 74%, roughly midway between the two. Although validation data was not used to construct the model, the validation data and the data used to build the model are drawn from the same statistical distribution (Fall 2011 to Fall 2014). The distribution for students attending in Fall 2015 may be different.
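The pairwise definition of the c-statistic can be computed directly, as in the sketch below. The function name and the sample scores are hypothetical, and counting tied scores as half a win is a common convention that we assume here; the report does not say how ties were handled.

```python
def c_statistic(persisting_scores, non_persisting_scores):
    """Probability that a random persisting student outscores a random
    non-persisting student, with ties counted as half (an assumption)."""
    wins = 0.0
    for p in persisting_scores:
        for n in non_persisting_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(persisting_scores) * len(non_persisting_scores))


# Hypothetical scores: 5 of the 6 pairs are ranked correctly.
example = c_statistic([0.9, 0.8, 0.6], [0.7, 0.4])
```

This brute-force version compares every pair and is quadratic in the number of records, so for a validation set of tens of thousands of records a rank-based computation would be preferable. It is shown only because it matches the verbal definition term for term.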
Figure 1 below shows the ability of the model to discriminate between students who persist and those who don’t. Specifically, it shows the distribution of scores that the model assigned to students who did (green) and did not (red) persist, without using any foreknowledge of the outcome. In a model with predictive power, students who persist tend to have higher scores than those who do not. Figure 1 exhibits this tendency: a randomly selected persisting student is more likely to have a score above 75% than a randomly selected non-persisting student.
Figure 1 - Distribution (probability density function) of predictive scores for both persisting and non-persisting students obtained from the validation set. Green indicates persisting and red non-persisting. The horizontal axis is the score, which ranges from 0 to 1. The vertical coordinate is a smoothed density estimate. The mean score is 0.759 (i.e., 75.9%). A score above 75.9% is above average and has a lower than average risk. Students with scores below 50% (about 9.6% of the validation-set students) have a relatively high risk of not returning for another term. Students with scores in the range of 50% to 75.9% are at an elevated risk, but more likely than not, they will persist for an additional term.
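The three score ranges described in the Figure 1 caption can be written as a simple lookup. This is a sketch only: the `risk_band` name and the band labels are hypothetical, and the handling of scores exactly at 50% or 75.9% is an assumption, since the caption does not say which band the boundaries belong to.

```python
def risk_band(score, mean_score=0.759):
    """Map a predictive score in [0, 1] to the risk range described for Figure 1."""
    if score < 0.5:
        return "high"                # relatively high risk of not returning
    if score <= mean_score:          # boundary handling is an assumption
        return "elevated"            # still more likely than not to persist
    return "lower than average"      # above-average score
```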
Strongest Predictors
Our models are nonlinear regressions, so the predictor variables do not have coefficients of the kind used in linear regression models. Instead, the strength of each predictor is measured by a “variable importance factor” that reflects how much that variable reduces modeling error. It is a rough guide to the importance of individual attributes. The 10 predictor variables with the highest variable importance are listed in Table 1. Here, the variable importance scores have been normalized so that they sum to 100%, and these top 10 represent about 41% of the total. Two of the institution-provided attributes appeared to have a significant variable importance score: “EDUCATION GOALS” with an importance of 0.027 and “EMPLOYMENT” with an importance of 0.024.
Table 1 - Important Predictive Variables with Measures of their Importance
Variable Description                              Variable Importance
Cumulative GPA (at start of term)                 0.048
Program                                           0.048
Age Entering the Institution                      0.045
Age Entering the Program                          0.045
Current Age                                       0.043
Attempted Hours                                   0.040
Cumulative Quality Points (at start of term)      0.038
GPA (prior term)                                  0.035
Quality Points (last term)                        0.034
Credential                                        0.032
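The normalization and the "about 41%" figure can be checked with a few lines of arithmetic. In the sketch below the helper names are hypothetical; the `TABLE_1` values are copied from Table 1, which the report says are already normalized across all predictors, so their sum is the top-10 share.

```python
def normalize_importances(raw_importances):
    """Rescale raw importance scores so that they sum to 1 (i.e., 100%)."""
    total = sum(raw_importances.values())
    return {name: value / total for name, value in raw_importances.items()}


def top_k_share(importances, k):
    """Share of total importance carried by the k most important variables."""
    ranked = sorted(importances.values(), reverse=True)
    return sum(ranked[:k])


# Values from Table 1 (already normalized over the full predictor set).
TABLE_1 = {
    "Cumulative GPA (at start of term)": 0.048,
    "Program": 0.048,
    "Age Entering the Institution": 0.045,
    "Age Entering the Program": 0.045,
    "Current Age": 0.043,
    "Attempted Hours": 0.040,
    "Cumulative Quality Points (at start of term)": 0.038,
    "GPA (prior term)": 0.035,
    "Quality Points (last term)": 0.034,
    "Credential": 0.032,
}

table_1_share = top_k_share(TABLE_1, 10)  # about 0.408, i.e. roughly 41%
```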
Figures 2 through 13 that follow compare model predictions to actual outcomes for these 10 variables (with the exception of the program) and a few additional variables. Each dot represents a group of students with similar values of the independent variable, evaluated on validation data. The horizontal position of each dot is the average of the independent variable for the group, and the vertical position is the persistence rate: the green dots show the actual outcome and the red dots show the outcome predicted by the model. No student/term combination that was used to construct the model was used as validation data, so the model had no foreknowledge of the outcome for that term. The area of each dot is proportional to the sample size. The variance goes up as the sample size goes down. In other words, the persistence rate cannot be estimated with the same precision for smaller samples as for larger ones.
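The grouping behind these plots can be sketched as follows. The report does not say how students were grouped, so the fixed-width bins, the function names, and the record layout are assumptions made for illustration. The standard error shown is the usual binomial formula, which makes the point about sample size concrete.

```python
import math
from collections import defaultdict


def standard_error(rate, n):
    """Binomial standard error of an observed rate from n records."""
    return math.sqrt(rate * (1.0 - rate) / n)


def binned_rates(records, bin_width):
    """Summarize validation records by bins of one independent variable.

    records: iterable of (x, persisted, predicted_score) tuples, where
             persisted is a bool and predicted_score is in [0, 1].
    Returns one row per non-empty bin, ordered by x.
    """
    bins = defaultdict(list)
    for x, persisted, score in records:
        bins[math.floor(x / bin_width)].append((x, persisted, score))

    rows = []
    for key in sorted(bins):
        group = bins[key]
        n = len(group)
        actual = sum(1 for _, persisted, _ in group if persisted) / n
        rows.append({
            "mean_x": sum(x for x, _, _ in group) / n,   # horizontal position
            "actual": actual,                             # green dot
            "predicted": sum(s for _, _, s in group) / n, # red dot
            "n": n,                                       # dot area
            "std_error": standard_error(actual, n),
        })
    return rows
```

For the same observed rate of 75%, a group of 400 records has a standard error of about 2 percentage points while a group of 25 has about 9, which is why the small dots scatter more widely around the model prediction.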
The student’s program is a significant factor, partly because Non-Matriculated students form a large group with a relatively low persistence rate. Table 2 shows the model-predicted and actual persistence rates for the 18 largest programs.
Observations/Comments
Age is an important factor. Students who begin a program when their age is in a range of about 22 to 23 tend to persist less than those who are older or younger (Figures 3 and 4). Students who are currently in this age range (who may have entered at a younger age) are also less likely to persist (Figure 5). Persistence appears to improve as students move beyond this critical age.
Another critical period is either the first year or the first 30 hours of study. The likelihood of persistence steadily increases during this period. This trend appears in several graphs: Figure 7, which shows persistence plotted against cumulative quality points; Figure 11, which shows persistence plotted against time at the institution; and Figure 12, which shows persistence plotted against cumulative earned credit hours.
Persistence decreases when the cumulative GPA or the prior term GPA falls below 2.0. Persistence increases as the credit-hour load increases from 0 to about 16 hours (Figure 6).
Comments about Data Quality
The quality of the predictions may be limited by the quality of the data. We have made an effort to detect and mitigate data quality problems where possible. However, not all of the irregularities have been corrected. As one example, term GPAs and cumulative GPAs were reported as zero rather than null for students who had no classes. This makes it difficult to distinguish a student who performed poorly (an earned GPA of zero) from a student who took no classes. As another example, we observed that some (but not all) students earned credit (per the student term status file) for courses they had failed (per the course outcome file). The methods we use are very robust and should cope with these cases, provided that there is no bias in the data. Bias would be introduced, for example, by using a different reporting process for historical student data than for current student data. It could also be introduced by modifying, after the fact, historical data for students who have left but not for those who were retained.
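One possible way to handle the zero-versus-null GPA ambiguity is sketched below. It is not a description of the mitigation Hobsons applied: it assumes, hypothetically, that a separate attempted-hours value is available and reliable for the same term, and the `clean_gpa` name is ours.

```python
def clean_gpa(gpa, attempted_hours):
    """Return None when a zero GPA most likely means "no classes taken".

    Assumption: zero attempted hours identifies a student who had no classes,
    so a reported GPA of zero carries no information in that case.
    """
    if gpa == 0 and attempted_hours == 0:
        return None   # treat as missing rather than as poor performance
    return gpa        # a zero GPA with attempted hours is a genuine result
```

A rule like this only helps if it is applied identically to historical and current records; applying it to one and not the other would introduce exactly the kind of bias described above.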
Table 2 - Predicted and Actual Persistence Rates by Program (large programs only). The average persistence rate (predicted and actual, all programs) was about 75.9%.
Program                                         Actual Persistence   Predicted Persistence   Sample Size
Non-Matriculated                                57.6%                54.4%                   3673
Bus-Business Administration                     73.1%                78.7%                   759
Early Childhood                                 73.8%                77.3%                   325
Physical Education Studies                      74.6%                77.8%                   268
Culinary Arts                                   75.3%                78.9%                   344
Mntl Hlth Ass't-Substance Abuse                 76.0%                79.4%                   312
Business Administration                         76.4%                78.4%                   1428
Lib Arts & Sci-Hum & Soc Sci.                   76.5%                77.4%                   1313
Criminal Justice                                76.7%                78.0%                   1000
Lib Arts & Sci-Mathematics & Sci.               77.8%                79.4%                   609
Lib Arts & Sci-General Studies                  78.2%                78.7%                   8206
Criminal Justice: Law Enforcement               78.7%                79.3%                   567
Engineering Science                             81.4%                80.7%                   376
Automotive Technology                           81.6%                79.9%                   412
Information Technology                          82.3%                80.7%                   277
Communication & Media Arts-Communication Arts   83.3%                81.3%                   324
Paralegal                                       83.4%                82.7%                   271
Nursing                                         94.4%                91.0%                   554
Figures
Figure 2 - Predicted and Actual Persistence Rates vs Cumulative GPA (at the beginning of the term of attendance). Each dot represents a group of students having approximately the same GPA. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 3 - Predicted and Actual Persistence Rates vs Age Entering the Institution. Each dot represents a group of students having approximately the same age. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 4 - Predicted and Actual Persistence Rates vs Age Entering the Program. Each dot represents a group of students having approximately the same age. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 5 - Predicted and Actual Persistence Rates vs Current Age (at the beginning of the term of attendance). Each dot represents a group of students having approximately the same age. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 6 - Predicted and Actual Persistence Rates vs Attempted Hours. Each dot represents a group of students having approximately the same attempted hours. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 7 - Predicted and Actual Persistence Rates vs Cumulative Quality Points (at the beginning of the term of attendance). Each dot represents a group of students having approximately the same quality points. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 8 - Predicted and Actual Persistence Rates vs Term GPA (prior term). Each dot represents a group of students having approximately the same term GPA. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 9 - Predicted and Actual Persistence Rates vs Term Quality Points (prior term). Each dot represents a group of students having approximately the same term quality points. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 10 - Predicted and Actual Persistence Rates vs Credential Sought. The green bars represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 11 - Predicted and Actual Persistence Rates vs Time at Institution (as of the beginning of the term of attendance). Each dot represents a group of students having approximately the same time at the institution. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 12 - Predicted and Actual Persistence Rates vs Cumulative Earned Hours (as of the beginning of the term of attendance). Each dot represents a group of students having approximately the same earned hours. The green dots represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.
Figure 13 - Predicted and Actual Persistence Rates vs Education Goals Attribute. The green bars represent actual outcomes for students during a term in which they were enrolled. The red dots represent the model prediction for the same group of students. Student/term combinations used to build the model were excluded from the validation data. However, the validation data comes from the same time period as the data used to construct the model so it has the same statistical distribution.