A comparative analysis of machine learning techniques for student retention management
Decision Support Systems 49 (2010) 498–506
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/dss
Dursun Delen
Spears School of Business, Department of Management Science and Information Systems, Oklahoma State University, Tulsa, Oklahoma 74106, USA
Tel.: +1 918 594 8283. E-mail address: firstname.lastname@example.org.
Article history: Received 25 March 2010; Received in revised form 5 May 2010; Accepted 12 June 2010; Available online 17 June 2010
Keywords: Retention management; Student attrition; Classification; Prediction; Machine learning; Sensitivity analysis
Abstract
Student retention is an essential part of many enrollment management systems. It affects university rankings, school reputation, and financial wellbeing. Student retention has become one of the most important priorities for decision makers in higher education institutions. Improving student retention starts with a thorough understanding of the reasons behind the attrition. Such an understanding is the basis for accurately predicting at-risk students and appropriately intervening to retain them. In this study, using five years of institutional data along with several data mining techniques (both individuals as well as ensembles), we developed analytical models to predict and to explain the reasons behind freshmen student attrition. The comparative analyses results showed that the ensembles performed better than individual models, while the balanced dataset produced better prediction results than the unbalanced dataset. The sensitivity analysis of the models revealed that the educational and financial variables are among the most important predictors of the phenomenon.
© 2010 Elsevier B.V. All rights reserved.
0167-9236/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2010.06.003
1. Introduction
Student attrition has become one of the most challenging problems for decision makers in academic institutions. In spite of all of the programs and services to help retain students, according to the U.S. Department of Education, Center for Educational Statistics (nces.ed.gov), only about half of those who enter higher education actually earn a bachelor's degree. Enrollment management and the retention of students has become a top priority for administrators of colleges and universities in the U.S. and other developed countries around the world. High dropout of students usually results in overall financial loss, lower graduation rates, and inferior school reputation in the eyes of all stakeholders. The legislators and policymakers who oversee higher education and allocate funds, the parents who pay for their children's education in order to prepare them for a better future, and the students who make college choices look for evidence of institutional quality and reputation to guide their decision-making processes.
In order to improve student retention, one should try to understand the non-trivial reasons behind the attrition. To be successful, one should also be able to accurately identify those students that are at-risk of dropping out. So far, the vast majority of student attrition research has been devoted to understanding this complex, yet crucial, social phenomenon. Even though these qualitative, behavioral, and survey-based studies revealed invaluable insight by developing and testing a wide range of theories, they do not provide the much needed instrument to accurately predict (and potentially improve) the student attrition ([27,44]).
In this project we propose a quantitative research approach where the historical institutional data from student databases are used to develop models that are capable of predicting as well as explaining the institution-specific nature of the attrition problem. Though the concept is relatively new to higher education, for almost a decade now, similar problems in the field of marketing have been studied using predictive data mining techniques under the name of "churn analysis," where the purpose is to identify among the current customers who are most likely to leave the company so that some kind of intervention process can be executed for the ones who are worthwhile to retain. Retaining existing customers is crucial because the related research shows that acquiring a new customer costs roughly ten times more than keeping the one that you already have.
2. Literature review
Despite steadily rising enrollment rates in U.S. postsecondary institutions, weak academic performance and high dropout rates remain persistent problems among undergraduates ([7,42]). For academic institutions, high attrition rates complicate enrollment planning and place added burdens on efforts to recruit new students. For students, dropping out before earning a terminal degree represents untapped human potential and a low return on their investment in college. Poor academic performance is often indicative of difficulties in adjusting to college and makes dropping out more likely.
Traditionally, student attrition at a university has been defined as the number of students who do not complete a degree in that institution.
Studies have shown that the vast majority of students who withdraw do so during their first year of college rather than during the rest of their higher education ([10,16]). Since most student dropouts occur at the end of the first year (the freshmen year), many student retention/attrition studies (including this study) have focused on first-year dropouts (or the number of students not returning for the second year). This definition of attrition does not differentiate the students who may have transferred to other universities and obtained their degrees there. It only considers the students dropping out at the end of the first year voluntarily and not by academic dismissal.
Research on student retention has traditionally been survey driven (e.g., surveying a student cohort and following them for a specified period of time to determine whether they continue their education). Using such a design, researchers worked on developing and validating theoretical models including the famous student integration model developed by Tinto. Elaborating on Tinto's theory, others have also developed student attrition models using survey-based research studies ([2,3]). Even though they have laid the foundation for the field, these survey-based research studies have been criticized for their lack of generalized applicability to other institutions and the difficulty and costliness of administering such large-scale survey instruments. An alternative (and/or a complementary) approach to the traditional survey-based retention research is an analytic approach where the data commonly found in institutional databases is used. Educational institutions routinely collect a broad range of information about their students, including demographics, educational background, social involvement, socioeconomic status, and academic progress. A comparison between the data-driven and survey-based retention research showed that they are comparable at best, and to develop a parsimonious logistic regression model, data-driven research was found to be superior to its survey-based counterpart. But in reality, these two research techniques (one driven by surveys and theories, the other driven by institutional data and analytic methods) complement and help each other. That is, the theoretical research may help identify important predictor variables to be used in analytical studies, while analytical studies may reveal novel relationships among the variables, which may lead to the development of new theories and the betterment of existing ones.
A number of academic, socioeconomic, and other related factors are associated with attrition. According to Wetzell et al., universities which have a more open admission policy and where there is no substantial waiting list of applicants and transfers face more serious student attrition problems than universities with surplus applicants. On the other hand, Hermanowicz found that more selective universities do not necessarily have higher graduation rates; rather, other factors not directly associated with selectivity can, in principle, come into play. In addition to the structural sides of universities (e.g., admission and prestige of school), the cultural side (e.g., norms and values that guide communities) should receive equal attention because a higher rate of retention is often achieved when students find the environment in their university to be highly correlated with their interests.
In related research, Astin determined that the persistence or the retention rate of students is greatly affected by the level and quality of their interactions with peers as well as faculty and staff. Tinto indicates that the factors in students' dropping out include academic difficulty, adjustment problems, lack of clear academic and career goals, uncertainty, lack of commitment, poor integration with the college community, incongruence, and isolation. Consequently, retention can be highly affected by enhancing student interaction with campus personnel. Especially for first-generation college students, the two critical factors in students' decisions to remain enrolled until the attainment of their goals are their successfully making the transition to college, aided by initial and extended orientation and advisement programs, and making positive connections with college personnel during their first term of enrollment.
According to Tinto's theory of student integration, past and current academic success is a key component in determining attrition. High school GPA and total SAT scores provide insight into potential academic performance of the freshmen and have been shown to have a strong positive effect on persistence ([29,41]). Similarly, and probably more importantly, first semester GPA has been shown to correlate strongly with retention ([29,43]). In this study we used these academic success indicators.
Institutional and goal commitment are found to be significant predictors of student retention. Undecided students may not have the same level of mental strength and goal commitment as the students who are more certain of their career path. Therefore, as a pseudo measure of academic commitment, declaration of college major and credit hours carried in the first semester are included in this study. Additionally, students' residency status (classified as either in-state or out-of-state) may be an indicator of social and emotional connectedness as well as better integration with the culture of the institution. Students coming from another state may have less familial interaction, which may amplify the feelings of isolation and homesickness.
Several previous studies investigated the effect of financial aid on student retention ([17,18,38]). In these studies, the type of financial aid was found to be a determinant of student attrition behavior. Students receiving aid based on academic achievement have higher retention rates. Hochstein and Butler found that grants are positively associated with student retention while loans have a negative effect. Similarly, Herzog found that the Millennium Scholarship as well as other scholarships help students stay enrolled, while losing these scholarships because of insufficient grades or credits raises dropout or transfer rates.
In this study, using five years of freshmen student data (obtained from the university's existing databases) along with several data mining techniques (both individual as well as ensembles), we developed analytical models to predict freshmen attrition. In order to better explain the phenomenon (identify the relative importance of variables), we conducted sensitivity analysis on the developed models. The main goal of this research was therefore twofold: (1) develop models to correctly identify the freshmen students who are most likely to drop out after their freshmen year, and (2) identify the most important variables by applying sensitivity analyses on developed models. The models that we developed are formulated in such a way that the prediction occurs at the end of the first semester (usually at the end of fall semester), so that decision makers can properly craft intervention programs during the next semester (the spring semester) to retain the at-risk students.
In this research, we followed a popular data mining methodology called CRISP-DM (Cross Industry Standard Process for Data Mining), which is a six-step process: (1) understanding the domain and developing the goals for the study, (2) identifying, accessing and understanding the relevant data sources, (3) pre-processing, cleaning, and transforming the relevant data, (4) developing models using comparable analytical techniques, (5) evaluating and assessing the validity and the utility of the models against each other and against the goals of the study, and (6) deploying the models for use in decision-making processes. This popular methodology provides a systematic and structured way of conducting data mining studies, thereby increasing the likelihood of obtaining accurate and reliable results. The attention paid to the earlier steps in CRISP-DM (i.e., understanding the domain of study, understanding data and preparing the data) sets the stage for a successful data mining study. Roughly 80% of the total project time is usually spent on these first three steps.
Fig. 1. Data mining and cross validation process.
In this study, to estimate the performance of the prediction models, a 10-fold cross-validation approach was used (see Eq. (1), where the value of k is set to 10). Empirical studies showed that 10 seems to be an optimal number of folds (one that optimizes the time it takes to complete the test while minimizing the bias and variance associated with the validation process). In 10-fold cross-validation the entire dataset is divided into 10 mutually exclusive subsets (or folds). Each fold is used once to test the performance of the prediction model that is generated from the combined data of the remaining nine folds, leading to 10 independent performance estimates.
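The 10-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the helper names (k_fold_indices, cross_validate), the majority-class stand-in classifier, and the toy data are all hypothetical, and averaging the per-fold accuracies is assumed here as one common way to summarize the 10 estimates.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Stride slicing keeps the fold sizes within one of each other.
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return one accuracy estimate per fold; each fold is the test set exactly once."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        # The model is trained on the combined data of the remaining k-1 folds.
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies

# Hypothetical stand-in classifier: always predicts the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

# Toy data, fabricated for illustration only: 1 = retained, 0 = dropped out.
X = [[i] for i in range(100)]
y = [1] * 80 + [0] * 20

fold_accuracies = cross_validate(X, y, fit_majority, predict_majority, k=10)
overall_accuracy = mean(fold_accuracies)
```

Any classifier can be plugged in through the fit and predict arguments; the majority-class model is used only to keep the sketch self-contained.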
A pictorial depiction of this evaluation process is shown in Fig. 1.With this experimental design, if the k is set to 10 (which is the case in
The Earned/Registered hours variable was created to better represent the students' resiliency and determination in the first semester of their freshmen year. Intuitively, one would expect greater values for this variable to have a positive impact on retention. The YearsAfterHighSchool variable was created to measure the impact of the time taken between high school graduation and initial college enrollment. Intuitively, one would expect this variable to be a contributor to the prediction of attrition. These aggregations and derived variables are determined based on a number of experiments conducted for a...