


IEEE JOURNAL OF LATIN-AMERICAN LEARNING TECHNOLOGIES, VOL. 8, NO. 1, FEBRUARY 2013 7

Predicting School Failure and Dropout by Using Data Mining Techniques

Carlos Márquez-Vera, Cristóbal Romero Morales, and Sebastián Ventura Soto

Abstract—This paper proposes to apply data mining techniques to predict school failure and dropout. We use real data on 670 middle-school students from Zacatecas, México, and employ white-box classification methods, such as induction rules and decision trees. Experiments attempt to improve their accuracy for predicting which students might fail or drop out by first, using all the available attributes; next, selecting the best attributes; and finally, rebalancing data and using cost-sensitive classification. The outcomes have been compared and the models with the best results are shown.

Index Terms—Classification, dropout, educational data mining (EDM), prediction, school failure.

I. INTRODUCTION

RECENT years have shown a growing interest and concern in many countries about the problem of school failure and the determination of its main contributing factors [1]. A great deal of research [2] has been done on identifying the factors that affect the low performance of students (school failure and dropout) at different educational levels (primary, secondary, and higher) using the large amount of information that current computers can store in databases. All these data are a “gold mine” of valuable information about students. Identifying and finding useful information hidden in large databases is a difficult task [3]. A very promising solution to achieve this goal is the use of knowledge discovery in databases, or data mining, techniques in education, called educational data mining (EDM) [4]. This new area of research focuses on the development of methods to better understand students and the settings in which they learn [5]. In fact, there are good examples of how to apply EDM techniques to create models that predict dropping out and student failure specifically [6]. These works have shown promising results with respect to those sociological, economic, or educational characteristics that may be most relevant in the prediction of low academic performance [7]. It is also important to notice that most of the research on the application of EDM to the problems of student failure and dropout has been applied primarily to the specific case of higher education [8] and more specifically to online or

Manuscript received December 19, 2012; revised January 17, 2013; accepted January 18, 2013. Date of publication February 14, 2013; date of current version February 26, 2013. This work was supported by the Regional Government of Andalucia and the Ministry of Science and Technology under Project TIN-2011-22408 and Project P08-TIC-3720.

C. Márquez-Vera is with the Autonomous University of Zacatecas, Zacatecas 98000, México (e-mail: [email protected]).

C. R. Morales and S. V. Soto are with the University of Cordoba, Córdoba 14071, Spain (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/RITA.2013.2244695

distance education [9]. However, very little information about specific research on elementary and secondary education has been found, and what has been found uses only statistical methods, not DM techniques [10].

There are several important differences and advantages in applying data mining compared with using only statistical models [11]:

1) Data mining is a broad process that consists of several stages and includes many techniques, among them statistics. This knowledge discovery process comprises the steps of pre-processing, the application of DM techniques, and the evaluation and interpretation of the results.

2) Statistical techniques (data analysis) are often used as a quality criterion of the verisimilitude of the data given the model. DM uses a more direct approach, such as using the percentage of correctly classified data.

3) In statistics, the search is usually done by modeling based on a hill-climbing algorithm in combination with a likelihood-ratio hypothesis test. DM often uses a meta-heuristic search.

4) DM is aimed at working with very large amounts of data (millions or billions of records). Statistics does not usually work well in large databases with high dimensionality.

This study proposes to predict student failure at school in middle or secondary education by using DM. In fact, we want to detect the factors that most influence failure in young students by using classification techniques. We also propose the use of different DM techniques because this is a complex problem: the data have high dimensionality (there are many factors that can have an influence) and are often highly imbalanced (the majority of students pass and very few fail). The final objective is to detect as early as possible the students who show these factors in order to provide some type of assistance to try to avoid and/or reduce school failure.

The paper is organized as follows: Section II presents our proposed method for predicting school failure. Section III describes the data used and the information sources from which they were gathered. Section IV describes the data preprocessing step. Section V describes the different experiments carried out and the results obtained. Section VI presents the interpretation of our results, and Section VII summarizes the main conclusions and future research.

II. METHOD

The method proposed in this paper for predicting the academic failure of students belongs to the process of Knowledge

1932-8540/$31.00 © 2013 IEEE


[Fig. 1 shows the proposed flow: data gathering produces a dataset from the factors to use; pre-processing applies attribute selection and data balancing; data mining applies classification and cost-sensitive classification; interpretation yields the final model.]

Fig. 1. Method proposed for the prediction of student failure.

Discovery and Data Mining (see Fig. 1). The main stages of the method are:

1) Data gathering. This stage consists of gathering all available information on students. To do this, the set of factors that can affect the students' performance must be identified and collected from the different sources of data available. Finally, all the information should be integrated into a dataset.

2) Pre-processing. At this stage the dataset is prepared for the application of data mining techniques. To do this, traditional pre-processing methods such as data cleaning, transformation of variables, and data partitioning have to be applied. Other techniques, such as attribute selection and the re-balancing of data, have also been applied in order to solve the problems of high dimensionality and imbalanced data that typically appear in these datasets.

3) Data mining. At this stage, DM algorithms are applied to predict student failure as a classification problem. For this task, we propose to use classification algorithms based on rules and decision trees. These are “white box” techniques that generate easily interpretable models. In addition, a cost-sensitive classification approach is also used in order to address the imbalanced data problem. Finally, different algorithms have been executed, evaluated, and compared in order to determine which one obtains the best results.

4) Interpretation. At this stage, the obtained models are analyzed to detect student failure. To achieve this, the factors that appear (in the rules and decision trees) and how they are related are considered and interpreted.

Next, we describe a case study with real data from Mexican students in order to show the utility of the proposed method.
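The four stages above can be sketched as a minimal data-flow skeleton. The function names and the dictionary layout below are illustrative assumptions, not the authors' actual implementation:

```python
def gather_data(sources):
    """Stage 1: integrate records from all information sources into one
    dataset, keyed by student, merging the attributes from each source."""
    dataset = {}
    for source in sources:
        for student_id, attributes in source.items():
            dataset.setdefault(student_id, {}).update(attributes)
    return dataset

def preprocess(dataset, selected_attributes):
    """Stage 2: keep only students with complete information for the
    selected attributes, and only those attributes."""
    return {sid: {a: attrs[a] for a in selected_attributes}
            for sid, attrs in dataset.items()
            if all(a in attrs for a in selected_attributes)}

# Stages 3 and 4 (mining and interpretation) would apply a white-box
# classifier to the preprocessed data and inspect the resulting rules.
```

Students missing any attribute are dropped in stage 2, mirroring the elimination of incomplete records described in Section IV.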

III. DATA GATHERING

School failure is also known as the “one thousand factors problem” [12], due to the large number of risk factors or student characteristics that can influence it, such as demographic, cultural, social, family, or educational background, socioeconomic status, psychological profile, and academic progress.

In this paper, we have used information about high school students enrolled in Program II of the Academic Unit Preparatory at the Autonomous University of Zacatecas (UAPUAZ) for the 2009/10 academic year. In the Mexican educational system, high school corresponds to the preparatory years, when students are aged 15-18. High school offers a three-year education program that provides the student with general knowledge to continue studying at university. We have only used information about first-year high school students, where most students are between the ages of 15 and 16, as this is the year with the highest failure rate. All the information used in this study was gathered from three different sources during the period from August to December 2010:

1) A specific survey was designed and administered to all students in the middle of the course. Its purpose was to obtain personal and family information to identify some important factors that could affect school performance.

2) A general survey that students complete when they register with the National Evaluation Center (CENEVAL) for admission to many institutions of secondary and higher education.

3) The Department of School Services provides the scoresobtained by students in all subjects of the course.

Table I shows all the variables used in this study, grouped by data source.

IV. DATA PRE-PROCESSING

Before applying DM algorithms, it is necessary to carry out some pre-processing tasks such as cleaning, integration, discretization, and variable transformation [13]. It must be pointed out that a very important task in this work was data pre-processing, because the quality and reliability of the available information directly affects the results obtained. In fact, some specific pre-processing tasks were applied to prepare all the previously described data so that the classification task could be carried out correctly. Firstly, all available data were integrated into a single dataset. During this process, students without 100% complete information were eliminated: all students who did not answer our specific survey or the CENEVAL survey were excluded. Some modifications were also made to the values of some attributes. For example, words that contained the letter “Ñ” were replaced by “N”. A new attribute for the age of each student in years was created using the day, month, and year of birth of each student. Furthermore, the continuous variables were transformed into discrete variables, which provide a much more comprehensible view of the data. For example, the numerical values of the scores obtained by students in each subject were changed to categorical values in the following way:


TABLE I

VARIABLES USED AND INFORMATION SOURCES

Specific survey: classroom/group, number of students in group, attendance during morning/evening sessions, number of friends, number of hours spent studying daily, methods of study used, place normally used for studying, having one's own space for studying, resources for study, study habits, studying in a group, parental encouragement for study, marital status, having any children, religion, having administrative sanctions, the type of degree selected, the influence on the degree selected, the type of personality, having a physical disability, suffering a critical illness, regular consumption of alcohol, smoking habits, family income level, having a scholarship, having a job, living with one's parents, mother's level of education, father's level of education, number of brothers/sisters, position as the oldest/middle/youngest child, living in a large city, number of years living in the city, transport method used to go to school, distance to the school, level of attendance during classes, level of boredom during classes, interest in the subjects, level of difficulty of the subjects, level of motivation, taking notes in class, methods of teaching, too heavy a demand of homework, quality of school infrastructure, having a personal tutor, and level of the teacher's concern for the welfare of each student.

CENEVAL: age, sex, previous school, type of school, type of secondary school, Grade Point Average (GPA) in secondary school, mother's occupation, father's occupation, number of family members, limitations for doing exercises, frequency of exercises, time spent doing exercises, score obtained in Logic, score in Math, score in Verbal Reasoning, score in Spanish, score in Biology, score in Physics, score in Chemistry, score in History, score in Geography, score in Civics, score in Ethics, score in English, and average score in the EXANI I.

Department of School Services: score in Math 1, score in Physics 1, score in Social Science 1, score in Humanities 1, score in Writing and Reading 1, score in English 1, and score in Computer 1.

Excellent: score between 9.5 and 10; Very good: score between 8.5 and 9.4; Good: score between 7.5 and 8.4; Regular: score between 6.5 and 7.4; Sufficient: score between 6.0 and 6.4; Poor: score between 4.0 and 5.9; Very poor: score less than 4.0; and Not presented.
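As an illustration, the pre-processing steps just described (the “Ñ” substitution, the derived age attribute, and the score discretization) might look as follows. The function names are ours, and the handling of scores falling between two bands (e.g. 6.45) via greater-than-or-equal cutoffs is an assumption of this sketch:

```python
from datetime import date

def normalize(text):
    # Replace "Ñ" by "N", as described in the pre-processing step.
    return text.replace("Ñ", "N").replace("ñ", "n")

def age_in_years(birth, on):
    # Completed years on the reference date; subtract one if the
    # birthday has not yet occurred in that year.
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years

def discretize_score(score):
    # Map a 0-10 numeric score to the categorical bands above;
    # None stands for a score that was not presented.
    if score is None:
        return "Not presented"
    if score >= 9.5:
        return "Excellent"
    if score >= 8.5:
        return "Very good"
    if score >= 7.5:
        return "Good"
    if score >= 6.5:
        return "Regular"
    if score >= 6.0:
        return "Sufficient"
    if score >= 4.0:
        return "Poor"
    return "Very poor"
```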

Then, all the information was integrated into a single dataset and saved in the .ARFF format of Weka [14]. Next, the whole dataset was divided randomly into 10 pairs of training and test data files (maintaining the original class distribution). This way, each classification algorithm can be evaluated using stratified tenfold cross-validation. After preprocessing, we have a dataset with 77 attributes/variables for 670 students.
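A stratified k-fold split like the one described (Weka performs this internally when running cross-validation) can be sketched in a few lines; `stratified_folds` is an illustrative helper, not a Weka facility:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds while preserving the class
    proportions: indices are grouped by class, shuffled, and dealt
    round-robin into the folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds
```

With the 610 Pass / 60 Fail distribution of this dataset, every fold receives 61 Pass and 6 Fail students.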

However, our dataset has two typical problems that normally appear in these types of educational data. On the one hand, our dataset has high dimensionality; that is, the number of attributes or features is very large. Further, given a large number of attributes, some will usually not be meaningful for classification, and it is likely that some attributes are correlated. On the other hand, the data are imbalanced; that is, the majority of students (610) passed and a minority (60) failed. The problem with imbalanced data arises because learning algorithms tend to overlook less frequent classes (minority classes) and only pay attention to the most frequent ones (majority classes). As a result, the classifier obtained will not be able to correctly classify data instances corresponding to poorly represented classes.

We decided to carry out a feature selection study to try to identify which features have the greatest effect on our output variable (academic status). There is a wide range of attribute selection algorithms that can be grouped in different ways. One popular categorization distinguishes algorithms by the way they evaluate attributes: filters, which select and evaluate features independently of the learning algorithm, and wrappers, which use the performance of a classifier (learning algorithm) to determine the desirability of a subset [15]. Weka provides several feature selection algorithms, from which we selected the following ten: CfsSubsetEval, ChiSquaredAttributeEval, ConsistencySubsetEval, FilteredAttributeEval, OneRAttributeEval, FilteredSubsetEval, GainRatioAttributeEval, InfoGainAttributeEval, ReliefFAttributeEval, and SymmetricalUncertAttributeEval. Table II shows the results of applying these 10 feature selection algorithms. To rank the attributes, we counted the number of times each attribute was selected by one of the algorithms. Table III shows the frequency of each attribute. We selected only the attributes with a frequency greater than or equal to two, that is, attributes selected by at least two algorithms. In this way, we reduced the dimensionality of our dataset from the original 77 attributes to only the best 15 attributes.
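This frequency-based vote over the ten selection algorithms can be sketched as follows; `select_by_vote` is a hypothetical helper name, not part of Weka:

```python
from collections import Counter

def select_by_vote(selections, min_votes=2):
    """Rank attributes by how many selection algorithms chose them,
    keeping those chosen by at least `min_votes` algorithms.
    `selections` is a list of attribute lists, one per algorithm."""
    votes = Counter(attr for chosen in selections for attr in set(chosen))
    return [attr for attr, n in votes.most_common() if n >= min_votes]
```

Applied to the outputs in Table II, this reproduces the frequency ranking of Table III.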

Finally, we have mentioned that our dataset is imbalanced. The problem of imbalanced data classification occurs when the number of instances in one class is much smaller than the number of instances in another class or classes. Traditional classification algorithms have been developed to maximize the overall accuracy rate, which is independent of the class distribution; this means that classifiers are biased toward the majority class in the training stage, which leads to low sensitivity in classifying minority-class elements at the test stage. One way to address this problem is to act during the pre-processing of the data by carrying out a sampling or balancing of the class distribution. There are several data balancing or rebalancing algorithms; one that is widely used, and that is available in Weka as a supervised data filter, is SMOTE (Synthetic Minority Over-sampling Technique). In the SMOTE algorithm, the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any or all of the k minority class nearest neighbors. Depending on the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen [16]. In our case, only the training files (with the best 15 attributes) have been


TABLE II

BEST ATTRIBUTES SELECTED

CfsSubsetEval: Physical disability; Age; Score in Math 1; Score in Physics 1; Score in Social Science 1; Score in Humanities 1; Score in Writing and Reading 1; Score in English 1; Score in Computer 1.

ChiSquaredAttributeEval: Score in Humanities 1; Score in English 1; Score in Social Science 1; Score in Physics 1; Score in Math 1; Score in Computer 1; Score in Writing and Reading 1; Level of motivation.

ConsistencySubsetEval: Classroom/group; Time spent doing exercises; Score in Humanities 1; Score in English 1; Studying in a group.

FilteredAttributeEval: Score in Humanities 1; Score in English 1; Score in Math 1; Score in Social Science 1; Score in Physics 1; Score in Writing and Reading 1; Score in Computer 1; Level of motivation; GPA in secondary school; Score in History; Classroom/group; Average score in EXANI I.

FilteredSubsetEval: Score in Math 1; Score in Physics 1; Score in Social Science 1; Score in Humanities 1; Score in Writing and Reading 1; Score in English 1; Score in Computer 1.

GainRatioAttributeEval: Score in Math 1; Score in Humanities 1; Score in English 1; Score in Social Science 1; Score in Physics 1; Score in Writing and Reading 1; Score in Computer 1; Level of motivation; Marital status; Physical disability; GPA in secondary school; Smoking habits.

InfoGainAttributeEval: Score in Humanities 1; Score in English 1; Score in Math 1; Score in Social Science 1; Score in Physics 1; Score in Writing and Reading 1; Score in Computer 1.

OneRAttributeEval: Score in Humanities 1; Score in Social Science 1; Score in English 1; Score in Computer 1; Score in Writing and Reading 1; Score in Math 1; Score in Physics 1; Level of motivation.

ReliefFAttributeEval: Score in Physics 1; Score in English 1; Score in Math 1; Score in Humanities 1; Score in Writing and Reading 1; Score in Social Science 1; GPA in secondary school; Score in Computer 1; Level of motivation; Age; Average score in EXANI I; Smoking habits.

SymmetricalUncertAttributeEval: Score in Humanities 1; Score in Math 1; Score in English 1; Score in Social Science 1; Score in Physics 1; Score in Writing and Reading 1; Score in Computer 1.

rebalanced using the SMOTE algorithm, obtaining 50% Pass students and 50% Fail students; the test files were not rebalanced. After performing all the previous pre-processing tasks, we obtained the following tenfold cross-validation files:

1) Ten training and test files with all attributes (77).
2) Ten training and test files with only the best attributes (15).
3) Ten training and test files with only the best attributes (15); the training files are rebalanced using SMOTE.
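The SMOTE idea can be reimplemented minimally for numeric feature vectors. The paper uses Weka's SMOTE filter; the sketch below is only illustrative and the function name and parameters are ours:

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples. Each synthetic sample
    lies on the line segment between a randomly chosen minority sample
    and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` within the minority class itself.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic
```

In this study the equivalent over-sampling was applied only to the training folds, never to the test folds.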

TABLE III

MOST INFLUENTIAL ATTRIBUTES RANKED BY FREQUENCY OF APPEARANCE

Attribute                        Frequency
Score in Humanities 1            10
Score in English 1               10
Score in Social Science 1        9
Score in Math 1                  9
Score in Writing and Reading 1   9
Score in Physics 1               9
Score in Computer 1              9
Level of motivation              5
GPA in secondary school          3
Smoking habits                   2
Average score in EXANI I         2
Age                              2
Physical disability              2
Classroom/group                  2

V. DATA MINING AND EXPERIMENTATION

This section describes the experiments and data mining techniques used for obtaining the prediction models of students' academic status at the end of the semester.

We performed several experiments in order to try to obtain the highest classification accuracy. In a first experiment we executed 10 classification algorithms using all available information (77 attributes). In a second experiment, we used only the best selected attributes (15). In a third experiment, we repeated the executions using re-balanced data files. In a final experiment we considered different costs in the classification.

In this paper, decision trees and rule induction algorithms are used, as they are “white box” classification techniques; that is, they provide an explanation for the classification result and can be used directly for decision making. A decision tree is a set of conditions organized in a hierarchical structure. An instance is classified by following the path of satisfied conditions from the root of the tree until a leaf is reached, which will correspond to a class label. Rule induction algorithms usually employ a specific-to-general approach, in which the obtained rules are generalized (or specialized) until a satisfactory description of each class is obtained. Ten commonly used classical classification algorithms that are available in the well-known Weka DM software have been used:

1) Five rule induction algorithms: JRip, which is a propositional rule learner; NNge, which is a nearest-neighbor-like algorithm; OneR, which uses the minimum-error attribute for class prediction; Prism [17], which is an algorithm for inducing modular rules; and Ridor, which is an implementation of the Ripple-Down Rule learner.

2) Five decision tree algorithms: J48 [18], which is an algorithm for generating a pruned or unpruned C4.5 decision tree; SimpleCart [19], which implements minimal cost-complexity pruning; ADTree [20], which is an alternating decision tree; RandomTree, which considers K randomly chosen attributes at each node of the tree; and REPTree, which is a fast decision tree learner.

A decision tree can be directly transformed into a set of IF-THEN rules (which are obtained by rule induction


TABLE IV

CLASSIFICATION RESULTS USING ALL ATTRIBUTES

Algorithm    TP Rate  TN Rate  Acc   GM
JRip         97.0     81.7     95.7  89.0
NNge         98.0     76.7     96.1  86.7
OneR         98.9     41.7     93.7  64.2
Prism        99.2     44.2     94.7  66.2
Ridor        95.6     68.3     93.1  80.8
ADTree       99.2     78.3     97.3  88.1
J48          97.7     55.5     93.9  73.6
RandomTree   98.0     63.3     94.9  78.8
REPTree      97.9     60.0     94.5  76.6
SimpleCart   98.0     65.0     95.1  79.8

algorithms) [18], which are one of the most popular forms of knowledge representation due to their simplicity and comprehensibility. In this way, a non-expert user of DM, such as a teacher or instructor, can directly use the output obtained by these algorithms to detect students with problems (classified as Fail) and to make decisions about how to help them and prevent their possible failure.
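The tree-to-rules transformation can be illustrated with a toy example. The nested-dictionary tree representation and the output format below are ours, not Weka's J48 output:

```python
def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf path of a decision tree into one
    IF-THEN rule by accumulating the conditions along the path."""
    if "leaf" in node:  # leaf node: emit one rule
        cond = " and ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN class = {node['leaf']}"]
    attr = node["attribute"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + (f"{attr} = {value}",)))
    return rules

# A tiny hypothetical tree over two of the dataset's attributes.
toy_tree = {
    "attribute": "Physics 1",
    "branches": {
        "Poor": {"leaf": "Fail"},
        "Good": {"attribute": "Math 1",
                 "branches": {"Poor": {"leaf": "Fail"},
                              "Good": {"leaf": "Pass"}}},
    },
}
```

Each leaf of the toy tree yields one rule, e.g. "IF Physics 1 = Good and Math 1 = Poor THEN class = Fail".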

In the first experiment, all the classification algorithms were executed using tenfold cross-validation and all the available information, that is, the original data file with 77 attributes for 670 students. The results on the test files (an average of 10 executions) of the classification algorithms are shown in Table IV. This table shows the rates or percentages of correct classifications for each of the two classes, Pass (TP rate) and Fail (TN rate), the overall accuracy rate (Acc), and the geometric mean (GM). It can be seen in Table IV that the percentages obtained for total accuracy (Acc) and for Pass (TP rate) are high, but not for Fail (TN rate) and the geometric mean (GM). Specifically, the algorithms that obtain the maximum values are JRip (TN rate and GM) and ADTree (TP rate and Acc).
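The four evaluation measures can be computed directly from the confusion-matrix counts (TP and FN for the Pass class, TN and FP for the Fail class). A small helper, reporting percentages as in the tables:

```python
import math

def measures(tp, fn, tn, fp):
    """TP rate (Pass correctly classified), TN rate (Fail correctly
    classified), overall accuracy, and geometric mean, as percentages."""
    tp_rate = 100.0 * tp / (tp + fn)
    tn_rate = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    gm = math.sqrt(tp_rate * tn_rate)
    return tp_rate, tn_rate, acc, gm
```

The geometric mean penalizes classifiers that sacrifice the minority class: a high Acc with a low TN rate still yields a low GM, which is why GM is reported alongside accuracy for this imbalanced dataset.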

In the second experiment, we executed all the classification algorithms using tenfold cross-validation and the reduced dataset (with only the best 15 attributes). Table V shows the results on the test files (the average of 10 executions) using only the best 15 attributes. Comparing these results with the previous ones obtained using all the attributes, that is, Table IV versus Table V, we can see that in general all the algorithms have improved in some measures (TN rate and GM). With regard to the other measures (TP rate and accuracy), some algorithms obtain slightly worse or slightly better values, but in general they are very similar to the previous ones. In fact, the maximum values obtained are now better than those previously obtained using all attributes. Again, the algorithms that obtain these maximum values are JRip (TN rate and GM) and ADTree (TP rate and accuracy).

As we can see from Tables IV and V, TP rate values are normally higher than TN rate values. However, we have not yet obtained a good classification of the minority class (Fail), which may be due to the fact that our data are imbalanced. This feature of the data is not desirable because it negatively affects the results obtained. Classification algorithms tend to focus on classifying the majority class in order to

TABLE V

CLASSIFICATION RESULTS USING THE BEST ATTRIBUTES

Algorithm    TP Rate  TN Rate  Acc   GM
JRip         97.7     65.0     94.8  78.8
NNge         98.7     78.3     96.9  87.1
OneR         88.8     88.3     88.8  88.3
Prism        99.8     37.1     94.7  59.0
Ridor        97.9     70.0     95.4  81.4
ADTree       98.2     86.7     97.2  92.1
J48          96.7     75.0     94.8  84.8
RandomTree   96.1     68.3     93.6  79.6
REPTree      96.5     75.0     94.6  84.6
SimpleCart   96.4     76.7     94.6  85.5

TABLE VI

CLASSIFICATION RESULTS USING DATA BALANCING

Algorithm    TP Rate  TN Rate  Acc   GM
JRip         97.7     65.0     94.8  78.8
NNge         98.7     78.3     96.9  87.1
OneR         88.8     88.3     88.8  88.3
Prism        99.8     37.1     94.7  59.0
Ridor        97.9     70.0     95.4  81.4
ADTree       98.2     86.7     97.2  92.1
J48          96.7     75.0     94.8  84.8
RandomTree   96.1     68.3     93.6  79.6
REPTree      96.5     75.0     94.6  84.6
SimpleCart   96.4     76.7     94.6  85.5

TABLE VII

CLASSIFICATION RESULTS USING COST-SENSITIVE CLASSIFICATION

Algorithm    TP Rate  TN Rate  Acc   GM
JRip         96.2     93.3     96.0  94.6
NNge         98.2     71.7     95.8  83.0
OneR         96.1     70.0     93.7  80.5
Prism        99.5     39.7     94.4  54.0
Ridor        96.9     58.3     93.4  74.0
ADTree       98.1     81.7     96.6  89.0
J48          95.7     80.0     94.3  87.1
RandomTree   96.6     68.3     94.0  80.4
REPTree      95.4     65.0     92.7  78.1
SimpleCart   97.2     90.5     96.6  93.6

obtain a good classification rate, and so they neglect the minority class.

In the third experiment, we again executed all the classification algorithms using tenfold cross-validation and the rebalanced training files (obtained using the SMOTE algorithm) with only the best 15 attributes. The results obtained after re-executing the 10 classification algorithms using tenfold cross-validation are summarized in Table VI. If we analyse and compare this table with the previous Tables IV and V, we can observe that over half of the algorithms have increased the values obtained in all the evaluation measures, and some of them also obtain new best maximum values in almost all measures except accuracy. The algorithms that obtain the best results are Prism, OneR and, again, ADTree.
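SMOTE [16] creates synthetic minority examples rather than simply duplicating existing ones. As a rough illustration of the idea (a hypothetical, numeric-attributes-only helper; the experiments themselves used Weka's SMOTE filter), each new example is an interpolation between a minority instance and one of its k nearest minority neighbours:

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points, SMOTE-style.

    minority: list of numeric attribute vectors of the minority class.
    Each synthetic point lies on the segment between a minority instance
    and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (squared distance)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority class with two numeric attributes:
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]
new_points = smote_like(minority, n_new=3)
```

Because each synthetic point is a convex combination of two real minority instances, the oversampled class stays inside the region the minority already occupies instead of being a set of exact copies.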

A different approach to solving the problem of imbalanced data classification is to apply cost-sensitive classification. Optimizing the classification rate without taking into consideration


TABLE VIII

SOME RULES DISCOVERED BY JRip USING THE BEST 15 ATTRIBUTES

AND CONSIDERING COST-SENSITIVE CLASSIFICATION

(Physics 1 = Poor) => Academic Status = Fail
(Humanities 1 = Not Presented) => Academic Status = Fail
(Math 1 = Not Presented) => Academic Status = Fail
(English 1 = Poor) and (Physics 1 = Not Presented) => Academic Status = Fail
(Reading and Writing 1 = Not Presented) => Academic Status = Fail
(Social Science 1 = Poor) and (Age = over 15 years) and (Classroom/group = 1M) => Academic Status = Fail
=> Academic Status = Pass

the cost of errors can often lead to suboptimal results, because high costs can result from the misclassification of a minority instance. In fact, in our particular problem we are much more interested in the classification of Fail students (the minority class) than of Pass students (the majority class). These costs can be incorporated into the algorithm and considered during classification. In the case of two classes, the costs can be put into a 2 × 2 matrix in which the diagonal elements represent the two types of correct classification and the off-diagonal elements represent the two types of error.

The default cost matrix is [0, 1; 1, 0], where the main diagonal contains zeros (correct classifications) and the other values are equal to 1, indicating that both types of error have the same cost. If the costs of the misclassifications are different, then they must be indicated by values distinct from 1.

Weka allows any classification algorithm to be made cost-sensitive by using the meta-classification algorithm CostSensitiveClassifier and setting its base classifier to the desired algorithm. In our problem, we have used the matrix [0, 1; 4, 0] as the cost matrix, because in previous tests with different cost values this matrix obtained the best results. This matrix indicates that the classification is performed taking into consideration that it is four times more important to correctly classify Fail students than Pass students.
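The effect of such a cost matrix can be illustrated with a small sketch (a hypothetical helper, not Weka code): for each instance, the predicted class is the one that minimizes the expected misclassification cost, so with the matrix [0, 1; 4, 0] even a student who is probably going to pass can be pushed toward the Fail prediction:

```python
def min_cost_class(probs, cost):
    """Pick the class with minimum expected misclassification cost.

    probs: predicted probability of each true class, e.g. {"Pass": p, "Fail": q}.
    cost[i][j]: cost of predicting class j when the true class is i,
    with rows/columns ordered (Pass, Fail), matching the paper's [0, 1; 4, 0].
    """
    classes = ["Pass", "Fail"]
    expected = {
        pred: sum(probs[true] * cost[i][j] for i, true in enumerate(classes))
        for j, pred in enumerate(classes)
    }
    return min(expected, key=expected.get)

cost = [[0, 1], [4, 0]]
# 75% Pass is not enough: expected cost 0.25*4 = 1.0 for predicting Pass
# versus 0.75*1 = 0.75 for predicting Fail, so "Fail" is chosen.
prediction = min_cost_class({"Pass": 0.75, "Fail": 0.25}, cost)
```

With this matrix the decision threshold moves from 50% to 80%: only students whose Pass probability exceeds 0.8 are predicted as Pass, which is exactly how the meta-classifier raises the TN rate at a small cost in TP rate.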

In the fourth experiment, we executed all the classification algorithms using tenfold cross-validation, considering different costs of classification and introducing a cost matrix (with only the best 15 attributes). Table VII shows the results with the test files obtained after applying tenfold cross-validation. On analyzing and comparing Table VII with Table VI, some algorithms can be seen to obtain better values in some evaluation measures while other algorithms obtain worse values, so there is no clear general improvement. One algorithm (JRip) does obtain the current best maximum values of the TN rate (93.3%) and GM (94.6%), which is very important in our problem. In this experiment, the algorithms that obtain the best results are Prism, ADTree and SimpleCart.

VI. INTERPRETATION OF RESULTS

In this section, some examples of the different rules discovered by some of the algorithms are shown in order to compare their interpretability and usefulness for the early identification of students at risk of failing and for making decisions about how to help these students. These rules show us the relevant factors and relationships that lead a student to pass or fail.

TABLE IX

TREE OBTAINED USING AD TREE, THE BEST 15 ATTRIBUTES AND

CONSIDERING COST-SENSITIVE CLASSIFICATION

: -0.465
| (1) Humanities 1 = Not Presented: 1.824
| (1) Humanities 1 != Not Presented: -0.412
|| (2) Physics 1 = Poor: 1.415
|| (2) Physics 1 != Poor: -0.632
||| (9) Reading and Writing 1 = Not Presented: 1.224
||| (9) Reading and Writing 1 != Not Presented: -0.52
| (3) Social Science 1 = Poor: 1.689
| (3) Social Science 1 != Poor: -0.245
|| (4) English 1 = Poor: 1.278
|| (4) English 1 != Poor: -0.322
||| (5) Math 1 = Not Presented: 1.713
||| (5) Math 1 != Not Presented: -0.674
||| (6) Social Science 1 = Not Presented: 1.418
||| (6) Social Science 1 != Not Presented: -0.283
|||| (8) English 1 = Not Presented: 1.313
|||| (8) English 1 != Not Presented: -0.695
| (7) Math 1 = Poor: 0.758
|| (10) Humanities 1 = Regular: -0.473
|| (10) Humanities 1 != Regular: 0.757
| (7) Math 1 != Poor: -0.315
Legend: -ve = Pass, +ve = Fail

TABLE X

RULES (FAIL) OBTAINED USING PRISM, THE 15 BEST ATTRIBUTES AND

CONSIDERING COST-SENSITIVE CLASSIFICATION

If Reading and Writing 1 = Not Presented then Fail
If Social Science 1 = Not Presented and Humanities 1 = Not Presented then Fail
If Humanities 1 = Not Presented and Math 1 = Poor then Fail
If Math 1 = Not Presented and Score in EXANI I = Very Poor then Fail
If Reading and Writing 1 = Poor and Social Science 1 = Poor then Fail
If Math 1 = Not Presented and English 1 = Not Presented then Fail
If English 1 = Poor and Computer 1 = Regular then Fail
If Social Science 1 = Poor and Physics 1 = Poor then Fail
If Computer 1 = Poor and Math 1 = Poor then Fail
If Social Science 1 = Not Presented and Classroom/group = 1R then Fail
If English 1 = Poor and Classroom/group = 1G then Fail
If Humanities 1 = Poor and Math 1 = Poor and Classroom/group = 1M then Fail
If Humanities 1 = Not Presented and Classroom/group = 1E then Fail

TABLE XI

TREE OBTAINED USING SIMPLECART, THE 15 BEST ATTRIBUTES AND

CONSIDERING COST-SENSITIVE CLASSIFICATION

Humanities 1 = (Not Presented)|(Poor)
| Math 1 = (Not Presented)|(Poor): Fail
|| Classroom/group = (1M): Fail
Humanities 1 != (Not Presented)|(Poor)
| English 1 = (Poor)|(Not Presented)
|| Classroom/group = (1G): Fail
||| Level of Motivation = (Regular): Fail
||| Level of Motivation != (Regular): Pass
| English 1 != (Poor)|(Not Presented)
|| Social Science 1 = (Poor)
||| Classroom/group = (1R): Fail
||| Classroom/group != (1R): Fail
|| Social Science 1 != (Poor): Pass

In the model shown in Table VIII, it can be observed that the JRip algorithm discovers few rules. The attributes associated with Fail mostly concern marks, indicating that the student failed Physics or did not present for Humanities, Math, English, or Reading and Writing. Other attributes indicate that the students who failed are older


than 15 years old or belong to group 1M. The decision tree of Table IX shows that all the attributes concerning marks appear with the values Not Presented, Poor or Regular. It also shows that subjects like Humanities and Social Science, which are normally easy to pass, are at the top of the tree. In the model shown in Table X, it can be observed that the Prism algorithm finds a large number of rules. It also shows that, in addition to the attributes concerning marks (with the values Not Presented or Poor), there are attributes indicating that the student belongs to a particular group (1R, 1G, 1M or 1E). The classification tree obtained by the SimpleCart algorithm (see Table XI) is smaller than the one obtained by ADTree. It can be seen that, together with the attributes concerning marks (with the values Not Presented, Poor or Regular) in Humanities, Math, English and Computer, other attributes also appear, such as whether the level of motivation is low and whether the student belongs to a particular group (1R, 1G, 1M).

Finally, it is important to note that no consensus has been found among the previous classification algorithms on a single factor that most influences students' failure. However, the following set of factors (those which appear most often in the models obtained) can be considered the most influential: Poor or Not Presented in Physics and Math; Not Presented in Humanities and in Reading and Writing; Poor in English and Social Science; age over 15 years; and a regular level of motivation.

VII. CONCLUSION

As we have seen, predicting student failure at school can be a difficult task, not only because it is a multifactor problem (in which many personal, family, social, and economic factors can be influential) but also because the available data are normally imbalanced. To resolve these problems, we have shown the use of different DM algorithms and approaches for predicting student failure. We have carried out several experiments using real data from high school students in Mexico, applying different classification approaches for predicting the academic status or final student performance at the end of the course. Furthermore, we have shown that approaches such as selecting the best attributes, cost-sensitive classification, and data balancing can be very useful for improving accuracy.

It is important to note that gathering the information and pre-processing the data were two very important tasks in this work. In fact, the quality and reliability of the information used directly affect the results obtained. However, this is an arduous task that takes a great deal of time. Specifically, we had to extract the data from a paper-and-pencil survey and integrate data from three different sources to form the final dataset.

In general, regarding the DM approaches used and the classification results obtained, the main conclusions are as follows:

1) We have shown that classification algorithms can be used successfully to predict a student's academic performance and, in particular, to model the difference between Fail and Pass students.

2) We have shown the utility of feature selection techniques when there is a great number of attributes. In our case, we reduced the number of attributes from the 77 initially available to the best 15, obtaining fewer rules and conditions without losing classification performance.

3) We have shown two different ways to address the problem of imbalanced data classification: rebalancing the data and considering different classification costs. In fact, rebalancing the data was able to improve the classification results obtained in TN rate, accuracy, and geometric mean.
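As an illustration of point 2, attribute selection methods rank attributes by how much information they carry about the class. A minimal sketch of one common ranking criterion for nominal attributes, information gain (the toy data below is illustrative, not the real ranking computed in the experiments):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of one nominal attribute with respect to the class:
    the reduction in class entropy obtained by splitting on the attribute."""
    n = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy data: a mark attribute that separates the class perfectly
marks = ["Poor", "Good", "Poor", "Good"]
status = ["Fail", "Pass", "Fail", "Pass"]
gain = info_gain(marks, status)  # 1.0: one full bit of information
```

Ranking all 77 attributes by a score of this kind and keeping the top 15 is the essence of the filter-style feature selection applied in the second experiment.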

Regarding the specific knowledge extracted from the classification models obtained, the main conclusions are as follows:

1) White-box classification algorithms obtain models that can explain their predictions at a higher level of abstraction by means of IF-THEN rules. In our case, rule induction algorithms produce IF-THEN rules directly, and decision trees can be easily transformed into IF-THEN rules. IF-THEN rules are one of the most popular forms of knowledge representation, due to their simplicity and comprehensibility. These types of rules are easily understood and interpreted by non-expert DM users, such as instructors, and can be directly applied in the decision-making process.

2) Concerning the specific factors or attributes related to student failure, some specific values appear most frequently in the classification models obtained. For example, the values of scores/grades that appear most frequently in the obtained classification rules are "Poor", "Very Poor", and "Not Presented" in the subjects of Physics 1, Humanities 1, Math 1 and English 1. Other factors frequently associated with failing are being over 15 years of age, having more than one sibling, attending an evening classroom/group, having a low level of motivation to study, living in a big city (with more than 20 thousand inhabitants), and considering Math a difficult subject. It is also striking that failing grades in a subject like Humanities, which the majority of students usually pass, appear in the obtained models.

3) In this study we have used the students' marks and have not focused solely on social, economic, and cultural attributes, for two main reasons. The first is that we obtained bad classification results when we did not consider the marks. Secondly, the grades obtained by students have previously been used in a great number of other similar studies [21], [22].
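As an illustration of point 1, transforming a decision tree into IF-THEN rules simply emits one rule per root-to-leaf path. A minimal sketch follows (the tree encoding is hypothetical, chosen for the illustration; it is not Weka's internal format, and the toy tree is not a model from this study):

```python
def tree_to_rules(node, conditions=()):
    """Flatten a decision tree into IF-THEN rules, one rule per leaf.

    A node is either a class label (a leaf) or a pair
    (attribute, {value: child_node, ...}).
    """
    if isinstance(node, str):  # leaf: the path so far becomes one rule
        body = " and ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

# A toy tree in the spirit of Tables IX and XI:
tree = ("Humanities 1", {
    "Not Presented": "Fail",
    "Poor": ("Math 1", {"Poor": "Fail", "Good": "Pass"}),
    "Good": "Pass",
})
rules = tree_to_rules(tree)
```

Each rule accumulates the attribute tests along one branch, which is why the trees in Tables IX and XI can be read by instructors as a compact set of IF-THEN statements.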

Starting from the previous models (rules and decision trees) generated by the DM algorithms, a system can be implemented to alert teachers and parents about students who are potentially at risk of failing or dropping out. As an example of possible action, we propose that once students are found to be at risk, they be assigned to a tutor in order to provide them with both academic support and guidance, motivating them and trying to prevent student failure.


Finally, as the next step in our research, we aim to carry out more experiments using more data, also from different educational levels (primary, secondary, and higher), to test whether the same performance results are obtained with different DM approaches. As future work, we can mention the following:

1) To develop our own algorithm for classification/prediction, based on grammars and genetic programming, that can be compared against classic algorithms.

2) To predict student failure as soon as possible. The earlier the better, in order to detect students at risk in time, before it is too late.

3) To propose actions for helping the students identified within the risk group, and then to check how often it is possible to prevent the failure or dropout of those previously detected students.

REFERENCES

[1] L. A. Alvares Aldaco, “Comportamiento de la deserción y reprobación en el Colegio de Bachilleres del Estado de Baja California: Caso plantel Ensenada,” in Proc. 10th Congr. Nac. Invest. Educ., 2009, pp. 1–12.

[2] F. Araque, C. Roldán, and A. Salguero, “Factors influencing university drop out rates,” Comput. Educ., vol. 53, no. 3, pp. 563–574, 2009.

[3] M. N. Quadril and N. V. Kalyankar, “Drop out feature of student data for academic performance using decision tree techniques,” Global J. Comput. Sci. Technol., vol. 10, pp. 2–5, Feb. 2010.

[4] C. Romero and S. Ventura, “Educational data mining: A survey from 1995 to 2005,” Expert Syst. Appl., vol. 33, no. 1, pp. 135–146, 2007.

[5] C. Romero and S. Ventura, “Educational data mining: A review of the state of the art,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 6, pp. 601–618, Nov. 2010.

[6] S. Kotsiantis, K. Patriarcheas, and M. Xenos, “A combinational incremental ensemble of classifiers as a technique for predicting students’ performance in distance education,” Knowl. Based Syst., vol. 23, no. 6, pp. 529–535, Aug. 2010.

[7] J. Más-Estellés, R. Alcover-Arándiga, A. Dapena-Janeiro, A. Valderruten-Vidal, R. Satorre-Cuerda, F. Llopis-Pascual, T. Rojo-Guillén, R. Mayo-Gual, M. Bermejo-Llopis, J. Gutiérrez-Serrano, J. García-Almiñana, E. Tovar-Caro, and E. Menasalvas-Ruiz, “Rendimiento académico de los estudios de informática en algunos centros españoles,” in Proc. 15th Jornadas Enseñanza Univ. Inf., Barcelona, Spain, 2009, pp. 5–12.

[8] S. Kotsiantis, “Educational data mining: A case study for predicting dropout-prone students,” Int. J. Knowl. Eng. Soft Data Paradigms, vol. 1, no. 2, pp. 101–111, 2009.

[9] I. Lykourentzou, I. Giannoukos, V. Nikolopoulos, G. Mpardis, and V. Loumos, “Dropout prediction in e-learning courses through the combination of machine learning techniques,” Comput. Educ., vol. 53, no. 3, pp. 950–965, 2009.

[10] A. Parker, “A study of variables that predict dropout from distance education,” Int. J. Educ. Technol., vol. 1, no. 2, pp. 1–11, 1999.

[11] T. Aluja, “La minería de datos, entre la estadística y la inteligencia artificial,” Quaderns d’Estadística Invest. Operat., vol. 25, no. 3, pp. 479–498, 2001.

[12] M. M. Hernández, “Causas del fracaso escolar,” in Proc. 13th Congr. Soc. Española Med. Adolescente, 2002, pp. 1–5.

[13] E. Espíndola and A. León, “La deserción escolar en América Latina: Un tema prioritario para la agenda regional,” Revista Iberoamer. Educ., vol. 1, no. 30, pp. 39–62, 2002.

[14] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Mateo, CA, USA: Morgan Kaufmann, 2005.

[15] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques for data mining,” Dept. Comput. Sci., Univ. Waikato, Hamilton, New Zealand, Tech. Rep. 00/10, Jul. 2002.

[16] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002.

[17] J. Cendrowska, “PRISM: An algorithm for inducing modular rules,” Int. J. Man-Mach. Stud., vol. 27, no. 4, pp. 349–370, 1987.

[18] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.

[19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. New York, USA: Chapman & Hall, 1984.

[20] Y. Freund and L. Mason, “The alternating decision tree algorithm,” in Proc. 16th Int. Conf. Mach. Learn., 1999, pp. 124–133.

[21] L. Fourtin, D. Marcotte, P. Potvin, E. Roger, and J. Joly, “Typology of students at risk of dropping out of school: Description by personal, family and school factors,” Eur. J. Psychol. Educ., vol. 21, no. 4, pp. 363–383, 2006.

[22] L. G. Moseley and D. M. Mead, “Predicting who will drop out of nursing courses: A machine learning exercise,” Nurse Educ. Today, vol. 28, no. 4, pp. 469–475, 2008.

Carlos Márquez-Vera received the M.Sc. degree in physics education from the University of Havana, Havana, Cuba, in 1997. He is currently pursuing the Ph.D. degree with the University of Córdoba, Córdoba, Spain.

He is currently a Professor with the Academic Preparatory Unit, Autonomous University of Zacatecas, Zacatecas, Mexico. His current research interests include educational data mining.

Cristóbal Romero Morales received the B.Sc. and Ph.D. degrees in computer science from the University of Granada, Granada, Spain, in 1996 and 2003, respectively.

He is currently an Associate Professor with the Department of Computer Science and Numerical Analysis, University of Cordoba, Cordoba, Spain. He has authored or co-authored more than 50 papers in journals and conferences, including 20 papers in international journals. His current research interests include applying data mining in e-learning systems.

Dr. Romero is a member of the Knowledge Discovery and Intelligent Systems Research Laboratory. He is a member of the International Educational Data Mining Society.

Sebastián Ventura Soto received the B.Sc. and Ph.D. degrees from the University of Cordoba, Cordoba, Spain, in 1989 and 1996, respectively.

He is currently an Associate Professor with the Department of Computer Science and Numerical Analysis, University of Cordoba, where he is the Head of the Knowledge Discovery and Intelligent Systems Research Laboratory. He has authored or co-authored more than 90 papers in journals and conferences, including 35 papers in international journals. He has been engaged in 12 research projects (being the coordinator of three of them), supported by the Spanish and Andalusian governments and the European Union, on evolutionary computation, machine learning, and data mining and its applications. His current research interests include soft computing, machine learning, and data mining and its applications.