2009 Second International Conference on Computer and Electrical Engineering

978-0-7695-3925-6/09 $26.00 © 2009 IEEE

DOI 10.1109/ICCEE.2009.168

Predicting NDUM Student’s Academic Performance Using Data Mining Techniques

Muslihah Wook

Department of Computer Science, Faculty of Science and Defence Technology

National Defence University of Malaysia, 57000 Kuala Lumpur, Malaysia [email protected]

Norshahriah Wahab

Department of Computer Science, Faculty of Science and Defence Technology

National Defence University of Malaysia, 57000 Kuala Lumpur, Malaysia [email protected]

Nor Fatimah Awang

Department of Computer Science, Faculty of Science and Defence Technology

National Defence University of Malaysia, 57000 Kuala Lumpur, Malaysia

[email protected]

Yuhanim Hani Yahaya

Department of Computer Science, Faculty of Science and Defence Technology

National Defence University of Malaysia, 57000 Kuala Lumpur, Malaysia [email protected]

Mohd Rizal Mohd Isa

Department of Computer Science, Faculty of Science and Defence Technology

National Defence University of Malaysia, 57000 Kuala Lumpur, Malaysia

[email protected]

Hoo Yann Seong

Department of Computer Science, Faculty of Science and Defence Technology

National Defence University of Malaysia, 57000 Kuala Lumpur, Malaysia [email protected]

Abstract - The ability to predict students’ academic performance is very important in educational institutions. Recently, some researchers have proposed data mining techniques for higher education. In this paper, we compare two data mining techniques for predicting and classifying students’ academic performance: an Artificial Neural Network (ANN) and a combination of clustering and decision tree classification. The data set used in this research is the student data of the Computer Science Department, Faculty of Science and Defence Technology, National Defence University of Malaysia (NDUM).

Keywords- data mining, clustering, decision tree, artificial neural network.

I. INTRODUCTION

Data mining techniques have been applied in many areas such as banking, fraud detection and telecommunications [1]. Recently, data mining methodologies have been used to enhance and evaluate higher education tasks. Some researchers have proposed methods and architectures using data mining for higher education [2],[3],[4],[5]. The aim of this research is to identify the attributes that influence the performance of undergraduate students after their first-year degree examinations. The data set used in this research is the student data of the Computer Science Department, Faculty of Science and Defence Technology, National Defence University of Malaysia (NDUM). We will compare two data mining techniques: an Artificial Neural Network (ANN) and a combination of clustering and decision tree classification.

The ANN technique was chosen for this research based on the study in [6], which compared three models: ANN, decision tree and linear regression. Students’ demographic profiles and the CGPA for the first of the undergraduate studies were used as the predictor variables for students’ academic performance. The comparison showed that the ANN was able to produce accurate predictions of students’ academic performance at UiTM, Shah Alam.

One of the main goals in applying data clustering methods is to group students into clusters with dissimilar behavior: students in the same cluster exhibit the most similar behavior, and students in different clusters the most different [7]. The decision tree classification technique was chosen as suggested in [8]: classification is the modeling function of choice, since it can be used to find the relationship between a specific target variable and the other variables. By combining these two techniques, we apply a two-phase data mining method in such a way that the result of clustering is the input to the decision tree classification.

This paper is organized as follows: Section II briefly describes the problem statement of this research, Section III describes the background, Section IV details the methodology, and Section V outlines the conclusions and further research.

II. PROBLEM STATEMENT

Undergraduate students’ performance is a long-standing issue in higher education and has been the subject of a great deal of research over the past 75 years [9]. At the end of each semester, students’ results are analyzed in order to evaluate their academic performance. At NDUM, the Academic Affair is responsible for managing the examinations and the students’ results. It has been observed that most students’ performance is not encouraging: only a small number of students obtain a high Grade Point Average (GPA). The analyzed results show that students are apparently weak in certain groups of courses, which contributes to a poor GPA. As a result, the tendency of students to churn or quit the university is high. This situation could harm the image of NDUM in particular and the Defence Ministry in general.

III. BACKGROUND

A. Data Mining

Gartner Group defines data mining as “the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques.” Data mining is not intended to replace traditional statistics. Rather, data mining is an extension of statistics, and statistics is an integral component of data mining [10],[11]. Data mining is a combination of machine learning, statistical analysis, modeling techniques and database technology. Thus, data mining is capable of finding patterns and subtle relationships in data and inferring rules that allow the prediction of future results.

Meanwhile, according to [12], “data mining is the process of automatically extracting useful information and relationships from immense quantities of data. In its purest form, data mining doesn't involve looking for specific information. Rather than starting from a question or a hypothesis, data mining simply finds patterns that are already present in the data.” The author in [13] states that “these patterns are then built into data mining models and used to predict individual behavior with high accuracy. For example, data mining may give an institution the information necessary to take action before a student drops out, or to efficiently allocate resources with an accurate estimate of how many students will take a particular course.”

B. Data Mining in Higher Education

Universities are among the institutions that hold large amounts of data, such as yearly student enrolment, academic performance and alumni records. Usually, this past data goes unused because the institutions do not realize the hidden value behind it, do not know how to use it, and do not see why it matters for future use. Therefore, these institutions need to mine a significant amount of knowledge from their past and current data sets using special methods and processes [14]. Since data mining was introduced, the application of its techniques has grown in many areas such as business, telecommunications and banking, as well as education.

In the educational area, data mining has been defined as “the process of converting raw data from educational systems to useful information that can be used to inform design decisions and answer research questions” [15]. According to [16], data mining is an analytic approach that “capitalizes on the advances of technology and the extreme richness of data in higher education for improving research and decision making through uncovering hidden trends and patterns that lend them to predictive modeling using a combination of explicit knowledge base, sophisticated analytical skills and academic domain knowledge”.

C. Students’ Academic Performance

The understanding, prediction and prevention of academic failure among students have long been debated in higher education institutions. The study in [17] attempted to classify students into three groups: the 'low-risk' students, with a high probability of succeeding; the 'medium-risk' students, who may succeed thanks to the measures taken by the university; and the 'high-risk' students, with a high probability of failing (or dropping out). As a consequence, educators are able to give more attention to the ‘high-risk’ students, for example through extra classes, tests and tutorials. At the same time, this process facilitates drawing up students’ profiles based on their academic performance and failure risk. Another study was conducted by [2], who developed a model that allows decision makers to better predict which students are less likely to perform well, or to be successful, in a specific course.


IV. METHODOLOGY

In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements. One such model, CRISP-DM (Cross-Industry Standard Process for Data Mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. As applied in this study, the CRISP-DM methodology consists of six steps: understanding the higher education objective, collecting the educational data, preparing the data, building the models, evaluating the model using one of the evaluation methods, and finally deployment, which uses the model for future prediction of student performance. Figure 1 shows the research framework for this study.

Figure 1. Research Framework

A. Project Understanding

The initial step is to understand the project domain, chiefly students’ academic performance. This area of study is very complex and requires continuous attention. Exam failure among NDUM students must be investigated, predicted and prevented in order to produce high-quality graduates from this university. Our main objective is to choose the best technique to serve as a model for predicting students’ performance based on their academic results. The model should be able to classify students into groups of successful and unsuccessful students. The knowledge that can be extracted from this process is therefore the patterns of previously successful and unsuccessful students. By identifying these patterns, we are able to decide which types of students are more successful than others and provide academic help for those who are less likely to be successful.

B. Data Collection

The data used for this research is the student data of the Computer Science Department, Faculty of Science and Defence Technology, National Defence University of Malaysia (NDUM). This research focuses on 85 students from the Semester I 2008/2009 intake. We use primary data to complement the secondary data on the students. The primary data consists of the relevant features of each student, collected using a questionnaire. The following is a partial list of the groups of features (fields) selected for this study:

• Demographics: age, gender, religion, race, secondary school, home town etc.

• Education background: mode of entry (SPM/STPM/Matriculation), previous qualification results, MUET’s score, computer skill, name and number of courses taken, total credit taken, majoring, number of course repetition etc.

• Personality: motivation of study, reading level, learning environment and style, interest etc.

The secondary data covers the details of students’ previous results, such as CPA, CGPA and grade points by course type, obtained from the Academic Affair, NDUM.

C. Data Preparation

During data collection, the relevant data is gathered and its quality must be verified. Assembled data commonly contains missing or incomplete attributes, noise (errors, or outlier values that deviate from the expected) and inconsistencies. Therefore, the collected data must be cleaned and transformed before it can be used in a data mining system, since clean data is needed to produce quality results. Data cleaning involves several processes, such as filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. The cleaned data is then transformed into a tabular form suitable for the data mining model. The cleaned data will be divided into two parts: training (learning) data (60%), with the remainder used as validation data. The training data is used to develop the model, while the validation data is used to verify the chosen model.

D. Modeling

As mentioned earlier, we propose two techniques that are best suited to reaching our main objective, namely a neural network and a combination of clustering and decision tree techniques. The results obtained from each technique will be compared, and the best technique will be chosen as the model for this research. The two techniques are described as follows:

i. Artificial Neural Network

Neural networks offer a mathematical model that attempts to mimic the human brain [25]. Knowledge is represented as a layered set of interconnected nodes. The input to individual neural network nodes must be numeric and fall in the closed interval from 0 to 1 [25]. Each student attribute must therefore be normalized; for example, age is divided by 100, while gender and race are encoded as binary inputs. Neural network technologies such as feed-forward networks, illustrated in Figure 2 (often referred to as back propagation nets), have demonstrated promising capability for prediction [22, 23, 24]. To predict a student’s academic performance, student data such as demographics, educational background and personality must be considered and transformed into the required range from 0 to 1. The values from the input layer are passed to the hidden layer, where each node applies the sigmoid function to its weighted inputs; the output layer then produces the predicted value of the student’s performance, corresponding to either a successful or an unsuccessful profile.

Figure 2. Feed-Forward Neural Network [25]
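The input encoding and the forward pass described above can be sketched as follows. This is only an illustration under stated assumptions: the single binary race flag, the network size (three inputs, two hidden nodes, one output) and all weight values are hypothetical and untrained, and the 0.5 decision threshold is our own choice rather than something specified in the paper.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def encode_student(age, is_female, race_flag):
    """Scale raw attributes into [0, 1]: age / 100, binary flags as 0.0 or 1.0."""
    return [age / 100.0, 1.0 if is_female else 0.0, 1.0 if race_flag else 0.0]

def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid of each node's weighted input sum."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def predict(inputs, hidden_w, hidden_b, output_w, output_b):
    """Forward pass through one hidden layer and a single output node."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)[0]

# Hypothetical, untrained weights chosen only to make the example runnable.
HIDDEN_W = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.2, -0.7]]
OUTPUT_B = [0.05]

features = encode_student(age=20, is_female=True, race_flag=False)
score = predict(features, HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
# Assumed decision rule: scores of 0.5 or more map to the successful profile.
label = "successful" if score >= 0.5 else "unsuccessful"
```

In a real model the weights would be learned from the training data by back propagation; the sketch only shows how normalized inputs flow through the layers to a value between 0 and 1.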

ii. Clustering and Decision Tree

Unsupervised clustering can be described as the process of organizing objects in a database into clusters/groups such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity [19]. For the clustering process we utilized the FarthestFirst method based on the K-means algorithm. We specified the parameter k, the number of clusters to be sought. For this study the k parameter was 2, corresponding to the two student profiles we were interested in building, successful and unsuccessful: the students who passed all exams and the ones who failed one or more exams. Then k points were chosen at random as cluster centers. All instances were assigned to their closest cluster center according to the ordinary Euclidean distance metric. Next, the centroid of the instances in each cluster was calculated, and these centroids were taken to be the new center values for their respective clusters. Finally, the whole process was repeated with the new cluster centers. Iteration continued until the same points were assigned to each cluster in consecutive rounds, at which stage the cluster centers had stabilized and would remain the same [20].
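The iteration described above can be sketched in a few lines of plain Python. This is a minimal sketch of ordinary k-means as the paragraph describes it (random initial centers, Euclidean assignment, centroid update, stop when assignments repeat), not the FarthestFirst implementation used in the study; the function names, the seed and the six two-dimensional data points are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Ordinary Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def k_means(points, k=2, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k points chosen at random as centers
    assignment = None
    for _ in range(max_rounds):
        # Assign every instance to its closest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # same assignment in consecutive rounds: centers are stable
        assignment = new_assignment
        # Recompute each center as the centroid of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignment, centers

# Hypothetical normalized student records (two features each, values in [0, 1]).
students = [[0.9, 0.8], [0.85, 0.9], [0.8, 0.85],
            [0.2, 0.3], [0.25, 0.2], [0.3, 0.25]]
cluster_ids, centers = k_means(students, k=2)
```

The returned `cluster_ids` play the role of the cluster IDs that the second phase uses as class labels.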

Unfortunately, the cluster model has one drawback: there are no explicit rules that define each cluster. The model obtained by clustering is thus difficult to implement, and there is no clear understanding of how the model assigns cluster IDs or centroid values [21]. Therefore, we propose to employ a decision tree, which may give a simpler model of the classes. A decision tree is a tree-shaped structure that represents sets of decisions. These decisions generate rules for the classification of a dataset. Trees can be grown to arbitrary accuracy and use validation data sets to avoid spurious detail [21]. They are easy to understand and modify. Moreover, the tree representation provides explicit, easy-to-understand rules for each cluster of student performance. The classes in the decision tree are the cluster IDs obtained in the first step of the method. The decision tree represents the knowledge in the form of IF-THEN rules. One rule is created for each path from the root to a leaf, and the leaf node holds the class prediction [21].
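To illustrate how one IF-THEN rule per root-to-leaf path is read off a tree, the sketch below walks a small hand-written binary tree. The tree itself, its attributes (`cgpa`, `course_repetitions`) and its thresholds are hypothetical examples and not results from this study; in practice the tree would be learned from the clustered student data.

```python
def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path of a binary decision tree."""
    if "cluster" in node:  # leaf node: holds the class (cluster) prediction
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN cluster = {node['cluster']}"]
    test = f"{node['attribute']} <= {node['threshold']}"
    negated = f"{node['attribute']} > {node['threshold']}"
    # Follow both branches, extending the path conditions accordingly.
    return (extract_rules(node["yes"], conditions + (test,))
            + extract_rules(node["no"], conditions + (negated,)))

# Hypothetical tree; attribute names and thresholds are illustrative only.
TREE = {
    "attribute": "cgpa", "threshold": 2.0,
    "yes": {"cluster": "unsuccessful"},
    "no": {
        "attribute": "course_repetitions", "threshold": 1,
        "yes": {"cluster": "successful"},
        "no": {"cluster": "unsuccessful"},
    },
}

rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

Each printed rule corresponds to exactly one leaf, which is what makes the tree easier to interpret than the raw cluster centroids.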

E. Evaluation

Before proceeding to final deployment, it is important to evaluate the model. This step is significant because the model’s ability to predict students’ academic performance must be proven. Then, a decision on the use of the data mining results should be reached. Moreover, there are major challenges in cultivating institutional best practices for using this model. Therefore, the researchers must maintain and update the model along with the associated student data, since student data changes every semester and year.

F. Deployment

As the final stage in CRISP-DM, new data sets will be applied to the model selected in the model building stage to generate predictions or estimates of the expected outcome. Hence, deployment of the neural network or the combined clustering and decision tree model focuses on making information and insights reliably available to the educational institution. Reporting the predictions will benefit students as well as the institution. For example, if a high number of students have already failed in the current semester, the institution should take the necessary action to prevent them from failing in the next semester, such as providing intensive classes or extra work and exercises.

V. CONCLUSION

Predicting students’ academic performance is of great concern to higher education. Data mining can be used in a higher educational system to predict students’ academic performance. This research attempts to use data mining techniques to predict and classify students’ academic performance at NDUM. Two techniques will be compared: an Artificial Neural Network (ANN) and a combination of clustering and decision tree classification. The technique that gives the more accurate prediction and classification will be chosen as the model for this research. Using the proposed model, the patterns that influence students’ academic performance will be identified.

REFERENCES

[1] Han, J., Kamber, M. (2001) “Data Mining: Concepts and Techniques”. Morgan Kaufmann Publishers.

[2] Delavari, N., Beikzadeh, M.R. (2004) “A New Model for Using Data Mining in Higher Educational System”, 5th International Conference on Information Technology based Higher Education and Training: ITEHT ’04, Istanbul, Turkey, 31st May-2nd Jun 2004.

[3] Varapron, P. et al. (2003) “Using Rough Set theory for Automatic Data Analysis”. 29th Congress on Science and Technology of Thailand.

[4] Mierle, K., Laven, K., Roweis, S., Wilson, G. (2005) “Mining Student CVS Repositories for Performance Indicators”.

[5] Delavari, N., Beikzadeh, M.R., Amnuaisuk, S. (2005) “Application of Enhanced Analysis Model for Data Mining Processes in Higher Educational System” 6th Annual International Conference: ITEHT , July 7-9, 2005, Juan Dolio, Dominican Republic.

[6] Ibrahim, Z. and Rusli, D. (2007) “Predicting Students’ Academic Performance: Comparing Artificial Neural Network, Decision Tree and Linear Regression”, 21st Annual SAS Forum.

[7] Bresfelean, V.P., Bresfelean, M., Ghisoiu, N. (2006) “Continuing education in a future EU member, analysis and correlations using clustering techniques”, Proceedings of EDU'06 International Conference, Tenerife, Spain, pg. 195-200.

[8] Delavari, N. (2005) “Application of Enhanced Analysis Model for Data Mining Processes in Higher Educational System”, IEEE.

[9] Reason, R.D. (2003), “Student Variables That Predict Retention: Recent Research and New Development”, NASPA Journal, pg 172 – 191.

[10] Luan, J. (2003) “Developing learner concentric learning outcome typologies using clustering and decision trees of data mining”, Presentation at 43rd AIR Forum, Tampa, FL.

[11] Zhao, C., & Luan, J. (2006). “Data mining: Going beyond traditional statistics”, In J. Luan, & C. M. Zhao, (Eds), Chapter 1 of Data mining in action: Case studies of enrollment management, New Directions for Institutional Research, No. 131. San Francisco: Jossey-Bass.

[12] Rubenking, N. (2001) “Hidden Messages”, PC Magazine.

[13] Luan, J. (2004) “Data Mining Applications in Higher Education”, SPSS Exec. Report. http://www.spss.com/home_page/wp2.htm

[14] Bresfelean V.P. (2009) “Data Mining Applications in Higher Education and Academic Intelligence, Theory and Novel Applications of Machine Learning”, Book edited by: Meng Joo Er and Yi Zhou, ISBN 978-3-902613-55-4, pg. 376, I-Tech, Vienna, Austria.

[15] Heiner, C., Baker, R., Yacef, K. (2006), Preface. In: Workshop on Educational Data Mining at the 8th International Conference on Intelligent Tutoring Systems (ITS 2006), Jhongli, Taiwan.

[16] Luan, J. (2002) “Data mining: Predictive modeling & clustering essentials”, Presentation at the 44th AIR Forum, Toronto, Canada.

[17] Vandamme J.P., Meskens N., Superby J.F. (2007) “Predicting Academic Performance by Data Mining Methods”, Education Economics, Volume 15, Issue 4, pg. 405 – 419.

[18] Kalles D., Pierrakeas C.(2004) “Analyzing student performance in distance learning with genetic algorithms and decision trees”, Hellenic Open University, Patras, Greece.

[19] San, O.M., Huynh, V.N., Nakamori, Y. (2004) “An Alternative Extension of The K-Means Algorithm for Clustering Categorical Data”, Int. J. Appl. Math. Comput. Sci., Vol. 14, No. 2, p. 241–247.

[20] Bresfelean, V.P., Bresfelean, M., Ghisoiu, N. (2008), “Determining Student’s Academic Failure Profile Founded on Data Mining Methods”, Proceedings of the ITI 2008 30th Int. Conf. on Information Technology Interfaces, June 23-26, Cavtat, Croatia, p. 317 – 322.

[21] Borzemski, L., (2006) “The Use of Data Mining to Predict Web Performance”, Cybernetics and Systems: An International Journal, 37: p. 587–608.

[22] Lapedes, A. and Farber, R., (1988), "How neural nets work," Evolution, Learning, and Cognition, pages 331-345, World Scientific, Singapore.

[23] Moody, J., (1989), "Fast learning in multi-resolution hierarchies," Advances in Neural Information Processing Systems, volume 1, pages 29-39, Denver, Morgan Kaufmann, San Mateo.

[24] Werbos, PJ., (1990), "Backpropagation through time: What it does and.. how to do it," Proceedings of the IEEE, volume 78, p. 1550-1560.

[25] R.J. Roiger and M.W. Geatz, (2003), “Data Mining: A Tutorial-based Primer”. U.S.: Addison-Wesley, p. 246 – 250.
