
FACULTY OF SCIENCE

UTILIZATION OF LEARNING ANALYTICS TO OBTAIN PEDAGOGICALLY MEANINGFUL INFORMATION FROM DATA AND PRESENTING THE INFORMATION TO STUDENTS

Author: Gerben van der Huizen (10460748)

Supervisor: Bert Bredeweg

June 26, 2015

BSc Kunstmatige Intelligentie (Artificial Intelligence)
Afstudeerproject (graduation project) BSc KI

Science Park 904
1098 XH Amsterdam
The Netherlands


Contents

1 Abstract
2 Introduction
3 Related work
4 Research methods
   4.1 Resources
   4.2 Data set 1
   4.3 Data processing (Data set 1)
      4.3.1 Description of the activity and grade data set
      4.3.2 Extracted features from Data set 1
   4.4 Data set 2
   4.5 Data processing (Data set 2)
      4.5.1 Description of the exported Data set 2
   4.6 Prediction model
      4.6.1 Clustering
      4.6.2 Feature selection
      4.6.3 Classification
   4.7 Feature analysis
   4.8 Visualization
5 Results (Data set 1)
   5.1 Clustering results
   5.2 Feature selection results
   5.3 Classification results
   5.4 Feature analysis results
6 Results (Data set 2)
   6.1 Prediction model results
   6.2 Feature analysis results
7 Visualization
   7.1 Assign visualizations based on predictions
   7.2 Alternative approaches to data visualization
   7.3 GUI/Dashboard
8 Conclusion
9 Discussion and future work
References


1 Abstract

Finding ways to extract pedagogically meaningful information from student data is a common issue in the fields of learning analytics and educational data mining. One way to use this information is to present the data in such a way that students can use it to improve their learning strategies. This paper discusses different methods for extracting information from student data and for presenting that information to students. A model for predicting student performance and several feature analysis methods were used to extract information from the data, and ideas based on previous research were investigated for presenting data visualizations to students. The methods were tested on two data sets: one containing student data from assignments carried out in Coach, the other consisting of Blackboard data on assignment scores and clicks. Firstly, testing the prediction model on both data sets showed that only the Blackboard data had the potential to support accurate predictions of student performance. Secondly, feature analysis revealed correlations between features from which interesting information could be derived. Finally, the findings of this study indicate that the investigated ideas for visualizations should first be evaluated by students before they are implemented in a dashboard application for presenting visual representations of student data.

2 Introduction

It is a challenge for universities and other institutions to determine how to understand and make use of the educational data generated by students (Siemens, 2013). Both learning analytics (LA) and educational data mining (EDM) investigate how to use the large amounts of data students supply to their institution. LA can be described as a set of methods for collecting and analyzing data in order to improve the learning environment of the students from whom the data was collected, whereas EDM is generally regarded as the application of data mining techniques to large sets of student data to detect patterns that would otherwise be missed (Papamitsiou and Economides, 2014). One of the issues related to LA and EDM is discovering ways to analyse student data so as to gain pedagogically meaningful information about student behaviour and learning strategies (Papamitsiou and Economides, 2014). In this paper we use data analysis methods to investigate how meaningful information can be extracted from student data and reported effectively to students.


The research involves gaining insight into which features or attributes can be extracted from the data and, through data analysis, which of these features are important for improving student learning strategies. Students will also be categorized by a clustering algorithm, to investigate whether this technique can help identify hidden patterns and relationships in the student data. Furthermore, a prediction model based on student performance categories assigned via clustering will be created using the extracted features, allowing the predictive ability of both the utilized classification algorithms and the extracted features to be investigated. Finally, the research includes a review of how student data can be presented in unique and insightful ways with visualizations. Since these ideas for visualizations could not be tested on students and their data, the results on presenting useful visualizations to students are based on experience with the data and on previous research on visualizations. Through the LA and EDM methods mentioned above, and with the available student data, it is possible to design a LA application that creates the possibility of improving the learning strategies of students. This platform can later be evaluated by students or simply serve as a guideline for designing learning environment systems based on student data.

The contents of this project report are organised as follows. First, the related work is reviewed; this section contains short summaries of papers and explains why they are relevant to this project. Then the resources and research methods are discussed: the resources section describes which data, programming languages and analysis tools were used in this project, and the methods section describes how these resources were used to perform experiments. Next, the results obtained from the experiments described in the research methods are presented. In the conclusion the general results are summarized, as well as possible extensions of this research.

3 Related work

The review by Papamitsiou and Economides (2014) provides a systematic comparison of recent LA and EDM research. The paper classifies the research by analysis method, learning setting and research objective. The goal is to capture the strengths and weaknesses of LA research and to help identify which problems should be addressed in future research. One of the issues discussed in the review that is directly related to this project is finding ways to extract information from student data. The paper does not explain this issue in much detail, but provides references to research papers which do provide more information on the issue. Marquez-Vera et al. (2013) is an example of a paper that provides insight into how to (pre-)process a LA data set.


The paper describes how to perform data cleaning, how to use WEKA to test feature selection algorithms for reducing dimensionality, and how to use the SMOTE algorithm to balance imbalanced data (where the number of instances of certain classes is small compared to others). The pre-processed data is then used to predict student failure by testing a number of supervised learning algorithms for classification. The clustering of students on their performance with the k-means algorithm is discussed in Shovon et al. (2012). A small set of training data from students is used to divide the students into four performance-based classes. This allows a teacher, based on the results of the clustering algorithm, to decide to which category each student belongs and which kind of learning approach is suitable for that category. In our research, classification with the SMOTE algorithm, feature selection and k-means clustering were applied to student data as parts of a prediction model. The techniques from the papers on classification (Marquez-Vera et al., 2013) and clustering (Shovon et al., 2012) were implemented in the prediction model, which enabled it to classify students by performance category. The predictions made with the model provide information on the predictive ability of the data and its features.

Klerkx et al. (2014) show how visualization techniques are becoming increasingly important in the field of LA. The main goal of the paper is to describe which of the existing visualization techniques can be used to enhance the learning environments of students. Santos et al. (2013) demonstrate experiments with a LA application called StepUp!, which students can use to view their own learning activity through different visualizations of data. The paper describes brainstorming sessions in which students helped identify the issues they had with the application, which assisted the researchers in adding new functionality to StepUp!. An example of a LA dashboard for document recommendations and visualizing student data is given in Govaerts et al. (2010). The dashboard can be used by both teachers and students to improve self-monitoring for learners, awareness for teachers and students, time tracking, and the creation of learning resource recommendations (documents containing information on a certain subject). The dashboard was evaluated in two testing sessions with students. At the end of each session, user satisfaction was analyzed by asking students to give their opinion in a survey. It was concluded that students found the dashboard useful, but it was difficult to determine whether the dashboard actually improved their learning. The two papers about visualization dashboards provided ideas for designing such a dashboard and emphasized the necessity of including student evaluations to test the effectiveness of the visualizations and dashboards. To develop ideas for presenting data visualizations to students, the methods referenced in Klerkx et al. (2014) were used to provide insight into which of these methods have already been implemented in recent years. The results from the research papers about visualizations were mainly used to substantiate some of the statements


made about visualization methods in section 7.

4 Research methods

The research methods section is organised as follows. To begin with, the tools for programming and performing experiments are described in the resources section. Next, the sources from which the data was acquired are described in sections 4.2 and 4.4; these sections also describe the problems that were encountered while acquiring the data. Furthermore, the methods for processing the available data sets by extracting useful features and cleaning up the data are explained in sections 4.3 and 4.5. The prediction model and its components for determining the predictive ability of student data are discussed in section 4.6, followed by a description of the methods used for finding interesting correlations between features. The last section of the research methods discusses how the different ideas for visualizations were obtained.

4.1 Resources

The Python programming language (version 2.7), which includes support for reading data files, was used for applying clustering to the data with Scikit-learn, for creating visualizations and for application development. The WEKA machine learning environment was used to test the performance of machine learning algorithms that were unavailable in Scikit-learn (supervised learning and feature selection algorithms), and also to evaluate the clustering results. WEKA was further used for data analysis and for finding correlations in the data by means of its visualizations, correlation matrices and feature selection algorithms. Matplotlib and Pandas were used to create visualizations in Python, because they allow for a large amount of customization, e.g. different kinds of graphs, custom colors and text positioning. Finally, the PySide/PyQt libraries were used to create a dashboard as a demo of visualizations and other features that could be included in such an application.

4.2 Data set 1

Data set 1 consists of student data from an application called Coach. Students use Coach to carry out online assignments that are mandatory for passing a certain course, e.g. mathematics or biology courses. The received data belonged to a course from the psychobiology curriculum which required students to carry out basic mathematics assignments in Coach.


Performing experiments on Data set 1 first presented the opportunity to become familiar with working on student data sets and to test how well certain algorithms perform on different sets of student data.

The courses followed in Coach had certain rules and a structure that was presented to the students while working with the application. First of all, students could log in at home or at the university to practice with questions relevant to the part of the course they were working on. Students could attempt to score on these questions or simply accept the lowest score in exchange for all the answers. The questions provided different examples of the same concepts, which gave students the opportunity to get tips and direct feedback on their answers. The assessments required students to carry out exercises similar to those presented in the questions, but these assessments were obligatory for passing the course. The assessments consisted of randomized exercises on the same topic. Once an exercise had been answered correctly, the student did not have to make another attempt at that specific exercise. Students could make as many attempts at an exercise as they wanted, which means that their assessment score could only go up. Students were not obliged to practice first and then carry out the assessments; they were, however, required to pass each assessment with a minimum score of 80% before the deadlines.

Data set 1 contains two separate sets: one contains the grades that students achieved, and the other contains about 80,000 entries of student activity with features such as type of activity, student ID and timestamps (the full list of features is described in the data processing section 4.3).

4.3 Data processing (Data set 1)

Data set 1 was received as two Pickle files, each containing a Python list of dictionaries. In the activity data set, each dictionary represents a statement recorded by Coach; in the grade data set, the dictionaries map an anonymous student identification number to the course grades. The features from the Pickle files are described below, followed by a section explaining the data processing that was applied.
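Purely as an illustration of this loading step, the two Pickle files could be read and the activity entries grouped per student along the following lines; the file names and dictionary keys are hypothetical placeholders, since the actual fields are those listed in tables 1 and 2.

    import pickle

    # Hypothetical file names for the two supplied Pickle files.
    with open("activity.pkl", "rb") as f:
        activity = pickle.load(f)   # list of dicts, one per recorded statement
    with open("grades.pkl", "rb") as f:
        grades = pickle.load(f)     # list of dicts with student ID and course grades

    # Group the activity entries per anonymous student ID so that per-student
    # features can be computed later; the key name "student_id" is assumed.
    activity_per_student = {}
    for entry in activity:
        activity_per_student.setdefault(entry["student_id"], []).append(entry)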

4.3.1 Description of the activity and grade data set

The grade data set contains the grades and IDs of the 303 students who participated in the psychobiology course for which assignments were carried out in Coach. All the features from the grade data set are listed in table 1.


Table 1: A table containing all the features from the grade data set.

Table 2: A table containing all the features from the activity data set.

The activity data set contains the recorded activity of each student for every exercise or assessment that the student carried out in Coach during the psychobiology course. Assessments only contain a launched and a completed entry, which indicate when an assessment was launched and when it was completed.


Questions contain a launched entry and can contain multiple completed entries, which indicate when certain exercises within a question were completed. The data also contains a small number of media item entries (these could be videos or interactive images), but they do not contain a score. The student IDs were anonymized by using a randomly generated identification number instead of the students' email addresses. By cross-matching the student IDs in both data sets it was possible to create a data set containing new features for every student. Each of these features has to be calculated from the 80,000 entries, which can result in long waiting times depending on the complexity of the calculations. The data processing of Data set 1 was implemented in Python, and included reading the data sets, performing the calculations and creating the new data set. All the features from the activity data set are listed in table 2.

4.3.2 Extracted features from Data set 1

In total the data set consists of 278 features which can be used for data analysis and for creating a prediction model. These features can be generated either from the entire data set, covering one month of data, or from a selected time period (a week, two weeks or a month), which yields the same features computed over that time period. The generated data sets were saved in CSV format, which allows for efficient read and write operations. The extracted features included in the created data set are listed in table 3.

While extracting features from Data set 1, some problems were encountered. One problem is that a student can enter assignments and obtain all the answers without making a legitimate attempt at a decent score. This makes it difficult to filter the fake attempts from the real attempts when calculating the average score of students. Since Coach can register an attempt in which students receive the answers directly from the application, it would be ideal if this were somehow reflected in the data set. Another problem is that some data seemed to be missing, as some students had grades but no available activity data. Furthermore, activity entries belonging to the same assignment cannot be linked to each other with any sort of ID, because the ID supplied in the data is randomized for every entry. The randomized activity ID makes it difficult to determine how much time a student spent on an assignment.
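As a sketch only, the aggregation of per-student features for a chosen time window and the CSV export could look roughly as follows; the entry keys and most feature names are hypothetical (amountOfLogs is one of the real feature names mentioned in section 5.4).

    import csv
    from datetime import datetime, timedelta

    def extract_features(activity_per_student, grade_per_student, start, days=7):
        """Aggregate hypothetical per-student features for a chosen time window."""
        end = start + timedelta(days=days)
        rows = []
        for student_id, entries in activity_per_student.items():
            window = [e for e in entries if start <= e["timestamp"] < end]
            rows.append({
                "student_id": student_id,
                "amountOfLogs": len(window),                                # total activity
                "launched": sum(e["type"] == "launched" for e in window),   # launched entries
                "completed": sum(e["type"] == "completed" for e in window), # completed entries
                "final_grade": grade_per_student.get(student_id),
            })
        return rows

    # Tiny illustrative input; in the project this came from the two Pickle files.
    activity = {"s1": [{"timestamp": datetime(2015, 3, 2), "type": "launched"}]}
    grades = {"s1": 7.5}
    rows = extract_features(activity, grades, start=datetime(2015, 3, 1), days=7)

    with open("dataset1_week1.csv", "w") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)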


Table 3: A table containing all the extracted features from Data set 1, which were used in the experiments with the prediction model, feature analysis and visualization.

4.4 Data set 2

The Blackboard data (Data set 2) became available at a later stage of the project. Data set 2 was supposed to be extracted directly from the UvA database using SQL commands. An effort was made to team up with Blackboard ICTS (ICT Services) of the UvA to obtain Data set 2, but due to several problems encountered during the extraction process it was not possible to get a complete data set (2013-2015) of the three requested courses in time. The Blackboard database uses Oracle SQL to manage and update all of its data, so to extract this data we had to create and test specific SQL queries. The Blackboard ICTS section uses three separate databases: production (the version Blackboard runs on), development (which runs the newest version of the database with test data) and acceptance (which has less data and functionality and is used for testing queries). A query would first be tested in acceptance to remove any bugs or errors.


To retrieve the desired data, the tested query then had to be run on the production set, but the query could still end up retrieving the wrong data because of internal differences between acceptance and production.

Progress on extracting the data was best when direct interaction with Blackboard ICTS was possible, but this could not be arranged until later in the project. One of the main problems during extraction was that the names of data columns and rows in the database (back-end) did not correspond to the names visible on Blackboard itself (front-end). The people at ICTS called this problem 'bootstrapping', and it made it difficult to find the database entries corresponding to input or content shown on Blackboard. There were also smaller problems which caused delays, such as unexpected names for data features and the fact that retrieving data from the production set took 20 minutes to complete. Near the end of the project a bug was found which indicated that Blackboard assigned the wrong timestamps to clicks, making the timestamps unreliable when a sequence of clicks has to be determined. These were some of the problems which delayed the retrieval of Data set 2.

Even with the aid of Blackboard ICTS, who work directly with the Blackboard database, retrieving the required data took nearly four weeks. In that period another method for extracting the data was found. This method used export tools from Blackboard to extract data directly from the database and create a statistical overview of student data. The new method was time consuming and did not provide data from previous years, but it allowed for some experimentation with actual Blackboard data while work continued on retrieving the entire data set. Data set 2 consists of recorded clicks which a student performed on Blackboard during a course, and results (grades and scores) from assignments that he/she was required to turn in. Since students could make multiple attempts to pass an assignment, data on those attempts was also available. The final grades that the students achieved for a course were also made available (the full list of features is described in the data processing section 4.5).

4.5 Data processing (Data set 2)

As mentioned previously, an alternative method (the Blackboard export tools) was used to extract data from Blackboard and manually copy the results of the extraction into an Excel sheet. This data only contained pre-processed information on approximately 50 students from the three available courses, with no separate recordings of individual clicks or activity, which could normally be obtained directly from Data set 2. The three courses had a similar structure for grading, but each had different assignments, which caused the data to be split into three smaller sections for analysis. It should also be noted that the same students could follow multiple courses.


Table 4: A table containing all the extracted features from Data set 2, which were used in the experiments with the prediction model and feature analysis.

4.5.1 Description of the exported Data set 2

The exported data from Data set 2 contained information on 50 students from three different courses which took place over two months: Course 1 (26 students), Course 2 (17 students) and Course 3 (11 students). In total 21 features were extracted, but this number can vary with the number of assignments a course contains. A list of the features extracted from Data set 2 can be found in table 4.

During the process of manually extracting the data from the exported files some problems were encountered. One of these problems is that the loss of data through human error is likely when manually extracting the data from the export files. Another problem is that the pre-processed calculations which the Blackboard tool uses to create its data reports cannot be reviewed, so it is not possible to determine whether the generated data is reliable or correct. Generating the data and manually using Excel commands to retrieve the useful data is also a time-consuming process.
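As a small illustration (file name and column layout are assumptions, since the export was copied by hand), the resulting Excel sheet could be loaded and checked with Pandas, which was already part of the project's toolchain.

    import pandas as pd

    # Hypothetical file name for the manually copied Blackboard export of one course.
    df = pd.read_excel("blackboard_export_course1.xlsx")

    print(df.shape)            # expected: roughly 26 students x 21 feature columns
    print(df.isnull().sum())   # spot gaps introduced by manual copying
    df.to_csv("course1_features.csv", index=False)  # CSV for the later experiments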

4.6 Prediction model

The prediction model applies machine learning techniques to the student data for three different purposes: to discover how student categories can be created, to find the features which contribute most to predicting student performance, and to demonstrate the performance of supervised learning techniques on the UvA student data.


The prediction model can be created for different time periods of student data, to analyse whether it can still produce the same accuracy with a few days of student data instead of the entire month that is available. If the model performs well on smaller time periods it could potentially be used in a LA application for predicting student performance at any point during a course. The data from the first five days of the course (just before the entry exam was held) and the data from the entire month were used in the experiments.

The prediction model consists of clustering, feature selection and classification. Clustering with k-means is discussed in section 4.6.1, a description of the feature selection process can be found in section 4.6.2, and the classification process is explained in section 4.6.3.

4.6.1 Clustering

The k-means algorithm (with the Euclidean distance metric) was used to categorize students into different groups based on their data from one of the previously described data sets. Student categories or clusters can be based on different features, e.g. amount of activity, number of hints used, achieved grades and scores. In this project the students were divided into clusters based on their overall performance on both the entry exam and the final exam. The experiments were performed by splitting the students into two categories (low and high performance) and into three categories (low, average and high performance). More student categories can be obtained by using a larger variety of features when clustering; for example, students could also be categorized based on their performance and their amount of activity.

The WEKA learning environment was used to preview how k-means would perform and to investigate whether these groups provide any information about the data when clustering is applied with more features. The scikit-learn module for Python was used to recreate the results from WEKA and to add the labels for each category to the student data. Students with fake (test) or missing data were removed from the data set to prevent them from adding noise. Once the clustering process is finished and a data set with labeled students is acquired, the feature selection process can be applied.
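A minimal sketch of this labeling step with scikit-learn's KMeans, assuming a CSV of extracted features with hypothetical column names for the two exam grades (the fixed random state follows section 5.1):

    import pandas as pd
    from sklearn.cluster import KMeans

    # Hypothetical file/column names; the real features are listed in table 3.
    data = pd.read_csv("dataset1_features.csv")
    performance = data[["entry_exam_grade", "final_exam_grade"]].values

    # k-means (Euclidean distance) with a fixed random state so the cluster
    # centers start from the same point in every run; k = 2 or 3 categories.
    kmeans = KMeans(n_clusters=3, random_state=0)
    data["performance_category"] = kmeans.fit_predict(performance)

    # Save the labeled data set for the feature selection and classification steps.
    data.to_csv("dataset1_labeled.csv", index=False)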

4.6.2 Feature selection

Both WEKA and scikit-learn provide support for a number of different feature selection algorithms. WEKA provides more options for narrowing down the specific features that contribute most to improving the classification process (ranking), so WEKA was used for performing feature selection on the data. In WEKA, the results of the CfsSubsetEval feature selection algorithm were used to determine which features would be selected for classification.


The CfsSubsetEval option in WEKA evaluates the predictive ability and degree of redundancy of a subset of features, from which the algorithm determines which subset of features has the highest correlation with the class (the performance categories). Using CfsSubsetEval it is possible to find the best available subset of features for predicting the performance category of a student. InfoGainAttributeEval was used to evaluate and rank the worth of individual features by measuring their information gain with respect to the class/category; information gain is the reduction in uncertainty about the value of the class when the value of a feature is known. Feature selection was performed for both two and three performance categories of students, because a change in the number of categories (two or three) can also change the correlation of the subset with the classes (the information gain changes as well). Data set 1 contains two exam grades on which the categories can be based, so feature selection was tested for both of these grades. Once the best subset of features was found for both two and three categories, the performance of different classification algorithms with these subsets could be tested.
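The feature selection itself was done in WEKA. Purely as an analogue in the Python toolchain, an information-gain-style ranking can be approximated with scikit-learn's mutual information estimator; the file and column names below are hypothetical.

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    # Hypothetical labeled data set produced by the clustering step.
    data = pd.read_csv("dataset1_labeled.csv")
    y = data.pop("performance_category")
    X = data.select_dtypes(include=["number"])

    # Mutual information is closely related to the information gain reported by
    # WEKA's InfoGainAttributeEval; higher values mean more informative features.
    scores = mutual_info_classif(X.values, y.values, random_state=0)
    ranking = sorted(zip(X.columns, scores), key=lambda item: item[1], reverse=True)
    for name, score in ranking[:5]:
        print(name, round(score, 3))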

4.6.3 Classification

Classification was performed with different supervised learning algorithms on both two and three performance categories of students, to analyse which of the algorithms performed best with the selected subset of features. WEKA includes a wide range of classification algorithms, so a choice had to be made about which algorithms to use. Based on research by Ramaswami (2014), which includes an analysis of the predictive performance of classification algorithms on educational data, neural networks and decision trees (MultilayerPerceptron and J48 in WEKA) were chosen for the experiments. Other algorithms included for classification are multiclass logistic regression (Logistic in WEKA) and Naive Bayes (NaiveBayes in WEKA). 10-fold cross-validation was applied to make sure the results of each classification report (precision and recall calculations) would also apply to independent data. As mentioned previously, the classification algorithms were tested on both two and three classes with the appropriate subset of features. If the number of instances of a certain class is low compared to the other classes, the SMOTE algorithm is applied to balance the class instances before applying feature selection (SMOTE oversamples the minority class by creating synthetic examples). The accuracy of the different algorithms was tested not only for the official period of time allocated for the course, but also with only 5 days of student data, to test the performance of the algorithms with less data (the data for different time periods was only available in Data set 1).
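The classification experiments were run in WEKA. The sketch below shows how a comparable 10-fold cross-validation comparison could be set up with rough scikit-learn counterparts of the algorithms named above; the file and column names are hypothetical, and the SMOTE oversampling applied in WEKA is not reproduced here.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier

    data = pd.read_csv("dataset1_labeled.csv")  # hypothetical labeled data set
    y = data.pop("performance_category")
    X = data.select_dtypes(include=["number"])

    # Rough scikit-learn counterparts of the WEKA algorithms used in the project:
    # J48 -> decision tree, MultilayerPerceptron -> MLP, Logistic -> logistic
    # regression, NaiveBayes -> Gaussian naive Bayes.
    classifiers = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "neural network": MLPClassifier(max_iter=1000, random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
        "naive Bayes": GaussianNB(),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X.values, y.values, cv=10)  # 10-fold CV
        print(name, "mean accuracy: %.2f" % scores.mean())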


4.7 Feature analysis

Feature analysis was carried out to find correlations in the data which can help identify student learning strategies, or to identify features which have almost no relation to other features. WEKA allows for feature analysis via its visualization section, in which all the features of a supplied data set can be plotted against each other in 2D graphs. Applying feature selection (PCA) to the labeled data set also produces the unsupervised correlation/covariance matrix, which can be used to detect correlations between features (instead of between features and a class). The goal of the data analysis part of this project is to find interesting correlations using the WEKA visualization and correlation matrix functionality.
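The matrices reported in the results were produced with WEKA; the same kind of correlation matrix can also be computed with Pandas, which was already used in the project (file and column names are hypothetical).

    import pandas as pd

    data = pd.read_csv("dataset1_features.csv")      # hypothetical file name
    numeric = data.select_dtypes(include=["number"])

    # Pearson correlation between every pair of features; values close to +1 or -1
    # indicate strong linear relationships, values near 0 almost no relation.
    corr = numeric.corr()
    print(corr["average_assessment_score"].sort_values())  # hypothetical column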

4.8 Visualization

Since it was not possible to test visualizations on students with their own data, only ideas for visualization were conceived, based on the results of this project and on previous research on visualizations. The visualizations should enable or facilitate the process by which a student evaluates whether his/her performance has improved. Visualization can help improve the efficiency of certain tasks by showing what kind of progress was achieved and comparing it with the progress of others. Furthermore, a dashboard was designed and created based on previous research and the available data. This dashboard demonstrates how such an application could look and the functionality it would need to provide.


5 Results (Data set 1)

5.1 Clustering results

The k-means clustering algorithm was used to split the students into different categories based on their performance on either the entry exam or the final exam. The Scikit-learn implementation of k-means was used to divide the students into categories and to create a labeled data set file in CSV format. Scikit-learn's k-means uses a random state to initialize the cluster centers; this value was set to zero to make sure the clusters have the same starting point in every run. The results from k-means can be found in table 5.

From the clustering example in table 6 the following information about the average feature values of the created categories can be extracted. Cluster 0 contains students with above-average scores and final grade, but below-average activity and time spent in Coach. Cluster 1 is the largest group of students and has below-average values for every feature. Cluster 2 consists of students who on average spend a lot of time on the assignments, but this is not reflected in their final grade or scores. In cluster 3, students have above-average activity and high performance on scores and grades. Cluster 4 contains students with extremely low performance on the final grade as well as below-average time spent and amount of activity. Finally, students in cluster 5 have a lot of activity in Coach, but perform below average.

Based on the clustering results on Data set 1 it can be concluded that k-means is a solid option for categorizing students for data analysis. Table 6 demonstrates that k-means clustering clearly reveals the student categories. The features which distinguish categories from each other, e.g. performance and activity, can be identified as well, while features that do not differentiate between categories, such as the average assessment score, stay relatively the same for each category. The results of clustering students by performance into 2 and 3 categories, shown in table 5, were used for the experiments with feature selection (section 5.2) and performance classification (section 5.3). In addition to clustering students on their performance, using k-means with several other features can create different groups that provide valuable information about the data. Table 6 shows an example of how other features can be used to create different student categories.


Table 5: The results of clustering the students into different performance categories. The number of students in each cluster and the average grade of each cluster are listed.

Table 6: The results of clustering on multiple features.


5.2 Feature selection results

The best subsets of features for performance prediction can be found in table 7, and the feature ranking based on information gain in table 8. The selected subsets for Data set 1 varied in the kinds of features they contained. Overall, it is difficult to conclude anything from the subsets alone, which is why it was important to rank the features by their information gain with respect to the categories. Ranking the features with WEKA revealed that the features selected for predicting final exam performance have very low information gain, whereas the features selected for the entry exam have relatively high information gain. From these results it can already be concluded that it will be difficult to predict the performance of students on their final exam with the available features.

Table 7: Up to five features from the selected subsets for two (top table) and three (bottom table) student performance categories. The total number of features in each subset is listed as well. The WEKA function CfsSubsetEval (BestFirst) was used to create these subsets.


Table 8: Top five ranked features for two (top table) and three (bottom table) student performance categories. The WEKA function InfoGainAttributeEval (Ranker) was used to create these rankings.

5.3 Classification results

The results of classification on Data set 1 are listed in table 9. Feature selection and SMOTE were already applied to the data before performing these tests. Feature selection produced the subset of features which performed best for predicting the performance class, and with its help the performance of the supervised learning algorithms increased. When splitting the students into three categories, the low-performance category has a relatively small number of instances, which could cause problems when performing classification. Oversampling with the SMOTE algorithm helped to create more instances of low-performance students. The effects of both feature selection and oversampling on the performance of one of the supervised learning algorithms can be found in table 10. The complete performance test for all the experiments can be found in table 9.


Table 9: Performance ranking to test the predictive ability of the supervised learning algorithms for different time periods and different numbers of performance categories (Data set 1).

The classification results (table 9) show that some of the features from Data set 1 can be used to predict student performance on the entry exam with relatively high accuracy (70%-90%) for both two and three categories with the tested algorithms. The algorithms perform poorly (40%-60%) when trying to predict the performance of students on the final exam, with both two and three categories. This was expected given the low information gain recorded in the feature selection experiments. Having the average score of all attempts and the amount of activity for the entry exam in the data (the features associated with Entreetoets in the table) helps to achieve high accuracy when predicting performance on the entry exam, so removing these features results in a significant drop in accuracy, to values close to those of the final exam predictions. The difference in performance could also be attributed to the final exam being in a different format, or being more difficult, than the assignments in Coach.


Accuracy for predicting performance based on the final grade is lower when using data from 5 days instead of the entire month, which is likely due to the decrease in data for shorter time periods. The neural network and decision tree algorithms showed the highest performance on most of the tests that were performed.

Table 10: A demonstration of how the performance of a supervised learning algorithm can improve by utilizing SMOTE (oversampling) before applying feature selection and classification. The experiment was performed with the decision tree algorithm, on 5 days of data, with three performance categories based on the entry exam.


5.4 Feature analysis results

A correlation matrix, shown in table 11, was created to discover correlations between the features of Data set 1. Based on the correlations in the matrix, visualizations were created in WEKA to emphasize these findings (an example is shown in figure 12).

Table 11: Correlation matrix of the features from Data set 1 (excluding the other 254 features: the amount of activity and the average score for every individual assignment).


Figure 12: Example of a WEKA visualization, demonstrating the negative correlation between average assessment score (x-axis) and amount of time spent on assignments (y-axis) for Data set 1. Blue indicates above-average performance on the final exam and red indicates below-average performance.

Feature analysis on Data set 1 did result in some unexpected findings, e.g. the low correlation of total time spent with average score. The noticeable correlations from the correlation/covariance matrix (table 11), created in WEKA from the features of 303 students, are listed below:

• The number of days between the first and last recorded activity, and the total time spent on Coach assignments, have low correlation with the other features (under 0.30 and above -0.10).

• The number of questions not completed has a relatively high positive correlation with the average question score (if the number of uncompleted questions increases, the average question score increases as well).

• The positive correlation between assignment scores and time spent on assignments is low. Students who spend a lot of time on their assignments tend to have high average question and assessment scores, but students who spend little time on the assignments can still have a high average score (figure 12).

• When the number of assessments carried out increases, the average assessment score decreases (negative linear correlation).


• Other correlations are present as well, e.g. positive correlations between the amount of activity (amountOfLogs) and the numbers of launched exercises, questions and media items.

6 Results (Data set 2)

6.1 Prediction model results

The same prediction model as for Data set 1 was applied to Data set 2. The model was tested on the data from Course 1 and Course 2, since these still had a moderate number of instances compared to Course 3. Clustering resulted in two and three performance categories based on the final grade students had obtained. The performance categories were used for feature selection (table 13) and classification (table 14).

Table 13: The top table shows the subsets of features selected by WEKA's CfsSubsetEval (BestFirst) for Course 1; the bottom table shows the top five ranked features for Course 1, based on their correlation with the class as calculated with InfoGainAttributeEval (Ranker).


Since the courses all contain similar content and grading mechanics on Blackboard, the prediction model tested on Course 1 and Course 2 can also be applied to Course 3. Only data covering the entire course was available, which meant that it was not possible to test the prediction model on a smaller period of time, e.g. a week instead of the entire two months of the course. The feature selection results from testing the prediction model on the Course 1 data show that the features have high information gain and correlation with the performance class, whereas the information gain of the Data set 1 features with respect to the final grade performance class was significantly lower. Features such as the number of clicks made on Blackboard, the total assignment score and the total number of attempts at assignments show high information gain with respect to the performance class. The high information gain of the Course 1 features is reflected in the performance of the classification algorithms, which were more accurate than in the tests on Data set 1.

Table 14: The performance ranking for the predictive ability of the supervised learning algorithms on Course 1 and Course 2 (Data set 2). 10-fold cross-validation was applied to generate more reliable results, as were performance-enhancing techniques such as feature selection and SMOTE.


6.2 Feature analysis results

A correlation matrix was created for Data set 2, for Course 1 and Course 2, to discover correlations between features. Some of the interesting correlations resulted in insightful images from WEKA (figure 15 and figure 16); these correlations are discussed further in a later section. No unexpected correlations were found during the analysis, but confirming some of them can be important for presenting meaningful information to students. The following correlations were found:

• High positive correlation of the total assignment score with the time spent on Blackboard and with the total number of clicks.

• All features have positive correlations with each other (if the value of one feature increases, the value of the other feature also increases).

• Other correlations are present as well, e.g. positive correlations between the number of clicks, the time spent on Blackboard and the number of attempts at assignments.

Figure 15: Example of a WEKA visualization with Course 1 data, demonstrating the correlation between the total time spent on the Blackboard part of the course (y-axis) and the total score achieved for the assignments (x-axis). Blue indicates students with above-average performance for the entire course (final grade) and red indicates students with below-average performance.


Figure 16: Example of a WEKA visualization with Course 1 data, demonstrating the correlation between the total number of assignment attempts (y-axis) and the total score achieved for the assignments (x-axis). Blue indicates students with above-average performance for the entire course (final grade) and red indicates students with below-average performance.

7 Visualization

7.1 Assign visualizations based on predictions

Based on the different student categories created by the prediction model, it should be possible to assign visualizations to students according to their category. Some ideas for visualizations based on student categories were investigated, but they could not be tested on students, since the resources (e.g. access to Data set 2) and the students for such experiments were not available. This means that the visualizations are based on previous research in LA and on experience with extracting information from student data through data analysis and machine learning. Further research on assigning visualizations based on performance would require students to evaluate each of the created images, to investigate whether the images help students improve their learning strategies. Based on previous research, it is essential to have a group of students evaluate each of the created visualizations in order to investigate what kinds of visuals and interactions students prefer (see Govaerts et al., 2010, p. 7-9).

In general it is important to provide students with visualizations that allow them to put themselves in context with other students. Improving student performance by raising performance awareness and encouraging better performance is a viable option for students who are underperforming (Fritz, 2011).


Showing where a student resides in some distribution of the data, e.g. of performance and time spent, or showing what students are actively working on compared to other students, are examples of how students could compare themselves to others. Another approach could be to show student performance rankings, which might increase student motivation through competition. Showing a student a top 5 or top 10 of student performances, or of predicted performances in a course, next to his own data could also give the student an idea of where there is room for improvement. In addition, it could be beneficial if students were provided with visual indicators of when their activity or other data fell below a certain threshold associated with earning the grade they desire or need for passing the course. Figure 17 and figure 18 are two examples of possible visualizations.

Figure 17: Visualization example of visual indicators showing when a student's activity falls below a certain threshold. The amount of activity (y-axis) is plotted against the dates on which the activity took place (x-axis). The green line shows the average activity of all students and the blue line shows the activity of one selected student. (Data set 1 was used to create this visualization.)
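A minimal Matplotlib sketch of how a figure of this kind could be produced from per-day activity counts; the data values and the threshold rule (flagging days below half the class average) are illustrative assumptions, not taken from the data set.

    import matplotlib.pyplot as plt

    # Illustrative per-day activity counts; in the project these would be
    # computed from the Data set 1 activity entries.
    days = [1, 2, 3, 4, 5, 6, 7]
    student = [12, 4, 0, 3, 15, 9, 2]
    average = [10, 11, 9, 12, 13, 10, 8]

    plt.plot(days, average, color="green", label="average of all students")
    plt.plot(days, student, color="blue", label="selected student")

    # Assumed threshold rule: flag days where the student is below half the average.
    low = [i for i, (s, a) in enumerate(zip(student, average)) if s < 0.5 * a]
    plt.scatter([days[i] for i in low], [student[i] for i in low],
                color="red", zorder=3, label="below threshold")

    plt.xlabel("day of the course")
    plt.ylabel("amount of activity")
    plt.legend()
    plt.show()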


Figure 18: Visualization example which shows the assignments a student has been working on (green) and the assignments that all the other students worked on during the same dates (red). The different assignments (y-axis) are plotted against the dates on which they were worked on (x-axis). (Data set 1 was used to create this visualization.)


7.2 Alternative approaches to data visualization

In the previous section some ideas for visualizations for specific student categories were described, but other approaches to providing students with useful visualizations are also possible. As with assigning visualizations through a prediction model, these approaches also require evaluation by students.

Allowing students to explore visualizations on their own is one alternative approach. Instead of assigning visualizations based on student performance, this approach provides visual support to students in the form of multiple visual representations of the data. Providing students with many visualizations to choose from enables them to find correlations in the data by themselves and, in the process, improve their learning strategy and learn about the strategies of their fellow students (Tory and Moller, 2004, p. 74-75).

Another idea for future research is to create visualizations based on the type, amount or domain of the data. Klerkx et al. (2014) give an overview of research papers that have investigated appropriate visualization techniques for a number of different data types. Future work could investigate whether this approach can be automated for LA dashboards, so that students are provided with appropriate visualizations based on their data.

7.3 GUI/Dashboard

A GUI/dashboard was created in Python using the PySide/PyQt library. This dashboard is meant to demonstrate how a LA application could look and what functionality such an application could have. The current implementation only works with Data set 1, but a version that works with Data set 2 could also be implemented. When the dashboard is started, a pop-up appears that asks for a student number. Once the student number is selected, the dashboard shows a window with the data (visualizations) of the selected student. At the top of the window there are options for selecting another student or a different time period (1 week, 2 weeks, 1 month etc.). The application does not use the prediction model described earlier, but this could be implemented using Scikit-learn. At the bottom left the student can see a graph of his/her activity and the average activity of all students per date, with a slider for selecting either a time segment or the entire set of available dates (the minimum is a week). A few bar graphs can be selected at the bottom right containing information about the grades, activity and exercises of a student; this enables the student to compare his/her values to the average of all students. The top right contains the available information about the student. The top left segment currently holds some placeholder images, but in the future it could hold the recommended images for a student.


The dashboard gives a simple example of an application with which ideas for visualizations could be evaluated by students in future research (figure 19).
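The dashboard code itself is not reproduced here. As a minimal, hypothetical sketch using the Qt4-era PySide bindings named in the resources section, a window with a student selector and a placeholder area for the visualizations could be set up as follows; the widget layout and names are illustrative only.

    import sys
    from PySide import QtGui

    class DashboardWindow(QtGui.QWidget):
        """Minimal placeholder for the dashboard described in section 7.3."""

        def __init__(self, student_ids):
            super(DashboardWindow, self).__init__()
            self.setWindowTitle("LA dashboard demo")
            layout = QtGui.QVBoxLayout(self)

            # Top: selector for the student whose data should be shown.
            self.student_box = QtGui.QComboBox()
            self.student_box.addItems([str(s) for s in student_ids])
            layout.addWidget(self.student_box)

            # Placeholder for the activity graph / bar charts; Matplotlib images
            # could be embedded here in a full implementation.
            layout.addWidget(QtGui.QLabel("Activity and grade visualizations go here"))

    if __name__ == "__main__":
        app = QtGui.QApplication(sys.argv)
        window = DashboardWindow(["12345", "67890"])  # hypothetical anonymized IDs
        window.show()
        sys.exit(app.exec_())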

Figure 19: The GUI of the dashboard as described in section 7.3

8 Conclusion

In this paper we proposed methods for extracting meaningful information from student data using LA and EDM techniques. By testing these methods on two different data sets, we attempted to show what kind of information can be extracted and how this information can be presented to students. In the first stage of the research, important features were extracted from Data set 1 and Data set 2 to create data sets on which the data analysis and machine learning techniques could be tested.

By testing a predictive model for classifying student performance on Data set 1 and Data set 2, it was found that the model only showed promising performance on Data set 2. The predictive ability of the available features from Data set 1 was too low to predict which kind of grade a student would achieve at the end of a course. This result is in contrast with the results obtained from testing on Data set 2, which contained features with high information gain with respect to the class. Clustering as well as feature selection from the prediction model showed promising results for categorizing students and for investigating which features are important in the data. Overall, it can be concluded that the predictive ability of Data set 1 and its features is low and that the prediction model demonstrated a better performance when tested on Data set 2.

The results of the feature analysis showed that with the use of correlation matrices and visualizations interesting correlations can be identified in the data. Most of the correlations found in Data set 1 and Data set 2 were not unexpected (e.g. an increase in activity/clicks also shows an increase in time spent on Blackboard/Coach), but it was important to confirm that these kinds of correlations can be extracted from the data, because they can each provide information about the learning strategies of students (a minimal sketch of such a correlation matrix is given at the end of this section).

Different ideas were presented for providing students with visualizations from which they can extract information that helps improve their learning. It was explained that students could potentially learn a lot from seeing a visual representation of their data in context with other students, and also from receiving visual feedback when they are performing below average. Although many ideas for visualizations were listed, most of these ideas will have to be evaluated by students in future experiments to confirm that they can help improve learning.

The findings from this research show that it was successful in providing methods for extracting meaningful information from student data and demonstrating ideas for presenting this information to students. However, further research will have to be performed into applying these findings to a LA application for assisting students in improving their learning strategies.
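Purely as a hedged illustration of the correlation-matrix analysis summarised above, the snippet below computes a Pearson correlation matrix over a few invented activity features with pandas; the column names and values are placeholders, not the actual features of Data set 1 or Data set 2.

```python
# Hypothetical sketch: compute a correlation matrix over student features
# with pandas. Column names and values are placeholders, not the real data.
import pandas as pd

features = pd.DataFrame({
    "clicks":      [120, 45, 300, 80, 210],
    "time_on_bb":  [6.5, 2.0, 14.0, 3.5, 9.0],   # hours spent on Blackboard
    "exercises":   [30, 10, 55, 20, 40],
    "final_grade": [7.0, 5.5, 8.5, 6.0, 7.5],
})

# Pearson correlations between all pairs of features.
print(features.corr())
```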

9 Discussion and future work

The problems with extracting Data set 2, described in section 4.4, caused the data retrieval and analysis to be delayed. The data that was eventually extracted from the Blackboard database was incomplete, which is why the research methods were applied to the exported data set instead. Although not all the desired Blackboard data was available at the time of testing, the problems encountered during the data extraction process have been documented in this report and by the people working at Blackboard ICTS, which will likely facilitate the process for future research. The exported data set allowed experiments to be performed on Data set 2, but it is advised that future research does not use this data set, because extracting it is time consuming and prone to human error.

Once all the desired Blackboard data is included in Data set 2, it will be possible to make predictions with the data from previous years (not just 2015). It will also become possible to apply the prediction model to a smaller time period, as was achieved with Data set 1. For feature analysis it will be important to use the complete Blackboard data to evaluate whether the discovered correlations still apply to data from previous years. Furthermore, a more detailed analysis could be performed in the future, which could include t-tests or other tests of significance on the data; a brief sketch of such a test is given at the end of this section. In addition to the student data currently saved from Blackboard, it could be beneficial to record the attendance of students in Blackboard, since this supplies additional information that could help in predicting a student's performance or in other analyses.

When the complete version of Data set 2 is acquired, the correlations and other information extracted from this data could be used to create informative visualizations for improving the learning of students. However, it is also important to determine what kinds of visualizations can actually provide the student with meaningful information. Therefore, it will be necessary for future research to test the ideas for visualizations on students and to create a dashboard application based on the feedback retrieved from them.
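As a hedged sketch of the kind of significance test suggested above, the snippet below applies an independent-samples t-test (scipy.stats.ttest_ind) to the activity of two invented groups of students (e.g. those who passed versus those who failed); the values are placeholders and not real course data.

```python
# Hypothetical sketch: test whether two groups of students (e.g. pass vs. fail)
# differ significantly in average activity, using an independent t-test.
from scipy import stats

activity_pass = [120, 210, 300, 180, 260]   # clicks of students who passed
activity_fail = [45, 80, 60, 110, 70]       # clicks of students who failed

t_stat, p_value = stats.ttest_ind(activity_pass, activity_fail)
print("t = %.2f, p = %.4f" % (t_stat, p_value))
```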

References

Fritz, J. (2011). Classroom walls that talk: Using online course activity data of successful students to raise self-awareness of underperforming peers. The Internet and Higher Education, 14(2):89–97.

Govaerts, S., Verbert, K., Klerkx, J., and Duval, E. (2010). Visualizing activities for self-reflection and awareness. In Advances in Web-Based Learning – ICWL 2010, pages 91–100. Springer.

Klerkx, J., Verbert, K., and Duval, E. (2014). Enhancing learning with visualization techniques. In Handbook of Research on Educational Communications and Technology, pages 791–807. Springer.

Marquez-Vera, C., Cano, A., Romero, C., and Ventura, S. (2013). Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Applied Intelligence, 38(3):315–330.

Papamitsiou, Z. and Economides, A. A. (2014). Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence. Journal of Educational Technology & Society, 17(4):49–64.

Ramaswami, M. (2014). Validating predictive performance of classifier models for multiclass problem in educational data mining. International Journal of Computer Science Issues (IJCSI), 11(5):86–90.

Santos, J. L., Verbert, K., Govaerts, S., and Duval, E. (2013). Addressing learner issues with StepUp!: An evaluation. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, pages 14–22. ACM.

Shovon, M., Islam, H., and Haque, M. (2012). An approach of improving students' academic performance by using k-means clustering algorithm and decision tree. (IJACSA) International Journal of Advanced Computer Science and Applications, 3(8):146–149.

Siemens, G. (2013). Learning analytics: The emergence of a discipline. American Behavioral Scientist, pages 1–21.

Tory, M. and Moller, T. (2004). Human factors in visualization research. IEEE Transactions on Visualization and Computer Graphics, 10(1):72–84.
