EINDHOVEN UNIVERSITY OF TECHNOLOGY
Table 1.1: Goals of the data mining on educational data.
Table 3.1: Distribution of students with their course count.
Table 3.2: Validation and cleaning with validation window (2* standard
deviation).
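The validation windows named in these tables keep a record only when its value lies within k standard deviations of the mean (k = 2 or k = 3). A minimal sketch of such a filter, assuming the window is centred on the mean of the study time in years; the function name `within_window`, the use of the population standard deviation and the sample values are hypothetical and not taken from the thesis data:

```python
from statistics import mean, pstdev

def within_window(values, k):
    """Keep the values that lie within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    low, high = mu - k * sigma, mu + k * sigma
    return [v for v in values if low <= v <= high]

# Hypothetical study times in years, with one obvious outlier.
study_years = [4, 5, 5, 5, 6, 6, 7, 20]
cleaned_2std = within_window(study_years, 2)  # the outlier 20 is dropped
cleaned_3std = within_window(study_years, 3)  # the wider window keeps it
```

On this toy input the 2* window removes the outlier while the 3* window keeps every record, which is the kind of difference the paired tables compare.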
Figure 3.5: Distribution of students over study time. (x-axis: Years, 3 to 8; y-axis: Students, 0 to 140)
Table 3.3: Validation and cleaning with 3* standard deviation window.
Figure 3.6: Distribution of students over study time. (x-axis: Years, 1 to 11; y-axis: Students, 0 to 180)
Table 3.4: Validation and cleaning with 3* standard deviation window on
bootstrapped data.
Figure 3.7: Distribution of students over study time. (x-axis: Years, 2 to 9; y-axis: Students, 0 to 160)
Table 3.5: Validation and cleaning with 2* standard deviation window on
bootstrapped data.
Figure 3.8: Distribution of students over study time. (x-axis: Years, 4 to 7; y-axis: Students, 0 to 140)
Table 3.6: Validation and cleaning with 2* standard deviation window.
Figure 3.9: Distribution of students over study time. (x-axis: Years, 4 to 8; y-axis: Students, 0 to 70)
Table 3.7: Validation and cleaning with 3* standard deviation window.
Figure 3.10: Distribution of students over study time. (x-axis: Years, 2 to 10; y-axis: Students, 0 to 90)
Table 3.8: Validation and cleaning with 2* standard deviation window.
Figure 3.11: Distribution of students over study time. (x-axis: Years, 4 to 8; y-axis: Students, 0 to 70)
Table 3.9: Validation and cleaning with 3* standard deviation window.
Figure 3.12: Distribution of students over study time. (x-axis: Years, 3 to 10; y-axis: Students, 0 to 90)
Figure 3.13: Determination of the short, normal and long classes (class boundaries: 4.767096 and 5.832469)
Figure 3.14: Determination of the short, normal and long classes (class boundaries: 5.107548 and 6.34147)
Figure 3.15: Determination of the short, normal and long classes (class boundaries: 4.554752 and 6.374105)
Figure 3.16: Determination of the short, normal and long classes (class boundaries: 5.109692 and 7.164413)
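Figures 3.13 to 3.16 split students into short, normal and long study-time classes by means of two boundary values. A minimal sketch of such a split, using the boundaries 5.107548 and 6.34147 shown for Figure 3.14 and assuming they are study times in years; the function name `study_class` and the treatment of values that fall exactly on a boundary are assumptions:

```python
def study_class(years, lower=5.107548, upper=6.34147):
    """Label a study time (in years) as short, normal or long."""
    if years < lower:
        return "short"
    if years > upper:
        return "long"
    return "normal"  # values on a boundary count as normal (an assumption)

# Three hypothetical students, one per class.
labels = [study_class(y) for y in (4.5, 5.5, 7.0)]
```

Passing different `lower` and `upper` values reproduces the splits of the other three figures.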
Figure 3.19: Courses with amount of students
Figure 3.20: Students with amount of courses
Figure 3.21: Courses with amount of students
Figure 3.22: Students with amount of courses
Figure 3.23: Courses with amount of students
Figure 3.24: Students with amount of courses
Figure 3.25: Distribution of results over the years
Figure 3.26: Amount of new students over the years.
Figure 3.27: Distribution of results over the years
Figure 3.28: Amount of new students over the years.
Table 3.10: Attributes with possible values
Table 3.11: Data with categorized attributes
Table 3.12: Data with ordinal attributes
Precision, $p = \frac{TP}{TP + FP}$
Recall, $r = \frac{TP}{TP + FN}$
$F_1 = \frac{2}{\frac{1}{p} + \frac{1}{r}}$
Definitions of accuracy metrics
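As a worked check of these definitions, the sketch below computes precision, recall and F1 from raw counts. The counts are those of the short-normal class in the ID3 confusion matrix printed in the appendix (441 true positives, 46 false positives, 66 false negatives); the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
    return precision, recall, f1

# Class short-normal in the ID3 confusion matrix of the appendix.
p, r, f1 = precision_recall_f1(tp=441, fp=46, fn=66)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # prints "0.906 0.870 0.887"
```

These values agree with the precision, recall and F-measure that Weka reports for the same class.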
Table 4.1: JRIP results on 2std, categorical attributes
Table 4.2: JRIP results on 3std, categorical attributes
Table 4.3: JRIP results on 2std, ordinal attributes
Table 4.4: JRIP results on 3std, ordinal attributes
Table 4.5: Class distribution over instances of 3std, ordinal attributes
Table 4.6: Rules with cost sensitive learning.
Table 4.7: Rules with cost sensitive learning.
Figure 4.1: ROC curve of long class, table A15
Figure 4.2: ROC curve of long class, table A16
Table 4.8: JRIP results on binary class with SMOTE.
Figure 4.3: ROC curve of class long from JRIP results on binary class with
SMOTE.
Table 4.9: Ridor results on binary class with SMOTE.
Wiskunde 2 -> Wiskunde 1  0.699  232  5.08
Wiskunde 1 -> Wiskunde 2  0.718  232  5.08
Inleiding functioneel programmeren -> Systeemmodelleren 1  0.682  227  3.565
Wiskunde 2 -> Databases 1  0.81  269  3.542
Wiskunde 1 -> Databases 1  0.724  234  3.168
Systeemmodelleren 1 -> Databases 1  0.639  287  2.795
Operating systems -> Compilers  0.609  255  2.775
Compilers /\ Programmeren 1 -> Programmeren 2  0.769  250  2.334
Wiskunde 2 -> Programmeren 1  0.798  265  2.172
Automatentheorie en formele talen /\ Programmeren 2 -> Programmeren 1  0.763  267  2.076
Inleiding functioneel programmeren -> Programmeren 2  0.679  226  2.059
Systeemmodelleren 1 -> Programmeren 2  0.675  303  2.047
Wiskunde 1 -> Programmeren 1  0.749  242  2.038
Compilers -> Programmeren 2  0.67  345  2.032
Basiswiskunde 3 -> Programmeren 2  0.654  231  1.985
Compilers /\ Programmeren 2 -> Programmeren 1  0.725  250  1.972
Automatentheorie en formele talen -> Programmeren 1  0.717  420  1.95
Automatentheorie en formele talen /\ Programmeren 1 -> Programmeren 2  0.629  266  1.908
Implementatie -> Programmeren 2  0.628  245  1.906
Operating systems -> Programmeren 2  0.628  263  1.904
Databases 1 -> Programmeren 1  0.657  353  1.788
Programmeren 2 -> Programmeren 1  0.65  502  1.768
Compilers -> Programmeren 1  0.627  323  1.706
Systeemmodelleren 1 -> Programmeren 1  0.619  278  1.685
Table 4.10: Association rules found in the results of students who were
insufficient the first time they tried to pass the course.
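Association rules of this kind are commonly ranked by support, confidence and lift. The sketch below computes those three measures for a single rule on a tiny hypothetical set of records; the data, the function name `rule_measures` and the choice of measures are illustrative assumptions and are not taken from the table.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support count, confidence, lift) of antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)
    c = sum(1 for t in transactions if consequent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    confidence = both / a          # P(consequent | antecedent)
    lift = confidence / (c / n)    # confidence relative to P(consequent)
    return both, confidence, lift

# Hypothetical data: each set holds the courses one student failed at first try.
failed = [
    {"Wiskunde 1", "Wiskunde 2"},
    {"Wiskunde 1", "Wiskunde 2", "Databases 1"},
    {"Wiskunde 2"},
    {"Programmeren 1"},
    {"Programmeren 1", "Programmeren 2"},
]
support, confidence, lift = rule_measures(failed, {"Wiskunde 2"}, {"Wiskunde 1"})
```

Swapping antecedent and consequent leaves the support count and the lift unchanged and alters only the confidence.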
Figure 4.4: Clustergram with study length and courses.
Table 4.11: Courses for which it can be said that almost all students have a
good result.
Table 4.12: Course of the blue rectangle in figure 4.4.
Figure 4.5: JRIP classification rule used for emerging patterns.
Table 4.13: Support per year of the rule of figure 4.5.
Figure 5.1: Process extracted from the first year of all students.
Figure 5.2: Process extracted from the second year of all students.
Figure 5.3: Process extracted from the third year of all students.
Figure 5.4: Process extracted from the fourth year of all students.
Figure 5.5: Process extracted from the fifth year of all students.
Figure 5.6: Process extracted from the sixth year of all students.
Algebra 2 (1.3 and 2.1)
Basiswiskunde 3 (1.3)
Basiswiskunde 2 (1.2)
Algebra 1 (1.2)
Implementatie (1.3 and 2.1)
Basiswiskunde 1 (1.1)
Programmeren 3 (2.1)
Operating systems (2.3)
Compilers (2.2)
Automatentheorie en formele talen (1.3)
Programmeren 2 (1.3)
Programmeren 1 (1.2)
Table 5.1: Result of Frequent Itemset Mining on all courses
Node filter:
Significance cutoff: 0.430
Edge filter:
Cutoff: 0.042
Utility rt: 0.582
Node filter:
Significance cutoff: 0.413
Edge filter:
Cutoff: 0.042
Utility rt: 0.583
Node filter:
Significance cutoff: 0.405
Edge filter:
Cutoff: 0.032
Utility rt: 0.217583
Figure 5.7: Fuzzy models of the 3 different study times on the most frequent
courses.
Heuristic model of students with a short study time.
Relative to best threshold: 0.05
Positive observations: 10
Dependency threshold: 0.9
Connected: yes
Heuristic model of students with a short study time.
Relative to best threshold: 0.05
Positive observations: 10
Dependency threshold: 0.8
Connected: no
Figure 5.8: Heuristic model of students with a short study time.
Heuristic model of students with a normal study time.
Relative to best threshold: 0.05
Positive observations: 10
Dependency threshold: 0.93
Heuristic model of students with a normal study time.
Relative to best threshold: 0.05
Positive observations: 10
Dependency threshold: 0.93
Figure 5.9: Heuristic model of students with a normal study time.
Heuristic model of students with a long study time.
Relative to best threshold: 0.05
Positive observations: 10
Dependency threshold: 0.9
Connected: yes
Heuristic model of students with a long study time.
Relative to best threshold: 0.05
Positive observations: 10
Dependency threshold: 0.9
Connected: no
Figure 5.10: Heuristic model of students with a long study time
Figure 5.11: Petri net of the courses given to students with the start year
2004.
Figure 5.12: Result of conformance checking the students that start in 2004.
Figure 5.13: Sequence mining results for short study time students on the
courses of table “Result of Frequent Itemset Mining on all courses”.
Figure 5.14: Sequence mining results for normal study time students on the
courses of table “Result of Frequent Itemset Mining on all courses”.
Figure 5.15: Sequence mining results for long study time students on the
courses of table “Result of Frequent Itemset Mining on all courses”.
Table A1: Table names of the data with English translation.
Table A2: Fields of table Address with English translation.
Table A3: Fields of table exams with English translation.
Table A4: Fields of table personal details with English translation.
Table A5: Fields of table results with English translation.
Table A6: Fields of table study packages with English translation.
Table A7: Fields of table study package participants with English translation.
Table A8: Fields of table preparatory educations with English translation.
Table A9: Fields of the table preparatory education courses with English
translation.
Figure A2: The amount of exams per year. (x-axis: Year, 1982 to 2008; y-axis: Amount of exams, 0 to 300)
Table A10: Different exam assessments with English translation.
Table A11: JRIP statistics after mining on 2std, categorical attributes
Table A12: JRIP statistics after mining on 3std, categorical attributes
Table A13: JRIP statistics after mining on 2std, ordinal attributes
Table A14: JRIP statistics after mining on 3std, ordinal attributes
Table A15: JRIP statistics after cost sensitive mining on 3std, ordinal
attributes. v1
Table A16: JRIP statistics after cost sensitive mining on 3std, ordinal
attributes. v2
Table A17: Statistics of JRIP results on binary class with SMOTE.
Table A18: Statistics of Ridor results on binary class with SMOTE.
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: test-weka.filters.supervised.instance.SMOTE-C0-K5-P400.0-S1
Instances: 974
Attributes: 5745
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Id3
startYear = 1980: null
startYear = 1981: null
startYear = 1982: null
startYear = 1983: short-normal
startYear = 1984: long
startYear = 1985: long
startYear = 1986
| 2L500_firstResult<4 = -: long
| 2L500_firstResult<4 = 0
| | 1B053_firstResult<7 = -: null
| | 1B053_firstResult<7 = 0: long
| | 1B053_firstResult<7 = 1: short-normal
| 2L500_firstResult<4 = 1
| | 0Z060_trialCount>1 = -: long
| | 0Z060_trialCount>1 = 0: short-normal
| | 0Z060_trialCount>1 = 1: long
startYear = 1987
| 5F040_firstResult<7 = -
| | 0K060_trialCount>0 = -: long
| | 0K060_trialCount>0 = 0: null
| | 0K060_trialCount>0 = 1: short-normal
| 5F040_firstResult<7 = 0
| | 2K700_highestResult<8 = -: null
| | 2K700_highestResult<8 = 0: short-normal
| | 2K700_highestResult<8 = 1
| | | 2N010_highestResult<8 = -: null
| | | 2N010_highestResult<8 = 0: short-normal
| | | 2N010_highestResult<8 = 1
| | | | 1A060_trialCount>0 = -: short-normal
| | | | 1A060_trialCount>0 = 0: null
| | | | 1A060_trialCount>0 = 1: long
| 5F040_firstResult<7 = 1
| | 2F550_firstResult<3 = -: null
| | 2F550_firstResult<3 = 0: short-normal
| | 2F550_firstResult<3 = 1: long
startYear = 1988
| 2L711_firstResult<4 = -: short-normal
| 2L711_firstResult<4 = 0
| | 2M240_firstResult<8 = -: null
| | 2M240_firstResult<8 = 0
| | | 2M227_trialCount>1 = -: long
| | | 2M227_trialCount>1 = 0: short-normal
| | | 2M227_trialCount>1 = 1: long
| | 2M240_firstResult<8 = 1: short-normal
| 2L711_firstResult<4 = 1
| | 1B040_trialCount_np = -: long
| | 1B040_trialCount_np = 0: null
| | 1B040_trialCount_np = 1: short-normal
startYear = 1989
| 2L670_firstResult<7 = -: short-normal
| 2L670_firstResult<7 = 0
| | 2L140_trialCount>0 = -: long
| | 2L140_trialCount>0 = 0: null
| | 2L140_trialCount>0 = 1
| | | 0K060_trialCount>2 = -: null
| | | 0K060_trialCount>2 = 0: short-normal
| | | 0K060_trialCount>2 = 1: long
| 2L670_firstResult<7 = 1: short-normal
startYear = 1990
| 2L530_trialCount>0 = -
| | 2R707_firstResult<7 = -: short-normal
| | 2R707_firstResult<7 = 0
| | | 0L800_trialCount>0 = -: short-normal
| | | 0L800_trialCount>0 = 0: null
| | | 0L800_trialCount>0 = 1: long
| | 2R707_firstResult<7 = 1
| | | 2WS13_trialCount>1 = -: long
| | | 2WS13_trialCount>1 = 0: short-normal
| | | 2WS13_trialCount>1 = 1: long
| 2L530_trialCount>0 = 0: null
| 2L530_trialCount>0 = 1: short-normal
startYear = 1991
| 2Y420_trialCount>1 = -: short-normal
| 2Y420_trialCount>1 = 0
| | 2L340_firstResult<8 = -: short-normal
| | 2L340_firstResult<8 = 0: short-normal
| | 2L340_firstResult<8 = 1
| | | 2L500_trialCount>1 = -: null
| | | 2L500_trialCount>1 = 0: long
| | | 2L500_trialCount>1 = 1: short-normal
| 2Y420_trialCount>1 = 1
| | 2L085_firstResult<8 = -: null
| | 2L085_firstResult<8 = 0: short-normal
| | 2L085_firstResult<8 = 1
| | | 2L060_trialCount>2 = -: null
| | | 2L060_trialCount>2 = 0: long
| | | 2L060_trialCount>2 = 1: short-normal
startYear = 1992
| 2L060_trialCount>3 = -: null
| 2L060_trialCount>3 = 0: short-normal
| 2L060_trialCount>3 = 1: long
startYear = 1993
| 1B170_trialCount>6 = -: null
| 1B170_trialCount>6 = 0
| | 1B050_trialCount_np = -: short-normal
| | 1B050_trialCount_np = 0: null
| | 1B050_trialCount_np = 1: long
| 1B170_trialCount>6 = 1: long
startYear = 1994
| 2L711_firstResult<3 = -: short-normal
| 2L711_firstResult<3 = 0
| | 0K060_trialCount>1 = -: null
| | 0K060_trialCount>1 = 0: short-normal
| | 0K060_trialCount>1 = 1: long
| 2L711_firstResult<3 = 1: long
startYear = 1995
| 1J210_firstResult<6 = -: short-normal
| 1J210_firstResult<6 = 0
| | 1Z340_trialCount>2 = -: short-normal
| | 1Z340_trialCount>2 = 0: short-normal
| | 1Z340_trialCount>2 = 1: long
| 1J210_firstResult<6 = 1: long
startYear = 1996
| 2R237_highestResult<8 = -: null
| 2R237_highestResult<8 = 0
| | 2M004_trialCount>0 = -: long
| | 2M004_trialCount>0 = 0: null
| | 2M004_trialCount>0 = 1: short-normal
| 2R237_highestResult<8 = 1
| | 2Y380_highestResult<8 = -: short-normal
| | 2Y380_highestResult<8 = 0
| | | 2IN40_trialCount_np = -: short-normal
| | | 2IN40_trialCount_np = 0: null
| | | 2IN40_trialCount_np = 1: long
| | 2Y380_highestResult<8 = 1
| | | 2R077_firstResult<7 = -
| | | | 1A350_trialCount>1 = -: null
| | | | 1A350_trialCount>1 = 0: long
| | | | 1A350_trialCount>1 = 1: short-normal
| | | 2R077_firstResult<7 = 0
| | | | 1C200_trialCount_np = -: short-normal
| | | | 1C200_trialCount_np = 0: null
| | | | 1C200_trialCount_np = 1: long
| | | 2R077_firstResult<7 = 1: long
startYear = 1997
| 2io60_trialCount>0 = -
| | 1B170_firstResult<10 = -: null
| | 1B170_firstResult<10 = 0: long
| | 1B170_firstResult<10 = 1: short-normal
| 2io60_trialCount>0 = 0: null
| 2io60_trialCount>0 = 1: long
startYear = 1998
| 2IH20_trialCount>2 = -: null
| 2IH20_trialCount>2 = 0
| | 2M204_trialCount>0 = -
| | | 2M090_trialCount>1 = -: short-normal
| | | 2M090_trialCount>1 = 0: long
| | | 2M090_trialCount>1 = 1: short-normal
| | 2M204_trialCount>0 = 0: null
| | 2M204_trialCount>0 = 1: short-normal
| 2IH20_trialCount>2 = 1
| | 2M980_firstResult<7 = -: long
| | 2M980_firstResult<7 = 0
| | | 2M927_firstResult<6 = -: short-normal
| | | 2M927_firstResult<6 = 0: short-normal
| | | 2M927_firstResult<6 = 1: long
| | 2M980_firstResult<7 = 1: long
startYear = 1999
| 1C200_firstResult<5 = -: short-normal
| 1C200_firstResult<5 = 0: short-normal
| 1C200_firstResult<5 = 1
| | 2Y345_firstResult<8 = -: long
| | 2Y345_firstResult<8 = 0: short-normal
| | 2Y345_firstResult<8 = 1
| | | 2F540_trialCount>2 = -: long
| | | 2F540_trialCount>2 = 0: long
| | | 2F540_trialCount>2 = 1: short-normal
startYear = 2000
| 2M227_highestResult<7 = -: null
| 2M227_highestResult<7 = 0: short-normal
| 2M227_highestResult<7 = 1
| | 0L800_trialCount>0 = -: long
| | 0L800_trialCount>0 = 0: null
| | 0L800_trialCount>0 = 1: short-normal
startYear = 2001: short-normal
startYear = 2002: null
startYear = 2003: null
startYear = 2004: null
startYear = 2005: null
startYear = 2006: null
startYear = 2007: null
startYear = 2008: null
startYear = 2009: null
Time taken to build model: 11.38 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 858 88.0903 %
Incorrectly Classified Instances 112 11.499 %
Kappa statistic 0.769
Mean absolute error 0.1155
Root mean squared error 0.3398
Relative absolute error 23.2354 %
Root relative squared error 68.1697 %
UnClassified Instances 4 0.4107 %
Total Number of Instances 974
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.87 0.099 0.906 0.87 0.887 0.884 short-normal
0.901 0.13 0.863 0.901 0.882 0.884 long
Weighted Avg. 0.885 0.114 0.885 0.885 0.885 0.884
=== Confusion Matrix ===
a b <-- classified as
441 66 | a = short-normal
46 417 | b = long
Table A19: Statistics of ID3 results on binary class with SMOTE.
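The summary figures of the ID3 run can be re-derived from its confusion matrix. A minimal sketch, assuming the kappa statistic is Cohen's kappa computed over the 970 classified instances only (the 4 unclassified instances are left out) and that the accuracy percentage is taken over all 974 instances; the function name `cohen_kappa` is hypothetical:

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Confusion matrix of the ID3 run: rows a = short-normal, b = long.
confusion = [[441, 66], [46, 417]]
kappa = cohen_kappa(confusion)
accuracy = (441 + 417) / 974  # correctly classified share of all instances
```

Under these assumptions `kappa` rounds to 0.769 and `accuracy` to 88.0903 %, the values listed in the summary.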
Table A20: Statistics of NaïveBayes results on binary class with SMOTE.