FORECASTING A STUDENT’S EDUCATION
FULFILLMENT USING REGRESSION ANALYSIS
Submitted by
RAM G ATHREYA
Roll No.: 1202FOSS0019 Reg. No.: 75812200021
A PROJECT REPORT
Submitted to the
FACULTY OF SCIENCE AND HUMANITIES
in partial fulfillment of the requirements for the award of the degree of
MASTER OF SCIENCE IN
FREE / OPEN SOURCE SOFTWARE (CS-FOSS)
CENTRE FOR DISTANCE EDUCATION
ANNA UNIVERSITY
CHENNAI 600 025
AUGUST 2014
CENTRE FOR DISTANCE EDUCATION
ANNA UNIVERSITY
CHENNAI 600 025
BONA FIDE CERTIFICATE
Certified that this Project report titled “FORECASTING A STUDENT’S
EDUCATION FULFILLMENT USING REGRESSION ANALYSIS” is the bona fide work of Mr. RAM G
ATHREYA, who carried out the research under my supervision. I certify further, that
to the best of my knowledge the work reported herein does not form part of any
other Project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.
RAM G ATHREYA Dr. SRINIVASAN SUNDARARAJAN
Student at Anna University Professor
CERTIFICATE OF VIVA-VOCE-EXAMINATION
This is to certify that Thiru/Mr. RAM G ATHREYA
(Roll No. 1202FOSS0019; Register No. 75812200021) has been subjected to Viva-voce Examination on 14 September 2014 at 9:30 AM at the Study Centre, the AU-KBC Research Centre, Madras Institute of Technology, Anna University, Chrompet, Chennai 600044.
Internal Examiner External Examiner
Name : Name :
(in capital letters) (in capital letters)
Designation : Designation :
Address : Address :
Coordinator centre
Name :
(in capital letters)
Designation :
Address :
Date :
ACKNOWLEDGEMENT
I am highly indebted to my guide Dr. SRINIVASAN SUNDARARAJAN for his guidance, monitoring, constant supervision, kind co-operation and encouragement that helped me in the completion of this project.
I would also like to express my special gratitude to the AU-KBC faculty members involved in the M.Sc. (CS-FOSS) course for their cordial support and guidance, as well as for providing necessary information regarding the project and for their support in completing the project.
Finally, I thank the Centre for Distance Education, Anna University, for giving me an
opportunity to do this project.
ABSTRACT
Our government spends a substantial amount of resources on educating our children. Additionally, several welfare schemes aimed especially at underprivileged children have been introduced to ensure that all of them complete a basic level of education. In spite of these measures, many students do not complete their basic education.
The aim of this project is to formulate a Supervised Learning Algorithm that will aid in identifying students who have a higher likelihood of not completing their education.
To perform this task the algorithm will perform Logistic Regression
Analysis on historical data of students from a given school. The historical data
includes basic background information (features) such as gender, community,
number of siblings etc. It must be noted that the historical data also contains
information on whether the student completed his/her education, which is the
outcome we are interested in. Typically a student finishing education will be
denoted using a value of 1 and a student not finishing will be denoted with a value
of 0.
Based on the training (historical) data a logistic classifier can be built. Such
a classifier after learning from the training set will develop specific weightages for
each of the features. These weightages can then be extrapolated into an equation
that can be used for prediction.
That is we can apply the equation on a current student (whose background
we already know) to calculate the probability that he/she will complete his/her
education.
Such an algorithm will be beneficial to government agencies since it can serve as an early warning system with which they can take proactive action to prevent a student from dropping out. Policy makers can also use it as a tool to identify schools that are more vulnerable and direct their resources and energies to help them.
ABSTRACT IN TAMIL

[The Tamil version of the abstract could not be recovered from this copy; its text was lost to character-encoding corruption. See the English abstract above.]
TABLE OF CONTENTS
CHAPTER NO TITLE PAGE NO
ACKNOWLEDGEMENT iv
ABSTRACT v
ABSTRACT IN TAMIL vii
LIST OF FIGURES xii
LIST OF TABLES xiii
LIST OF ABBREVIATIONS xiv
1 INTRODUCTION 1
1.1 OVERVIEW OF THE PROJECT 1
1.2 LITERATURE SURVEY 2
1.3 PROPOSED SYSTEM 2
1.4 SCOPE 2
2 REQUIREMENT SPECIFICATION 4
2.1 INTRODUCTION 4
2.2 OVERALL DESCRIPTION 4
2.2.1 PRODUCT PERSPECTIVE 5
2.2.2 PRODUCT FUNCTIONS 5
3 PROJECT REQUIREMENTS 7
3.1 SOFTWARE REQUIREMENTS 7
3.2 HARDWARE REQUIREMENTS 7
4 SYSTEM DESIGN 9
4.1 METHODOLOGY 9
4.2 ALGORITHM 9
4.2.1 SUPERVISED LEARNING 10
4.2.2 CLASSIFICATION 11
4.2.3 LOGISTIC REGRESSION 13
4.3 DATA COLLECTION 15
4.3.1 FEATURE DETECTION 15
4.3.1.1 PERSONAL 15
4.3.1.2 ENVIRONMENTAL 15
4.3.1.3 SCHOOL 16
4.3.2 DATASET GENERATION 16
4.4 MODELING 18
4.4.1 HYPOTHESIS DEVELOPMENT 19
4.4.2 GENERALIZATION ERROR 19
4.5 VALIDATION 20
4.5.1 DATASET PARTITIONING 21
4.5.1.1 TRAINING DATASET 21
4.5.1.2 CV DATASET 22
4.5.2 COST FUNCTION 23
4.5.3 ERROR METRICS 24
4.5.3.1 TRAINING AND CV ERROR 25
4.5.3.2 F1 SCORE 25
4.5.3.3 W – SCORE 26
4.5.4 LEARNING CURVES 27
4.6 PREDICTION 29
5 IMPLEMENTATION 31
5.1 R 31
5.1.1 COST FUNCTION.R 31
5.1.2 F1SCORE.R 31
5.1.3 GENERATEDATASET.R 32
5.1.4 GENERATEVECTOR.R 34
5.1.5 INIT.R 36
5.1.6 LEARNINGCURVE.R 37
5.1.7 MYSQL.R 39
5.1.8 PERCRANK.R 39
5.1.9 PLOTLEARNINGCURVE.R 39
5.1.10 PREDICTION.R 40
5.1.11 PREDICTOR.R 40
5.1.12 RANDOMIZEDATASET.R 41
5.2 NODE.JS 41
5.2.1 APP.JS 41
5.2.2 PACKAGE.JSON 42
5.2.3 ROUTES.JS 43
5.2.4 INDEX.JADE 45
5.2.5 PREDICT.JADE 47
5.2.6 UPLOAD.JADE 52
6 RESULTS 54
6.1 DATASET UPLOAD 54
6.2 UPLOAD RESULT 55
6.3 PREDICTION 56
7 CONCLUSIONS 57
8 REFERENCES 58
LIST OF FIGURES
FIGURE NO TITLE PAGE NO
4.1 Logistic Regression Curve
4.2 Dataset Generation
4.3 Modeling
4.4 Dataset Partitioning
4.5 Developing Multiple Models
4.6 Calculating Cross-Validation Errors
4.7 Single Subject Learning
4.8 Learning from Experience
4.9 Score & Learning Time vs Experience
4.10 Training & Cross-Validation Error Convergence
4.11 Choosing the Best Model
4.12 Prediction
6.1 Upload Result
6.2 Prediction Screen
6.3 Predicting Student will not Dropout
6.4 Predicting Student will Dropout
LIST OF TABLES
TABLE NO TITLE PAGE NO
4.1 Sample Dataset 17
LIST OF ABBREVIATIONS
FOSS Free and Open Source Software
IDE Integrated Development Environment
OS Operating System
PTR Pupil Teacher Ratio
SCR Student Classroom Ratio
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW OF THE PROJECT
Dropout is a universal phenomenon of the education system in India, spread across all levels of education, in all parts of the country, and across socio-economic groups. The dropout rates are much higher for educationally backward states and districts. Girls in India tend to have higher dropout rates than boys. Similarly, children belonging to socially disadvantaged groups like Scheduled Castes and Scheduled Tribes have higher dropout rates in comparison to the general population.
There are also regional and location-wise differences: children living in rural areas are more likely to drop out of school. In order to reduce wastage and improve the efficiency of the education system, educational planners need to understand and identify the social groups that are more susceptible to dropout and the reasons for their dropping out.
Keeping the above context in perspective, it would be helpful to develop a
system or an algorithm that can systematically identify such vulnerable students
who have a higher likelihood of dropping out from school. The goal of this project
is to develop such an algorithm or system.
Hopefully such an algorithm or system could assist educational planners and
administrative staff of educational institutions to better allocate resources and
make better decisions, which could curb this growing dropout problem.
1.2 LITERATURE SURVEY
The literature survey covers existing research and studies with respect to the
dropout problem. They are grouped into three broad categories:
1. Research Papers
2. Surveys
3. Government Reports
The detailed list of resources researched during the literature survey is
provided in the references section.
1.3 PROPOSED SYSTEM
The proposed system will implement an algorithm that will take in student data as input and learn from it. This learned function, otherwise called the hypothesis, will serve as an approximate explanation of the data. Error metrics and validation techniques will be used to determine the accuracy of the hypothesis. The hypothesis that best fits the data will then be used for prediction. The final goal of the algorithm is to make reasonably accurate predictions on new unlabeled data, i.e. data for which the outcome is unknown.
This system will be implemented in such a way that it can be operated from a
web interface where the user can upload datasets as well as make predictions
based on learned data.
1.4 SCOPE

The algorithm developed is an exploratory proof-of-concept system that uses machine learning and statistical techniques to make predictions based on student data. The validity of the results is entirely dependent on the accuracy of the data and how the algorithm processes it.

Since comprehensive student data was not available for making the algorithm as accurate as possible, this iteration of the system can only serve as a proof of concept of what is possible and cannot be directly used in the real world, in its present form, as a decision-making or policy-making tool.
CHAPTER 2
REQUIREMENT SPECIFICATION
2.1 INTRODUCTION
A software requirements specification (SRS) defines the requirements of a
software system. It is a description of the behavior of a system to be developed
and may include a set of use cases. In addition it also contains non-functional
requirements. Non-functional requirements impose constraints on the design or
implementation (such as performance requirements, quality standards, or design
constraints).
This project requires storage and processing of medium to large volumes of data/datasets. Such datasets will be passed through the algorithm initially during a training phase, during which the algorithm will learn from the training data. After training is completed the algorithm would then be required to make predictions for new unlabeled data based on what it learned from the training data.

Additionally it would be helpful if the algorithm can be operated from a Web User Interface, which will be more user friendly than issuing commands from the command line.
2.2 OVERALL DESCRIPTION
This section outlines a holistic description of the project, including its different perspectives, constraints, and functional and non-functional requirements.
2.2.1 PRODUCT PERSPECTIVE
The system has four main tasks:

1. Data Collection
2. Modeling
3. Validation
4. Prediction
In the data collection phase the data required for the algorithm is gathered, converted into a suitable form, and supplied to the system for learning.

In the modeling phase the algorithm tries to generate models that explain the data that has been gathered. Machine Learning techniques are used in this phase to generate multiple models, of which the best gets chosen in later stages.

In the validation phase the different models are evaluated based on performance and the best among them is chosen as the candidate algorithm that can be used for prediction.

Finally in the prediction phase the chosen model is used for making actual real world predictions.
2.2.2 PRODUCT FUNCTIONS
The system has two main functions:

1. Training
2. Prediction

In the training phase the dataset is supplied to the algorithm, from which the best model is developed for prediction.

In the prediction phase the learnt algorithm can actually be put to use; that is, it can be used to make predictions for unlabeled data.
How these processes are implemented is explained in detail in
subsequent sections.
CHAPTER 3
PROJECT REQUIREMENTS
The project requirement is to develop an algorithm that can classify students according to whether they will complete their education or drop out. To achieve this, a system needs to be created that can be operated from a web user interface, through which the user can supply data for training or make predictions based on already trained data.
3.1 SOFTWARE REQUIREMENTS
The software requirements for this project are:
R – R is a free software programming language and software environment for statistical computing and graphics.

Node.js – Node.js is a cross-platform runtime environment and a library for running applications written in JavaScript outside the browser (for example, on the server).

NetBeans – NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C++, Node.js & HTML5.

RStudio – RStudio is a free and open source (FOSS) integrated development environment for R, a programming language for statistical computing and graphics.

LINUX – LINUX is a POSIX-compliant computer operating system (OS) assembled under the model of free and open source software.
3.2 HARDWARE REQUIREMENTS
The hardware requirements define a set of (minimum) hardware that must
be available to run the system.
Hardware System that can support LINUX Operating System
2 – 4 GB of RAM
Internet Connectivity
CHAPTER 4
SYSTEM DESIGN
System design is the process of defining the architecture, components,
modules, interfaces and data for a system to satisfy specified requirements. System
design encompasses activities such as systems analysis, systems architecture and
systems engineering.
4.1 METHODOLOGY
A software development methodology or system development methodology
in software engineering is a framework that is used to structure, plan and control
the process of developing a software system.
This project consists of four distinct phases:

1. Data Collection
2. Modeling
3. Validation
4. Prediction
4.2 ALGORITHM
The system will use a Logistic Regression Classifier, which is a Supervised Machine Learning Algorithm. This algorithm will take student data as input and predict an outcome. The outcome is binary, that is, either TRUE or FALSE. A TRUE value indicates that a student will drop out while FALSE means the student will not drop out.
Since the algorithm will return only one of two possible outcomes it can also be called a binary (binomial) classifier.
4.2.1 SUPERVISED LEARNING
Supervised learning is the machine-learning task of inferring a
function from labeled training data. The training data consist of a set of
training examples. Typically the training data for this project will consist of
data about students based on features that will be defined later in this
document.
In supervised learning, each example is a pair consisting of an input
object (typically a vector) and a desired output value (also called the
supervisory signal). A supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping new
examples. New examples are usually unlabeled data that we need to predict.
An optimal scenario will allow for the algorithm to correctly determine the
class labels for unseen instances. This requires the learning algorithm to
generalize from the training data to unseen situations in a "reasonable" way.
In order to solve a given problem of supervised learning, the system has to perform the following steps:

1. Determine the type of training examples: The kind of data that is to be used as the training set needs to be determined first. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.

2. Gather a training set: The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.

3. Determine the input feature representation of the learned function: The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, but it should contain enough information to accurately predict the output.

4. Determine the learning algorithm: The correct learning algorithm that models the available data should be identified and applied. For example, the learning algorithm may use support vector machines or decision trees.

5. Complete the design: Run the learning algorithm on the gathered training set. Some supervised learning algorithms require certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.

6. Evaluate the accuracy of the learned function: After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
4.2.2 CLASSIFICATION
In machine learning and statistics, classification is the problem of
identifying to which of a set of categories (sub-populations) a new
observation belongs, on the basis of a training set of data containing
observations (or instances) whose category membership is known. The
individual observations are analyzed into a set of quantifiable properties,
known as various explanatory variables, features, etc. These properties may
variously be categorical (e.g. "A", "B", "AB" or "O", for blood type),
ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a
measurement of blood pressure). Some algorithms work only in terms of
discrete data and require that real-valued or integer-valued data be
discretized into groups (e.g. less than 5, between 5 and 10, or greater than
10). An example would be assigning a given email into "spam" or "non-
spam" classes or assigning a diagnosis to a given patient as described by
observed characteristics of the patient (gender, blood pressure, presence or
absence of certain symptoms, etc.).
An algorithm that implements classification, especially in a concrete
implementation, is known as a classifier. The term "classifier" sometimes
also refers to the mathematical function, implemented by a classification
algorithm, that maps input data to a category.
In the terminology of machine learning, classification is considered
an instance of supervised learning, i.e. learning where a training set of
correctly identified observations is available. The corresponding
unsupervised procedure is known as clustering or cluster analysis, and
involves grouping data into categories based on some measure of inherent
similarity (e.g. the distance between instances, considered as vectors in a
multi-dimensional vector space).
In statistics, where classification is often done with logistic
regression or a similar procedure, the properties of observations are termed
explanatory variables (or independent variables, regressors, etc.), and the
categories to be predicted are known as outcomes, which are considered to
be possible values of the dependent variable. In machine learning, the
observations are often known as instances, the explanatory variables are
termed features (grouped into a feature vector), and the possible categories
to be predicted are classes. There is also some argument over whether
classification methods that do not involve a statistical model can be
considered "statistical".
4.2.3 LOGISTIC REGRESSION
In statistics, logistic regression, or logit regression, is a type of probabilistic statistical classification model. It is used for predicting the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features); that is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Logistic regression refers specifically to the problem in which the dependent variable is binary, that is, the number of available categories is two, while problems with more than two categories are referred to as multinomial logistic regression.
Logistic regression measures the relationship between a categorical
dependent variable and one or more independent variables, which are
usually (but not necessarily) continuous, by using probability scores as the
predicted values of the dependent variable.
Fig 4.1 : Logistic Regression Curve
The formula for Logistic Regression can be expressed as:

F(x) = 1 / (1 + e^(−x))

Eq 4.1 : Logistic Regression Formula

where:
F(x) is the output
x is the input
e is Euler’s number
It must be noted that F(x) can take values only between 0 and 1 for any value of x in (−∞, ∞). Using the above equation we can define a threshold k ∈ (0, 1) such that all inputs with F(x) ≥ k are classified as true while those lesser are classified as false (or vice versa), thereby classifying the data into two distinct parts.
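As a concrete illustration, the logistic function of Eq 4.1 and the threshold rule above can be sketched in a few lines (an illustrative Python sketch; the project itself is implemented in R, and the names `logistic` and `classify` are ours):

```python
import math

def logistic(x):
    """Eq 4.1: F(x) = 1 / (1 + e^(-x)), which maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, k=0.5):
    """Apply a threshold k in (0, 1): inputs with F(x) >= k are labeled 1
    (true), all others 0 (false), splitting the data into two parts."""
    return 1 if logistic(x) >= k else 0
```

Any monotone shift of k trades false positives against false negatives without retraining the model.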
4.3 DATA COLLECTION
4.3.1 FEATURE DETECTION
Based on the literature survey six features have been identified as
major observable factors that can affect the final outcome regarding the
education fulfillment of a student.
The six features can be grouped into three categories that are:
1. Personal Features
2. Environmental Features
3. School Features
4.3.1.1 PERSONAL
Personal features are those features that are based on the
characteristics of the student or his/her parents, family background
etc. The personal features that are being considered by the algorithm
are:
1. Gender: Values can be Male or Female
2. Poverty: Values can be Yes or No
3. Community: Values can be General, OBC, SC, ST
4.3.1.2 ENVIRONMENTAL
Environmental features are those features that are based on
the student’s environment, locality, geography etc. The
environmental features that are being considered by the algorithm
are:
1. Rural: Values can be Yes or No
4.3.1.3 SCHOOL
School features are those features that are based on the
characteristics of the school where the student studies. The school
features that are being considered by the algorithm are:
Pupil Teacher Ratio: Pupil–teacher ratio is the number of students
who attend a school or university divided by the number of teachers
in the institution. For example, a pupil–teacher ratio of 10:1
indicates that there are 10 students for every one teacher. The term
can also be reversed to create a teacher–pupil ratio.
Student Classroom Ratio: Student – classroom ratio is the number
of students per classroom in an education institution. For example, a
student – classroom ratio of 40:1 indicates that there are 40 students
for every classroom.
1. Pupil Teacher Ratio: Values can be Low (1 Teacher :
<30 Students), Medium (1 Teacher : 30 – 40 Students) and
High (1 Teacher : 40+ Students)
2. Student Classroom Ratio: Values can be Low (1
Classroom : <30 Students), Medium (1 Classroom: 30 –
40 Students) and High (1 Classroom: 40+ Students)
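The banding rules above can be written as one small helper that serves both ratios (a Python sketch; the function name `ratio_band` is ours, not part of the project code, and the behavior at exactly 40 students follows the Medium band):

```python
def ratio_band(students_per_unit):
    """Bucket a pupil-teacher or student-classroom ratio into the bands
    defined above: Low (<30 students), Medium (30-40), High (40+)."""
    if students_per_unit < 30:
        return "Low"
    if students_per_unit <= 40:
        return "Medium"
    return "High"
```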
4.3.2 DATASET GENERATION
Based on statistics derived from the literature survey and the features mentioned above, the dataset for modeling is generated. The table given below summarizes statistical findings compiled from the literature survey:
Feature Value Distribution Dropout Chance
Gender Male 52% 39%
Gender Female 48% 41%
Poverty Yes 22% 80%
Poverty No 78% 27%
Rural Yes 75% 45%
Rural No 25% 20%
Community General 30% 10%
Community OBC 40% 48%
Community SC 20% 64%
Community ST 10% 69%
PTR Low 20% 15%
PTR Medium 30% 35%
PTR High 50% 55%
SCR Low 18% 22%
SCR Medium 33% 25%
SCR High 49% 60%
Table 4.1 : Sample Dataset
The above table shows the distribution of each feature in the student population and the corresponding dropout chance for each feature value within that population. For example, when considering 100 students there are 52 male students and 48 female students, and the chance that a female student drops out is 41%.

The overall dropout percentage was found to be 40%; that is, 40% of the student population drop out of school. Using the above statistics a dataset can be generated for further analysis.
Fig 4.2 : Dataset Generation
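The generation step can be sketched as drawing each student record feature by feature according to the "Distribution" column of Table 4.1 (a minimal Python sketch; the project's actual generator is the R script generateDataset.R listed in Chapter 5, and the seed here is an arbitrary choice):

```python
import random

# Marginal distributions from the "Distribution" column of Table 4.1
# (these are population shares, not the dropout chances).
FEATURES = {
    "Gender":    [("Male", 0.52), ("Female", 0.48)],
    "Poverty":   [("Yes", 0.22), ("No", 0.78)],
    "Rural":     [("Yes", 0.75), ("No", 0.25)],
    "Community": [("General", 0.30), ("OBC", 0.40), ("SC", 0.20), ("ST", 0.10)],
    "PTR":       [("Low", 0.20), ("Medium", 0.30), ("High", 0.50)],
    "SCR":       [("Low", 0.18), ("Medium", 0.33), ("High", 0.49)],
}

def sample_value(pairs, rng):
    """Draw one categorical value according to its share of the population."""
    r, acc = rng.random(), 0.0
    for value, share in pairs:
        acc += share
        if r < acc:
            return value
    return pairs[-1][0]   # guard against floating-point rounding

def generate_dataset(n, seed=1):
    """Generate n synthetic student records with the Table 4.1 marginals."""
    rng = random.Random(seed)
    return [{f: sample_value(dist, rng) for f, dist in FEATURES.items()}
            for _ in range(n)]
```

A dropout label can then be attached to each record using the per-value dropout chances, giving a labeled dataset for training.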
4.4 MODELING
Data modeling in software engineering is the process of creating a data
model for an information system by applying formal data modeling techniques.
Fig 4.3 : Modeling
4.4.1 HYPOTHESIS DEVELOPMENT
A Hypothesis (plural hypotheses) is a proposed explanation for a
phenomenon. A working hypothesis is a provisionally accepted hypothesis
proposed for further research. In the context of Machine Learning the
hypotheses is also called as the Learned Function.
In the context of this project the learned function is a working
hypothesis that tries to explain the training dataset of students. Based on the
observations/outcomes of the training dataset the learned algorithm will
develop weightages for each of the features that have been selected. These
weightages will then be used for predicting outcomes in a future dataset.
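The idea of developing one weightage per feature can be sketched with plain stochastic gradient descent (an illustrative Python sketch under simplified assumptions — numeric 0/1 features and a fixed learning rate; the project's actual model is fit in R, and `learn_weights` is our own name):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def learn_weights(X, y, alpha=0.1, iters=1000):
    """Develop a weightage for each feature (plus an intercept w[0]) by
    stochastic gradient descent on the logistic model; X is a list of
    numeric feature vectors and y the list of 0/1 outcomes."""
    n = len(X[0])
    w = [0.0] * (n + 1)
    for _ in range(iters):
        for xi, yi in zip(X, y):
            h = logistic(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = h - yi                    # prediction error for this student
            w[0] -= alpha * err
            for j in range(n):
                w[j + 1] -= alpha * err * xi[j]
    return w
```

Once learned, the weightages plug back into Eq 4.1 as F(w0 + w1·x1 + … + wn·xn) to give a dropout probability for a new student.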
4.4.2 GENERALIZATION ERROR
The generalization error of a machine-learning model is a function
that measures how well a learning machine generalizes to unseen data. It is
measured as the distance between the error on the training set and the test
set and is averaged over the entire set of possible training data that can be
generated after each iteration of the learning process. It has this name
because this function indicates the capacity of a machine that learns with
the specified algorithm to infer a rule (or generalize).
The theoretical model assumes a probability distribution of the examples, and a function giving the exact target. The model can also include noise in the example (in the input and/or target output). The generalization error is usually defined as the expected value of the square of the difference between the learned function and the exact target (mean-square error).

The performance of a machine learning algorithm is measured by plots of the generalization error values through the learning process; such plots are called learning curves.
4.5 VALIDATION
In statistics, model validation is the process of deciding whether the
numerical results quantifying hypothesized relationships between variables,
obtained from machine learning analysis, are in fact acceptable as descriptions of
the data.
The validation process can involve analyzing the goodness of fit of the
model, analyzing whether the model residuals are random, and checking whether
the model's predictive performance deteriorates substantially when applied to data
that were not used in model estimation.
4.5.1 DATASET PARTITIONING
In model validation for assessing the results of statistical analysis the dataset is generally partitioned into two separate datasets. They are:

1. Training Dataset
2. Cross-Validation (CV) Dataset

The model is typically trained on the training dataset and then tested on the cross-validation dataset, which contains examples that are independent of the training data. The actual training / cross-validation split is up to the person doing the analysis; usually a split of 80–20 (training–CV) or 70–30 is preferred so that there are enough examples for training the model.
Fig 4.4 : Dataset Partitioning
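The partitioning described above amounts to a shuffle-and-cut (a minimal Python sketch; the default 70–30 fraction and the seed are illustrative choices, not the project's fixed values):

```python
import random

def partition(dataset, train_fraction=0.7, seed=42):
    """Shuffle the examples and split them into a training set and a
    cross-validation set (70-30 here; 80-20 is the other common choice)."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

Shuffling before cutting matters: if the file happens to be sorted (say, by school), an unshuffled cut would give the model a biased training set.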
4.5.1.1 TRAINING DATASET
A training set is a set of data used in various areas of
information science to discover potentially predictive relationships.
Training sets are used in artificial intelligence, machine learning,
genetic programming, intelligent systems, and statistics. In all these
fields, a training set has much the same role and is often used in
conjunction with a test set.
Fig 4.5 : Developing Multiple Models
4.5.1.2 CV DATASET
Cross-validation, sometimes called rotation estimation, is a
model validation technique for assessing how the results of a
statistical analysis will generalize to an independent data set. It is
mainly used in settings where the goal is prediction, and one wants
to estimate how accurately a predictive model will perform in
practice. In a prediction problem, a model is usually given a dataset
of known data on which training is run (training dataset), and a
dataset of unknown data (or first seen data) against which the model
is tested (testing dataset). The goal of cross validation is to define a
dataset to "test" the model in the training phase (i.e., the validation
dataset), in order to limit problems like overfitting, give an insight
on how the model will generalize to an independent data set (i.e., an
unknown dataset, for instance from a real problem), etc.
One round of cross-validation involves partitioning a sample
of data into complementary subsets, performing the analysis on one
subset (called the training set), and validating the analysis on the
other subset (called the validation set or testing set). To reduce
variability, multiple rounds of cross-validation are performed using
different partitions, and the validation results are averaged over the
rounds.
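One round of this procedure can be sketched as follows; a hedged Python illustration of k-fold partitioning (the report's own implementation uses a single hold-out CV set rather than averaged rounds):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of k rounds."""
    indices = list(range(n))
    fold_size = n // k                      # assumes n is divisible by k
    for i in range(k):
        validation = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, validation

# Five rounds over ten examples: every example is validated exactly once
folds = list(k_fold_indices(10, 5))
```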
Fig 4.6 : Calculating Cross-Validation Errors
4.5.2 COST FUNCTION
In mathematical optimization, statistics, decision theory and machine
learning, a cost function or loss function is a function that maps an event or
values of one or more variables onto a real number intuitively representing
some "cost" associated with the event. An optimization problem seeks to
minimize a loss function. An objective function is either a loss function or
its negative (sometimes called a reward function or a utility function), in
which case it is to be maximized.
In statistics, typically a loss function is used for parameter
estimation, and the event in question is some function of the difference
between estimated and true values for an instance of data.
The cost function is expressed as :
J(θ) = (1 / 2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Eq 4.2 : Cost Function or Error Function
where :
J is the Cost
m is the number of training examples
h_θ(x) is the hypothesis
y is the actual value or the result vector
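Eq 4.2 translates directly into code; a small Python sketch for illustration (the project's own costFunction.R, listed in Chapter 5, computes the same quantity):

```python
def cost(predictions, actuals):
    """J = (1 / 2m) * sum over i of (h(x_i) - y_i)^2."""
    m = len(actuals)
    return sum((h - y) ** 2 for h, y in zip(predictions, actuals)) / (2 * m)

# A perfect set of predictions has zero cost
assert cost([1, 0, 1], [1, 0, 1]) == 0.0
```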
4.5.3 ERROR METRICS
Error metrics are systematic benchmarking measures used to calculate the
accuracy or effectiveness of the system. The cost function described above is a
good example of an error metric. The following error metrics are used for
validating the generated models and for choosing the best among them:
1. Training and CV Error
2. F1 Score
3. W – Score
4.5.3.1 TRAINING AND CV ERROR
The training error is the cost-function error of the trained model on
the training set. That is, after training, the training dataset is supplied
again to the model as input to make predictions. These predictions are
compared against the actual outcomes in the dataset, and the error between
the two is calculated using the cost function formula. The resulting value
is the training error.
The cross-validation error is similar to the training error except
that it is calculated on the cross-validation set. The benefit here is that
the cross-validation set consists of new data containing none of the
training examples, and can therefore give a better estimate of the accuracy
of the system. Ideally the system's cross-validation error should be
similar to the training error, in which case the model is a good fit for
the underlying data.
4.5.3.2 F1 Score
In statistical analysis of binary classification, the F1 score
(also F-score or F-measure) is a measure of a test's accuracy. It
considers both the precision p and the recall r of the test to compute
the score: p is the number of correct results divided by the number of
all returned results and r is the number of correct results divided by
the number of results that should have been returned. The F1 score
can be interpreted as a weighted average of the precision and recall,
where an F1 score reaches its best value at 1 and worst score at 0.
F1 = (2 · Precision · Recall) / (Precision + Recall)
Eq 4.3 : F1 – Score
Precision = True Positives / (True Positives + False Positives)
Eq 4.4 : Precision
Recall = True Positives / (True Positives + False Negatives)
Eq 4.5 : Recall
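Eqs 4.3-4.5 can be combined into one short function; a minimal Python sketch for binary 0/1 labels (names are illustrative only):

```python
def f1_score(actual, predicted):
    """F1 = 2 * precision * recall / (precision + recall) for 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```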
4.5.3.3 W – Score
The W-Score combines the training and cross-validation errors with the
F1 score and is used to choose the best model: the model with the least
W-Score is selected. The W-Score is expressed as :
W = (1 − F1) · (Σ Train Error / N_T) · (Σ CV Error / N_CV)
Eq 4.6 : W – Score
where:
W – W-Score
F1 – F1 Score
N_T – Number of Training Examples
N_CV – Number of Cross-Validation Examples
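Eq 4.6 mirrors the computation carried out in init.R (Section 5.1.5); a Python sketch of the same formula:

```python
def w_score(f1, train_errors, cv_errors):
    """W = (1 - F1) * mean(|train error|) * mean(|CV error|); lower is better."""
    mean_train = sum(abs(e) for e in train_errors) / len(train_errors)
    mean_cv = sum(abs(e) for e in cv_errors) / len(cv_errors)
    return (1 - f1) * mean_train * mean_cv

# A perfect classifier (F1 = 1) always scores W = 0
assert w_score(1.0, [0.1, 0.2], [0.1, 0.3]) == 0.0
```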
4.5.4 LEARNING CURVES
Fig 4.7 : Single Subject Learning
Fig 4.8 : Learning from Experience
Fig 4.9 : Score & Learning Time vs Experience
A learning curve is a graphical representation of the increase of learning
(vertical axis) with experience (horizontal axis). Although the curve for a single
subject may be erratic (Fig 4.7), when a large number of trials are averaged, a
smooth curve results, which can be described with a mathematical function (Fig
4.8). Depending on the metric used for learning (or proficiency) the curve can
either rise or fall with experience (Fig 4.9).
Within the context of the project the horizontal axis will be training
examples, which is basically derived from experience, and the vertical axis is the
cost function error. Ideally the cost function error should decrease with increase in
training examples.
There are, however, two types of errors: the training error and the
cross-validation error. With an increase in training examples the training
error will increase gradually, since the model must explain a more diverse
spectrum of examples without overfitting; but it should not increase sharply.
Also, if the model generalizes well, it should perform just as well on new
data as it does on the training dataset, so the cross-validation error must
decrease as training examples increase.
Thus for an ideal model the training error rises only slightly with more
training examples, the cross-validation error falls, and the two errors
converge as shown in (Fig 4.10).
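The procedure behind these curves can be outlined as a loop over growing training-set sizes; a hedged Python skeleton with dummy stand-ins for the model and cost (the real learningCurve.R in Chapter 5 fits a logistic model at each step):

```python
def learning_curve(dataset, train_model, cost, start, step):
    """Record training and CV error as the number of available examples grows."""
    sizes, train_errors, cv_errors = [], [], []
    for n in range(start, len(dataset) + 1, step):
        cut = n * 7 // 10                        # 70-30 split of the first n rows
        train, cv = dataset[:cut], dataset[cut:n]
        model = train_model(train)               # fit on the training portion only
        train_errors.append(cost(model, train))  # error on data the model has seen
        cv_errors.append(cost(model, cv))        # error on held-out data
        sizes.append(n)
    return sizes, train_errors, cv_errors

# Dummy model and cost, for shape only: the "error" is just the subset size
sizes, tr, cv = learning_curve(list(range(100)), lambda t: None,
                               lambda m, d: len(d), start=10, step=30)
```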
Fig 4.10 : Training & Cross – Validation Error Convergence
Fig 4.11 : Choosing the Best Model
4.6 PREDICTION
Prediction is the final step in the process. After selecting the model that
best fits the given dataset, the model can be put to use on actual, unlabeled
real-world data; that is, it can be used to make predictions for data whose
outcomes are not known. The prediction process begins with the algorithm being
supplied unlabeled student data, from which it predicts an outcome: whether the
student will drop out or not.
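The final thresholding step can be sketched as follows; a hedged Python illustration of a logistic hypothesis with a cut-off z (in the project, the fitted glm model and the tuned z value from init.R play these roles):

```python
import math

def predict_dropout(theta, features, z=0.5):
    """h(x) = sigmoid(theta . x); classify as dropout when h(x) >= z."""
    score = sum(t * x for t, x in zip(theta, features))
    probability = 1 / (1 + math.exp(-score))   # logistic (sigmoid) function
    return probability >= z

# A strongly positive score crosses the threshold; a strongly negative one does not
assert predict_dropout([10.0], [1.0]) is True
assert predict_dropout([-10.0], [1.0]) is False
```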
Fig 4.12 : Prediction
CHAPTER 5
IMPLEMENTATION
5.1 R
5.1.1 COST FUNCTION
costFunction <- function(dataset, prediction){
dataset <- as.numeric(dataset);
prediction <- as.numeric(prediction);
m = length(dataset);
# Mean squared error cost: J = (1 / 2m) * sum((y - h)^2)
J = 1 / (2 * m) * sum((dataset - prediction) ^ 2);
return(J);
}
5.1.2 F1SCORE.R
f1Score <- function(data, prediction){
data <- as.numeric(data);
prediction <- as.numeric(prediction);
# True positives are examples that are positive AND predicted positive
true_positives <- sum(data & prediction);
false_positives <- sum(!data & prediction);
false_negatives <- sum(data & !prediction);
precision <- true_positives / (true_positives + false_positives);
recall <- true_positives / (true_positives + false_negatives);
return(as.numeric(2 * precision * recall / (precision + recall)));
}
5.1.3 GENERATEDATASET.R
generateDataset <- function(n, dropout_percentage){
source('generateVector.R');
source('percRank.R')
#Gender List
gender_list <- list(data = factor(c("Male", "Female")),
dist = list(Male = 0.52, Female = 0.48),
w = list(Male = 0.39, Female = 0.41));
#Poverty List
poverty_list <- list(data = factor(c("Yes", "No")),
dist = list(Yes = 0.22, No = 0.78),
w = list(Yes = 0.80, No = 0.27));
#Community List
community_list <- list(data = factor(c("General", "OBC", "SC", "ST")),
dist = list(General = 0.30, OBC = 0.40, SC = 0.20, ST = 0.10),
w = list(General = 0.10, OBC = 0.48, SC = 0.64, ST = 0.69));
#Rural List
rural_list <- list(data = factor(c("Yes", "No")),
dist = list(Yes = 0.75, No = 0.25),
w = list(Yes = 0.45, No = 0.20));
#Pupil Teacher Ratio List
ptr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE),
dist = list(Low = 0.20, Medium = 0.30, High = 0.50),
w = list(Low = 0.15, Medium = 0.35, High = 0.55));
#Student Classroom Ratio List
scr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE),
dist = list(Low = 0.18, Medium = 0.33, High = 0.49),
w = list(Low = 0.22, Medium = 0.25, High = 0.60));
Gender <- generateVector(n, gender_list);
Poverty <- generateVector(n, poverty_list);
Community <- generateVector(n, community_list);
Rural <- generateVector(n, rural_list);
PTR <- generateVector(n, ptr_list);
SCR <- generateVector(n, scr_list);
getW <- function(list, vector, index){
value <- as.character(vector[index]);
return(as.numeric(list$w[value]));
}
weightage_vector <- vector('numeric');
for(i in 1:n){
gender_weightage <- getW(gender_list, Gender, i);
poverty_weightage <- getW(poverty_list, Poverty, i);
community_weightage <- getW(community_list, Community, i);
rural_weightage <- getW(rural_list, Rural, i);
weightage_vector[i] <-
gender_weightage +
poverty_weightage +
community_weightage +
rural_weightage +
getW(ptr_list, PTR, i) +
getW(scr_list, SCR, i)
;
}
w_rank <- percRank(weightage_vector);
Dropout <- w_rank >= (1 - dropout_percentage);
data <- data.frame(Gender, Poverty, Community, Rural, PTR, SCR,
Dropout);
write.csv(file="data.csv", x=data)
}
5.1.4 GENERATEVECTOR.R
generateVector <- function(n, list){
dist <- list$dist;
p <- c(length(list$data));
#Generate probability series
k = 1;
for(i in dist){
if(k == 1){
p[k] = i;
}
else{
p[k] = i + p[k - 1];
}
k = k + 1;
}
#Get index of value that will be added to the vector
getIndex <- function(p, r){
k = 1;
for(i in p){
if(r <= i){
break;
}
k = k + 1;
}
return(k);
}
#Generate Vector
result <- factor(list$data);
for(i in 1:n) {
index <- getIndex(p, runif(1));
value <- list$data[index];
result[i] = value;
}
return(result);
}
5.1.5 INIT.R
setwd('/Users/ramathreya/Sites/foss-project/r');
source('generateDataset.R');
source('randomizeDataset.R');
source('predictor.R');
source('costFunction.R');
source('f1Score.R');
source('learningCurve.R');
source('plotLearningCurve.R');
partition <- 0.7;
start <- 100;
interval <- 500;
dataset <- read.csv(file="input.csv");
n <- nrow(dataset);
png('../public/plot.png');
opar <- par(no.readonly=TRUE)
par(mfrow=c(3, 3));
z <- c();
train <- c();
cv <- c();
f1 <- c();
seq_range <- seq(0.1, 0.9, 0.1);
for(i in seq_range){
curves <- learningCurve(dataset, start, n, interval, partition, "Dropout",
predictor, i);
plotLearningCurve(curves$m, curves$train, curves$test, c("Plot when Z is ", i), "Training Examples", "Error");
train_last <- tail(curves$train, 1);
cv_last <- tail(curves$test, 1);
z <- c(z, i);
cv <- c(cv, sum(abs(curves$test)) / length(curves$test));
train <- c(train, sum(abs(curves$train)) / length(curves$train));
f1 <- c(f1, sum(abs(curves$f1)) / length(curves$f1));
}
w <- (1-f1) * train * cv;
analysis <- data.frame(z, train, cv, f1, w);
min_index <- which(w==min(w));
write.csv(seq_range[min_index], file="out.z")
dev.off();
5.1.6 LEARNINGCURVE.R
learningCurve <- function(dataset, start, end, interval, partition, column,
predictor, z){
train_plot <- c();
test_plot <- c();
x <- c();
f1 <- c();
for(i in seq(start, end, interval)){
m <- i * partition;
training_dataset <- dataset[1:m, ];
test_dataset <- dataset[(m+1):i, ];
train_actual <- unlist(training_dataset[column]);
test_actual <- unlist(test_dataset[column]);
predictor_formula <- predictor(training_dataset);
train_pred <- predict(predictor_formula, type="response",
training_dataset) >= z;
test_pred <- predict(predictor_formula, type="response", test_dataset) >=
z;
train_cost <- costFunction(train_actual, train_pred);
test_cost <- costFunction(test_actual, test_pred);
f1 <- c(f1, f1Score(test_actual, test_pred));
x <- c(x, i);
train_plot <- c(train_plot, train_cost);
test_plot <- c(test_plot, test_cost);
}
return(list(train=train_plot, test=test_plot, m=x, f1=f1));
}
5.1.7 MYSQL.R
library(RMySQL)
db = dbConnect(MySQL(), user='root', password='', dbname='mobile_crm',
host='localhost')
5.1.8 PERCRANK.R
percRank <- function(x) trunc(rank(x)) / length(x)
5.1.9 PLOTLEARNINGCURVE.R
plotLearningCurve <- function(m, train_plot, test_plot, title, xlab, ylab,
rnge=range(0, 0.15)){
plot(m, train_plot, type="l", col="red", xlab=NA, ylab=NA, ylim=rnge);
par(new=TRUE);
plot(m, test_plot, type="l", col="green", xlab=NA, ylab=NA, ylim=rnge,
axes=FALSE);
par(new=TRUE);
legend('topright', c("Training", "C.V"),
bty="n", lty=1, lwd=0.5, cex=0.5,
col=c('red', 'green'));
title(title,
xlab=xlab,
ylab=ylab);
}
5.1.10 PREDICTION.R
setwd('/Users/ramathreya/Sites/foss-project/r');
source('predictor.R');
dataset <- read.csv(file="input.csv");
z <- read.csv(file="out.z")
z <- z[1, 'x'];
predictor_formula <- predictor(dataset);
input <- read.csv('predict-input.csv');
dataset <- rbind(dataset, input)
l <- nrow(dataset);
prediction <- predict(predictor_formula, newdata=dataset,
type="response");
prediction <- (prediction[l] >= z);
fileConn<-file("output")
writeLines(c(toString(prediction)), fileConn)
close(fileConn)
5.1.11 PREDICTOR.R
predictor <- function(dataset){
formula <- glm(
formula = Dropout ~ Gender + Poverty + Community + Rural + PTR + SCR,
family = binomial,
data = dataset);
return(formula);
}
5.1.12 RANDOMIZEDATASET.R
randomizeDataset <- function(dataset){
result <- subset(dataset, FALSE);
l <- nrow(dataset);
s <- sample(seq(1, l), l);
for(i in 1:l){
result[i, ] <- dataset[s[i], ];
}
return(result);
}
5.2 NODE.JS
5.2.1 APP.JS
;
var express = require('express');
var http = require('http');
var path = require('path');
var bodyParser = require('body-parser');
app = express();
app.configure(function() {
app.set('views', __dirname + '/app/views');
app.set('view engine', 'jade');
app.use(express.static(path.join(__dirname, 'public')));
app.use(express.cookieParser());
app.use(express.methodOverride());
app.use(express.session({secret: 'keyboard cat'}));
app.use(bodyParser.json());
app.use(express.json()); // to support JSON-encoded bodies
app.use(express.urlencoded()); // to support URL-encoded bodies
app.locals.basedir = path.join(__dirname, '/app/views');
app.use(app.router);
app.basepath = __dirname;
require('./routes')();
http.createServer(app).listen(3000, function() {
console.log('Server Started');
});
});
5.2.2 PACKAGE.JSON
{
"name": "foss-project",
"scripts": {
"start": "node app"
},
"dependencies": {
"body-parser": "^1.5.2",
"connect": "*",
"express": "3.4.0",
"formidable": "1.0.15",
"jade": "*",
"request": "2.x"
},
"engines": {
"node": "0.10.x",
"npm": "1.2.x"
}
}
5.2.3 ROUTES.JS
;
var formidable = require('formidable'),
util = require('util'),
fs = require('fs'),
sys = require('sys'),
exec = require('child_process').exec;
module.exports = function() {
app.get('/', function(req, res) {
res.render('index');
});
app.post('/upload', function(req, res) {
// parse a file upload
var form = new formidable.IncomingForm();
form.parse(req, function(err, fields, files) {
//Write to CSV file within r folder
fs.readFile(files.upload.path, function(err, data) {
var newPath = __dirname + "/r/input.csv";
fs.writeFile(newPath, data, function(err) {
function puts(error, stdout, stderr) {
res.render('upload');
}
exec("Rscript r/init.R", puts);
});
});
});
return;
});
app.get('/predict', function(req, res) {
res.render('predict');
});
app.post('/predict', function(req, res) {
var json = JSON.parse(req.body.json);
var key_string = '"",', value_string = '"",';
for(var i in json){
key_string += json[i].name + ',';
value_string += json[i].value + ',';
}
key_string += 'Dropout'
value_string += '""';
var string = key_string + '\n' + value_string + '\n';
fs.writeFile('r/predict-input.csv', string, function(err) {
function puts(error, stdout, stderr) {
fs.readFile('r/output', 'utf-8', function(err, data) {
res.end(data);
});
}
exec("Rscript r/prediction.R", puts);
});
});
};
5.2.4 INDEX.JADE
doctype html
html
head
title Dashboard
meta(charset="UTF-8")
meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport')
link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css")
link(rel="stylesheet",href="css/font-awesome.min.css",type="text/css")
link(rel="stylesheet",href="css/ionicons.min.css",type="text/css")
link(rel="stylesheet",href="css/morris/morris.css",type="text/css")
link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap-1.2.2.css",type="text/css")
link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3-wysihtml5.min.css",type="text/css")
link(rel="stylesheet",href="css/AdminLTE.css",type="text/css")
body(class="skin-blue")
header(class="header")
a(href="/",class="logo") FOSS Project
nav(class="navbar navbar-static-top",role="navigation")
a(href="#",class="navbar-btn sidebar-toggle",data-toggle="offcanvas",role="button")
span(class="sr-only") Toggle Navigation
span(class="icon-bar")
span(class="icon-bar")
span(class="icon-bar")
div(class="wrapper row-offcanvas row-offcanvas-left")
aside(class="left-side sidebar-offcanvas")
section(class="sidebar")
ul(class="sidebar-menu")
li
a(href="/")
i(class="fa fa-upload")
span Upload
li
a(href="/predict")
i(class="fa fa-search")
span Predict
aside(class="right-side")
section(class="content-header")
h1 Upload
section
div(class="box box-primary")
form(action="/upload",enctype="multipart/form-data",method="post",role="form")
div(class="box-body")
div(class="form-group")
input(type="file",name="upload",multiple="multiple")
div(class="box-footer")
button(type="submit",class="btn btn-primary") Upload
script(src="js/jquery.js")
script(src="js/bootstrap.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en.js")
script(src="js/AdminLTE/app.js")
5.2.5 PREDICT.JADE
doctype html
html
head
title Dashboard
meta(charset="UTF-8")
meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport')
link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css")
link(rel="stylesheet",href="css/font-awesome.min.css",type="text/css")
link(rel="stylesheet",href="css/ionicons.min.css",type="text/css")
link(rel="stylesheet",href="css/morris/morris.css",type="text/css")
link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap-1.2.2.css",type="text/css")
link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3-wysihtml5.min.css",type="text/css")
link(rel="stylesheet",href="css/AdminLTE.css",type="text/css")
body(class="skin-blue")
header(class="header")
a(href="/",class="logo") FOSS Project
nav(class="navbar navbar-static-top",role="navigation")
a(href="#",class="navbar-btn sidebar-toggle",data-toggle="offcanvas",role="button")
span(class="sr-only") Toggle Navigation
span(class="icon-bar")
span(class="icon-bar")
span(class="icon-bar")
div(class="wrapper row-offcanvas row-offcanvas-left")
aside(class="left-side sidebar-offcanvas")
section(class="sidebar")
ul(class="sidebar-menu")
li
a(href="/")
i(class="fa fa-upload")
span Upload
li
a(href="/predict")
i(class="fa fa-search")
span Predict
aside(class="right-side")
section(class="content-header")
h1 Predict
section
div(class="box box-primary")
form(action="#",enctype="multipart/form-data",method="post",role="form",id="form")
div(class="box-body")
div(class="form-group col-md-2")
label Gender
select(class="form-control",name="Gender")
option(value="Male") Male
option(value="Female") Female
div(class="form-group col-md-2")
label Poverty
select(class="form-control",name="Poverty")
option(value="Yes") Yes
option(value="No") No
div(class="form-group col-md-2")
label Community
select(class="form-control",name="Community")
option(value="General") General
option(value="OBC") OBC
option(value="SC") SC
option(value="ST") ST
div(class="form-group col-md-2")
label Rural
select(class="form-control",name="Rural")
option(value="Yes") Yes
option(value="No") No
div(class="form-group col-md-2")
label PTR
select(class="form-control",name="PTR")
option(value="Low") Low
option(value="Medium") Medium
option(value="High") High
div(class="form-group col-md-2")
label SCR
select(class="form-control",name="SCR")
option(value="Low") Low
option(value="Medium") Medium
option(value="High") High
div(class="box-footer",style="margin-left: 5px;")
button(type="button",class="btn btn-primary",id="submit") Predict
label(id="outcome",style="margin-left: 10px;")
script(src="js/jquery.js")
script(src="js/bootstrap.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en.js")
script(src="js/AdminLTE/app.js")
script(type="text/javascript").
$(document).ready(function(){
$('#submit').click(function(){
var json = JSON.stringify($('#form').serializeArray());
$.ajax({
url: '/predict',
method: 'post',
data: {
json: json
},
success: function(response){
var label = $('#outcome');
if(response.indexOf("TRUE") >= 0){
label.css('color', 'red');
label.html('Student will Dropout');
}
else{
label.css('color', 'green');
label.html('Student will Not Dropout');
}
}
});
});
});
5.2.6 UPLOAD.JADE
doctype html
html
head
title Dashboard
meta(charset="UTF-8")
meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport')
link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css")
link(rel="stylesheet",href="css/font-awesome.min.css",type="text/css")
link(rel="stylesheet",href="css/ionicons.min.css",type="text/css")
link(rel="stylesheet",href="css/morris/morris.css",type="text/css")
link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap-1.2.2.css",type="text/css")
link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3-wysihtml5.min.css",type="text/css")
link(rel="stylesheet",href="css/AdminLTE.css",type="text/css")
body(class="skin-blue")
header(class="header")
a(href="/",class="logo") FOSS Project
nav(class="navbar navbar-static-top",role="navigation")
a(href="#",class="navbar-btn sidebar-toggle",data-toggle="offcanvas",role="button")
span(class="sr-only") Toggle Navigation
span(class="icon-bar")
span(class="icon-bar")
span(class="icon-bar")
div(class="wrapper row-offcanvas row-offcanvas-left")
aside(class="left-side sidebar-offcanvas")
section(class="sidebar")
ul(class="sidebar-menu")
li
a(href="/")
i(class="fa fa-upload")
span Upload
li
a(href="/predict")
i(class="fa fa-search")
span Predict
aside(class="right-side")
section(class="content-header")
h1 Learning Curves
section
div(class="box box-primary")
iframe(src="plot.png",style="width: 600px; height: 500px;",frameborder="0")
script(src="js/jquery.js")
script(src="js/bootstrap.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en.js")
script(src="js/AdminLTE/app.js")
CHAPTER 6
RESULTS
6.1 DATASET UPLOAD
6.2 UPLOAD RESULT
Fig 6.1 : Upload Result
6.3 PREDICTION
Fig 6.2 : Prediction Screen
Fig 6.3 : Predicting Student will not Dropout
Fig 6.4 : Predicting Student will Dropout
CHAPTER 7
CONCLUSIONS
The advent of Information Technology and the Internet has led to vast
amounts of data being gathered and stored in multiple formats by multiple
sources. Thus both big corporations and Government Agencies are attempting to
tap into these vast troves of data to make better decisions and create more
efficient processes. Techniques such as Machine Learning and Neural Networks,
commonly grouped under the term Big Data, are changing the way we analyze
information and are adding real value.
This project was inspired by such technologies. The aim was to create an
objective mechanism for solving the dropout problem that could be used for policy
making. This algorithm could provide an objective solution by identifying
vulnerable students who truly need help and thereby improve retention and
completion rates in schools.
Personally, it was a great opportunity for me to discover an area of
programming that I had wanted to learn for some time. At the same time,
getting a chance to solve a real-world problem that is vital to our society
made it all the more worthwhile. I humbly admit that the algorithm developed
is by no means perfect, but it was a determined attempt on my part to show
what is possible. Hopefully others will take this up and extend it to the
point where it can be of use to Government Agencies and provide real value to
students, who are the final beneficiaries of this system and the future of
our nation.
CHAPTER 8
REFERENCES
RESEARCH PAPERS
Data Mining: A prediction for Student's Performance Using Classification
Method (World Journal of Computer Application and Technology)
A comparative study for predicting student’s academic performance using
Bayesian Network Classifiers (IOSR Journal of Engineering)
School Dropout across Indian States and UTs: An Econometric Study
(International Research Journal of Social Sciences)
Mining Educational Data to Analyze Students’ Performance (International
Journal of Advanced Computer Science and Applications)
Gender Issues and Dropout Rates in India: Major Barrier in Providing
Education for All (Amirtham, N. S. & Kundupuzhakkal, S. / Educationia
Confab)
Mining Educational Data Using Classification to Decrease Dropout Rate of
Students (International Journal of Multidisciplinary Sciences and Engineering)
Predicting Students Academic Performance Using Education Data Mining
(International Journal of Computer Science and Mobile Computing)
Prediction of student academic performance by an application of data
mining techniques (2011 International Conference on Management and
Artificial Intelligence)
Educational Data Mining: A Review of the State-of-the-Art (Transactions
on Systems, Man, and Cybernetics)
SURVEYS
School Drop out: Patterns, Causes, Changes and Policies (UNESCO)
The Criticality of Pupil Teacher Ratio (Azim Premji Foundation)
Survey for Assessment of Dropout Rates at Elementary Level in 21 States
(EdCIL)
Right to Education Report Card (Annual Status of Education Report 2011)
How High Are Dropout Rates in India? (Economic and Political Weekly
March 17, 2007)
GOVERNMENT REPORTS
Review, Examination and Validation of Data on Dropout in Karnataka
(Department of Education Government of Karnataka)
Drop – out rate at primary level: A note based on DISE 2003 – 04 & 2004 –
05 data (National Institute of Educational Planning and Administration)
Dropout in Secondary Education: A Study of Children Living in Slums of
Delhi (National University of Educational Planning and Administration)
BOOKS
Data Mining: Concepts and Techniques (Jiawei Han
and Micheline Kamber)
R in Action (Robert I. Kabacoff)
LINKS
http://www.wikipedia.org
http://scholar.google.com
https://www.coursera.org/course/ml