
CS4642 - Data Mining & Information Retrieval

Paper Based on KDDCup 2014 Submission

Group Members:

100227D - Jayaweera W.J.A.I.U.

100470N - Sajeewa G.K.M.C

100476M - Sampath P.L.B.

100612E - Wijewardane M.M.D.T.K.

Group Number : 13

Final Group Rank : 76


Description of Data

In this competition, five data files are available to competitors: donations (information about the donations to each project; provided only for projects in the training set), essays (project text posted by the teachers; provided for both the training and test sets), projects (information about each project; provided for both sets), resources (information about the resources requested for each project; provided for both sets) and outcomes (information about the outcomes of projects in the training set). Before starting the knowledge discovery process, the provided data were analyzed.

First, the number of records in each file was counted to get an idea of the amount of data available. The projects file has 664098 records, the essays file has 664098 records, the outcomes file has 619326 records, the resources file has 3667217 records and the donations file has 3097989 records. Our next task was to identify the criterion used to differentiate test data from training data. After reading the competition details we realized that projects posted after 2014-01-01 belong to the test set and projects posted before 2014-01-01 belong to the training set. Accordingly, 619326 projects are available for the training set and the remaining 44772 projects form the test set. For each project in the training set, the project description, essay data, resources requested, donations received and outcome are given. For each project in the test set, only the project description, essay data and resources requested are given.
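As an illustration, here is a minimal pandas sketch of this date-based split; the file name and the 'date_posted' column are assumptions based on the competition data:

```python
import pandas as pd

# Load the projects file; "date_posted" is the posting-date column (assumed).
projects = pd.read_csv("projects.csv", parse_dates=["date_posted"])

# Projects posted before 2014-01-01 form the training set; the rest form the test set.
train = projects[projects["date_posted"] < "2014-01-01"]
test = projects[projects["date_posted"] >= "2014-01-01"]

print(len(train), len(test))  # should match the counts above: 619326 and 44772
```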

Data Imbalanced Problem

After gaining a brief understanding of the data provided, we started to analyze the training set. When we plotted the projects' posted dates against the "is_exciting" attribute, we realized that there are no exciting projects before April 2010; the graph was completely skewed to the right side.

This leads to a data imbalance problem, as the number of exciting projects is very small compared to the number of non-exciting projects (exciting - 5.9274%). A histogram of exciting versus non-exciting project counts made the imbalance obvious.


In the competition forum there was an explanation for this problem. It said that the organization might not have kept track of some of the requirements needed to decide 'is_exciting' for projects before 2010. We therefore suspected that the classification given in the outcomes file for projects before 2010 may not be correct, and we decided to use a downsampling technique to handle the imbalanced data (remove projects before 2010). It is true that valuable information may be lost when projects are removed, but the accuracy gained by removing that data outweighed the loss of information. We were thus able to obtain higher accuracy by downsampling the given data. All the classifiers that we used performed well after removing projects before 2010.
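Continuing the sketch above, the downsampling step might look like this; the 'projectid' join key and 'is_exciting' label column are assumptions based on the competition data:

```python
# Attach training labels and drop the unreliably labelled pre-2010 projects.
outcomes = pd.read_csv("outcomes.csv")
train = train.merge(outcomes, on="projectid")
train = train[train["date_posted"] >= "2010-01-01"]

# Inspect the class balance after downsampling.
print(train["is_exciting"].value_counts(normalize=True))
```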

Preprocessing Data

First we analyzed the characteristics of the data using statistical measurements. Using the data frame describe method we calculated the number of records, mean, standard deviation, minimum value, maximum value and the quartile values for each attribute.
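A minimal sketch of this computation for two attributes, with the column names assumed from the projects file:

```python
# describe() reports count, mean, std, min, quartiles and max per column.
print(projects[["total_price_excluding_optional_support",
                "students_reached"]].describe())
```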

We were able to get an idea about the distribution of attributes using these statistical

measurements.


Filling Missing Values

Initially we used the pad method (propagate the last valid observation forward) to fill missing values for all attributes. But we realized that we could achieve higher accuracy by selecting a filling method based on the type of the attribute. To do that we first calculated the percentage of missing values per attribute.

The highest percentages of missing values were for the secondary focus subject and the secondary focus area. This is because some projects have only a primary focus area and a primary focus subject. We decided to fill missing secondary values with their respective primary values. We used linear interpolation for numeric attributes, and for the other attributes we used the pad method. Later, when tuning the classifiers, we changed the method from pad to backfill (use the next valid observation) as it achieved higher accuracy than pad.
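A sketch of these filling rules; the focus-related column names are assumptions based on the projects file:

```python
# Percentage of missing values per attribute.
print(projects.isnull().mean() * 100)

# Fill missing secondary focus fields from their primary counterparts.
for sec, pri in [("secondary_focus_subject", "primary_focus_subject"),
                 ("secondary_focus_area", "primary_focus_area")]:
    projects[sec] = projects[sec].fillna(projects[pri])

# Linear interpolation for numeric attributes, backfill for everything else.
numeric = projects.select_dtypes(include="number").columns
projects[numeric] = projects[numeric].interpolate(method="linear")
projects = projects.bfill()
```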

Remove Outliers

When we analyzed the data, outliers were detected in some of the attributes. We used scatter plots to identify them. There were outliers in the cost-related attributes, and we replaced them with the mean value of the respective attribute. In the scatter plot of the cost attribute, one value stood out as an outlier because it was far larger than the other values. These outliers caused a lot of problems when we discretized the data. To identify outliers in the resources data, we used the interquartile range as a measurement.
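A sketch of the interquartile-range check, using an assumed cost column:

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
col = projects["total_price_excluding_optional_support"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Replace outliers with the mean of the remaining values.
projects.loc[outliers, col.name] = col[~outliers].mean()
```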

Label Encoding

We did not use all the attributes for predictions. We focused more on repetitive features, as they help the classifier more when making predictions. Most of these repetitive features/attributes have string values rather than numerical values. The available classifiers do not accept string values for features, so we used a label encoder to transform those string values into integer values between 0 and n-1, n being the number of distinct values a feature can take.

But classifiers expect continuous input and may interpret the categories as being ordered, which is not desired. To turn the categorical features into features usable with scikit classifiers we used one-hot encoding. The encoder transformed each categorical feature with k possible values into k binary features, with only one active for a particular sample. This improved the performance of the classifiers to a great extent. For example, the SGD classifier obtained about a 0.55 ROC score without one-hot encoding and about a 0.59 ROC score with it.
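A sketch of the two encoding steps with scikit-learn, using an assumed categorical column:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Step 1: map each distinct string value to an integer in 0..n-1.
le = LabelEncoder()
codes = le.fit_transform(projects["primary_focus_subject"].astype(str))

# Step 2: expand the integer codes into k binary columns so the classifier
# cannot read a spurious ordering into the categories.
ohe = OneHotEncoder()
features = ohe.fit_transform(codes.reshape(-1, 1))  # sparse, one column per category
```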

Continuous Values Discretization

Project attributes such as school longitude, school latitude, zip code and total cost cannot be used directly for predictions, as they are less likely to be repetitive. But this information cannot be discarded, since it may help the classifiers make decisions. To make these attributes more repetitive we used discretization: we put the continuous values into bins and used the bin index as the attribute. For example, we discretized longitude and latitude to divide projects into five regions (bins) and used the region id instead of the raw coordinates. The same was done for the total cost attribute.
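A sketch of the binning with pandas, with the column names assumed:

```python
# Five latitude/longitude bins; the integer bin index becomes the feature.
projects["lat_bin"] = pd.cut(projects["school_latitude"], bins=5, labels=False)
projects["lon_bin"] = pd.cut(projects["school_longitude"], bins=5, labels=False)

# Same idea for the total cost: the bin index replaces the raw dollar amount.
projects["cost_bin"] = pd.cut(
    projects["total_price_excluding_optional_support"], bins=10, labels=False)
```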


We applied the same concept to other cost-related attributes, the item count per project, the total price of items per project, the number of projects per teacher, etc. This improved the repetitiveness of the attributes to a great extent, and more useful information was uncovered for the classifier to use.

Attribute Construction

Some of the features given in the data files cannot be used directly, for various reasons (most often they are highly non-repetitive). We used some of these features to construct new features by combining multiple features or transforming one into another. Given below is the list of derived attributes; a sketch of how they can be computed follows the list.

1. Month - the posted date of the project was given but is less repetitive. We derived a month attribute from the posted date and used it for prediction.

2. Essay length - for each project the corresponding essay was given, but it cannot be used directly for prediction. Therefore we calculated the length of each essay, after removing extra spaces within the essay text, and used it as an attribute.

3. Need statement length - calculated in the same way as the essay length.

4. Projects per teacher - we calculated the number of projects per teacher by grouping the projects by 'teacher_acctid' and used it as an attribute.

5. Total items per project - we calculated the total number of items requested for each project from the details provided in the resources file and used it as an attribute.

6. Cost of total items per project - we calculated the total cost of the items requested for each project from the details provided in the resources file and used it as an attribute.
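A sketch of how these attributes can be derived with pandas; the resources column names ('item_quantity', 'item_unit_price') are assumptions based on the competition data:

```python
# 1. Month from the posted date.
projects["month"] = projects["date_posted"].dt.month

# 2. Essay length after collapsing extra whitespace.
essays = pd.read_csv("essays.csv")
essays["essay_length"] = (essays["essay"].fillna("")
                          .str.split().str.join(" ").str.len())

# 4. Number of projects posted by each teacher.
projects["projects_per_teacher"] = (projects.groupby("teacher_acctid")["teacher_acctid"]
                                    .transform("count"))

# 5 and 6. Item count and total item cost per project from the resources file.
resources = pd.read_csv("resources.csv")
resources["item_cost"] = resources["item_quantity"] * resources["item_unit_price"]
per_project = (resources.groupby("projectid")
               .agg(items_per_project=("item_quantity", "sum"),
                    cost_per_project=("item_cost", "sum"))
               .reset_index())
projects = projects.merge(per_project, on="projectid", how="left")
```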

Several other derived attributes, such as the date and the short description length, were considered, but they did not yield a significant performance improvement.

Model Selection and Evaluation

We used three classifiers during the project: first a decision tree classifier, then logistic regression, and finally an SGD (stochastic gradient descent) classifier. We started with the tree classifier as it was easy to use. To evaluate the performance of the classifiers we initially used cross-validation. But later we realized that the competition uses the ROC AUC (area under the ROC curve) score for evaluation, so we also used ROC scores to evaluate classifier performance. As we had several choices of classifier, we read several articles about their usage. From them we learned that a decision tree normally does not perform well when there is a data imbalance problem, and that logistic regression is used instead.

Logistic regression performed well with the given data and achieved about a 0.61 ROC score. To improve the accuracy further we used the SGD classifier (logistic regression with SGD training). On one hand it is more efficient than plain logistic regression, so predictions can be made in less time; on the other hand it achieved higher accuracy than the regression classifier. With default parameters for the SGD classifier we were able to achieve about a 0.635 ROC score. To tune the SGD classifier (to find the best values for the parameters) we performed a grid search and found optimum values for the number of iterations, penalty, shuffle and alpha parameters. Using those values we were able to improve the accuracy up to a 0.64 ROC score.
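A sketch of the tuning step; the parameter grid shown is an assumption, not the exact grid we searched, and X_train / y_train stand for the encoded feature matrix and labels built above:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],
    "penalty": ["l2", "l1", "elasticnet"],
    "max_iter": [50, 100, 200],
    "shuffle": [True, False],
}
# loss="log_loss" gives logistic regression with SGD training
# (the loss was named "log" in older scikit-learn versions).
search = GridSearchCV(SGDClassifier(loss="log_loss"), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```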

Ensemble Methods

We tried a boosting algorithm to improve the performance of the classifier. Among the available methods we used AdaBoost (AdaBoostClassifier). The implementation provided by the scikit library supported only the decision tree classifier and the SGD classifier, so we were not able to use logistic regression directly. Instead we tried the SGD classifier with the boosting algorithm, but the accuracy increased only by an insignificant amount.
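A minimal boosting sketch using AdaBoostClassifier with its default decision-tree base estimator (swapping in an SGD base is what we attempted above); X_train, y_train and X_test again stand for the prepared data:

```python
from sklearn.ensemble import AdaBoostClassifier

# The default base estimator is a depth-1 decision tree ("stump").
boosted = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)
boosted.fit(X_train, y_train)
scores = boosted.predict_proba(X_test)[:, 1]  # probability of "is_exciting"
```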

Further Improvements

The essays file contains a huge amount of text, but it was not used for predictions apart from the essay length. We tried to extract essay features using TfidfVectorizer, but this was not successful due to memory constraints. As an alternative we tried hashing methods, but they reduced the accuracy. We think that the accuracy of the classifier may improve further if some features from the essay data are included in the training data, and that further use of ensemble methods may also improve the accuracy of predictions.
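A sketch of the memory-bounded hashing alternative we tried; the number of features is an arbitrary choice:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hash essay text into a fixed number of columns instead of building a
# full TF-IDF vocabulary in memory.
vectorizer = HashingVectorizer(n_features=2**18, stop_words="english")
essay_features = vectorizer.transform(essays["essay"].fillna(""))
```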

Support Libraries Used

We used the 'Pandas' data analysis library to generate data frames from the provided comma-separated values files, so that the data could be used with the other analysis and modeling tools we employed. We also used functions provided by 'Pandas' to generate bins (in order to discretize the attributes with less repetitive values) and to merge data frames from several data sources.

We then used the 'NumPy' extension library to generate multidimensional arrays from 'Pandas' data frames and series, which made it easy to access particular ranges of data (e.g. to separate the indices of the training set from those of the test set) and to locate properties of the data such as the median and quartiles. Functions provided by 'NumPy' were also useful when combining derived attributes with existing ones.

The 'Scikit-learn' machine learning library was used to integrate data analysis, preprocessing, classification, regression and modeling tools into our implementation. From the various tools provided by 'Scikit-learn' we used preprocessing tools like 'LabelEncoder', 'OneHotEncoder' and 'StandardScaler' and text feature extraction tools; classification tools like 'DecisionTreeClassifier', 'SGDClassifier' and 'LogisticRegression'; model selection and evaluation tools like 'GridSearchCV'; ensemble tools like 'AdaBoostClassifier'; and metrics like 'roc_auc_score' to compute the area under the curve (AUC) from prediction scores, as mentioned above.