
Predicting rainfall using an ensemble of ensembles∗†

Prolok Sundaresan, Varad Meru, and Prateek Jain‡

University of California, Irvine
{sunderap,vmeru,prateekj}@uci.edu

Abstract

Regression is an approach for modeling the relationship between the data X and the dependent variable y. In this report, we present our experiments with multiple approaches, ranging from ensembles of learners to deep learning networks, on the weather modeling data to predict rainfall. The competition was held on the online data science competition portal 'Kaggle'. The results for the weighted ensemble of learners gave us a top-10 ranking, with a testing root-mean-squared error of 0.5878.

1 Introduction

The task of this in-class Kaggle competition was to predict the amount of rainfall at a particular location using satellite data. We wanted to try various algorithms and ensembles for regression in order to experiment and learn. The report is structured in the following manner. Section 2 describes the dataset contents and the latent structure found using latent variable analysis and clustering; this was done by Prolok and Prateek. Section 3 describes the various models used in the project in detail: the neural network/deep learning work was done by Varad, Random Forests by Prolok and Prateek, and Gradient Boosting by Prateek and Varad. Section 4 describes the ensemble-of-ensembles technique we used; this ensemble sits on top of the different ensembles and learners described in Section 3, and the work on the final ensemble was done by all three members. Section 5 presents our learnings and conclusion.

2 Understanding The Data

Visualizing the data was a difficult task since the data had 91 dimensions. In order to look for patterns in the data and visualize it, we applied SVD to reduce the dimensionality of the features to two principal dimensions. We then applied k-means clustering with k = 5 on the full 91-dimensional data and plotted the cluster assignments in the two-dimensional transformed feature space. We saw patterns in the data; in particular, some points were densely clustered and some were sparse.

∗ The online competition is available at the Kaggle website https://inclass.kaggle.com/c/how-s-the-weather. The name of the team was skynet.
† This work was done as part of the project for CS 273: Machine Learning, Fall 2014, taught by Prof. Alexander Ihler.
‡ Prolok Sundaresan: Student# 66008474, Varad Meru: Student# 26648958, Prateek Jain: Student# 28321844

To visualize it better, we transformed the features into a three-dimensional space using the first three principal components, and saw that the points were clustered around three planes.

Figure 1: Visualizing the data in 3 dimensions
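As a rough illustration of this preprocessing (not our exact script), the sketch below projects the features onto their leading principal directions with a truncated SVD and clusters the full 91-dimensional data with k-means (k = 5). It assumes scikit-learn and a NumPy feature matrix X; the file name is a placeholder.

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# X: (n_samples, 91) feature matrix from the training data (file name is hypothetical).
X = np.loadtxt("rain_features_train.csv", delimiter=",")

# Project onto the first three principal directions for visualization.
svd = TruncatedSVD(n_components=3, random_state=0)
X_proj = svd.fit_transform(X)

# Cluster in the original 91-dimensional space with k = 5.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

# Scatter-plotting X_proj[:, :2] (or X_proj[:, :3]) colored by `labels`
# reproduces the 2-D and 3-D views described above.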

3 Machine Learning Models

3.1 Mixture of Experts

As seen from our visualization in Figure 1, we could identify two highly dense areas of the feature data on either side of a region of sparsely distributed data. The idea behind using the mixture-of-experts approach was that, intuitively, it would be difficult for a single regressor to fit the dataset, since the distribution is non-uniform. We decided to split the data into clusters. To cluster the data, we used several initializations of the k-means algorithm with k-means++ seeding. We used the number of clusters as one of the tunable parameters of our model.

Since each of the clusters got a subset of the points from the original dataset, the number of data points per cluster was not very large. Our concern was that any model we chose would overfit the data in its cluster. Therefore, we used the ensemble method of gradient boosting for each of the clusters. Since, in gradient boosting, we start with an underfitting model and then gradually add complexity, the chances of overfitting would be lower in this model. We decided to use decision stumps as our regressors for the boosting algorithm.

Figure 2: Visualizing the principal components of the data. (a) Cluster assignments of data points. (b) Mixture of experts error.

For evaluating the prediction on the validation split and the test data, we first check which cluster the data point belongs to. We did this by creating a k-nearest-neighbor classifier on the centers of the 3 clusters created in the previous step. The classifier then predicts the cluster assignment for each test point, and we use the ensemble of boosting regressors corresponding to that cluster on the data point to get its prediction.
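A minimal sketch of this mixture-of-experts pipeline, assuming scikit-learn, is shown below; the cluster count, booster settings, and names are illustrative rather than the exact values from our runs.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier

def fit_mixture_of_experts(X, y, n_clusters=3, n_boost=700):
    # Partition the training data with k-means (k-means++ initialization).
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)

    # One boosted ensemble of shallow trees (stumps) per cluster.
    experts = []
    for c in range(n_clusters):
        gbr = GradientBoostingRegressor(n_estimators=n_boost, max_depth=1)
        gbr.fit(X[labels == c], y[labels == c])
        experts.append(gbr)

    # A 1-nearest-neighbor classifier on the cluster centers routes new points.
    router = KNeighborsClassifier(n_neighbors=1)
    router.fit(km.cluster_centers_, np.arange(n_clusters))
    return router, experts

def predict_mixture_of_experts(router, experts, X_new):
    assignments = router.predict(X_new)
    y_hat = np.empty(len(X_new))
    for c, expert in enumerate(experts):
        mask = assignments == c
        if mask.any():
            y_hat[mask] = expert.predict(X_new[mask])
    return y_hat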

The parameters of the model we modified were the number of clusters and the number of regressors used for boosting. We found that although the training error reduced considerably on increasing the number of boosting regressors, the validation error increased after a certain point, as can be seen in Figure 2(b). We got the minimum validation error with 700 regressors.

3.2 Neural Networks

We implemented various types of neural networks, ranging from single-layer networks to 3-layer sigmoidal neural networks.

Single Layer Network

Figure 3: Single Layer Architecture.

We built the neural networks using MATLAB's Neural Network Toolbox and the PyBrain library in Python. For the MATLAB implementation, various runs were made for different numbers of neurons in the hidden layer. The architecture of the neural network can be seen in Figure 3. Figure 4 shows the train-validation-test plots for the different network architectures. The dataset was split into 70% (training), 20% (validation), and 10% (testing) sections for the neural network runs. Table 1 shows the performance of the models learned. It was seen that the neural networks started to overfit as the number of neurons was increased beyond 40.

# of Neurons    Training Error (RMSE)    Testing Error (RMSE)
10              0.5986                   0.61341
20              0.5875                   0.61301
50              0.5852                   0.62889

Table 1: RMSE Error rates for different network architectures.

It was observed that the learner could not learn very accurately, as there was not enough data for the neural network to learn from.
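A minimal sketch of the PyBrain side of this setup is given below; Xtr, ytr, and Xte are assumed NumPy arrays from the splits described above, the epoch limit is a placeholder, and trainUntilConvergence's internal hold-out is used here instead of the exact 70/20/10 split.

from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure import SigmoidLayer, LinearLayer

# 91 inputs, one sigmoid hidden layer (e.g. 20 units), 1 linear output.
net = buildNetwork(91, 20, 1, hiddenclass=SigmoidLayer, outclass=LinearLayer, bias=True)

# Wrap the training split in a PyBrain dataset.
ds = SupervisedDataSet(91, 1)
for x, t in zip(Xtr, ytr):
    ds.addSample(x, t)

# Backpropagation training; stops when the internal validation error stops improving.
trainer = BackpropTrainer(net, ds)
trainer.trainUntilConvergence(validationProportion=0.25, maxEpochs=200)

# Predictions for the test split.
yhat = [net.activate(x)[0] for x in Xte]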


Figure 4: Train-Validation-Test error plots and error distribution histograms for hidden layers with 10, 20, and 50 neurons.


Deep Networks

For this project, we tried using deep networks as well. The deep network was built using PyBrain. We tried different activation functions and architectures to understand how deep networks would work. The architecture shown in Figure 5 had three hidden layers: the visible layer had 91 neurons, the first hidden layer (tanh) had 91 neurons, the second hidden layer (sigmoid) had 50 neurons, the third hidden layer (sigmoid) had 20 neurons, and the output layer had 1 linear node. The testing error of 0.83643 was very high compared to the other approaches. We concluded that the network was learning the training data well but was overfitting.

[Figure: deep network architecture - input layer, hidden layer (hyperbolic tangent), hidden layers (sigmoid), and a linear output layer]
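The layer sizes above can be assembled explicitly in PyBrain roughly as follows; this is a sketch of the architecture only, not our exact training script.

from pybrain.structure import (FeedForwardNetwork, LinearLayer, SigmoidLayer,
                               TanhLayer, FullConnection)

# Layers: 91 visible, 91 tanh, 50 sigmoid, 20 sigmoid, 1 linear output.
net = FeedForwardNetwork()
in_layer = LinearLayer(91)
h1 = TanhLayer(91)
h2 = SigmoidLayer(50)
h3 = SigmoidLayer(20)
out_layer = LinearLayer(1)

net.addInputModule(in_layer)
net.addModule(h1)
net.addModule(h2)
net.addModule(h3)
net.addOutputModule(out_layer)

# Fully connect consecutive layers and finalize the topology.
for src, dst in [(in_layer, h1), (h1, h2), (h2, h3), (h3, out_layer)]:
    net.addConnection(FullConnection(src, dst))
net.sortModules()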

3.3 Gradient Boosting

In parallel, we worked on training gradient boosting models with varying parameters to get the best fit for the data. We started with basic decision stumps, with the number of regressors ranging from 1 to 2000. We also varied the maximum depth of the decision tree used as the regression model from 3 to 7. We used an alpha of 0.9 for our algorithm. We observed that we got the best performance with 2000 boosters and a depth of 7.

Figure 5: Train and Test error plot for Gradient Boosting vs number of learners
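Assuming scikit-learn's GradientBoostingRegressor (the library is not named in the report, and the loss function here is our assumption), this configuration corresponds roughly to the sketch below; Xtr, ytr, Xval, and yval are placeholder arrays for the training and validation splits.

from sklearn.ensemble import GradientBoostingRegressor

# Up to 2000 boosting stages and trees of depth up to 7, as described above.
# Note: in scikit-learn, alpha only affects the 'huber' and 'quantile' losses.
gbr = GradientBoostingRegressor(n_estimators=2000, max_depth=7,
                                loss="huber", alpha=0.9, random_state=0)
gbr.fit(Xtr, ytr)

# Validation RMSE for model selection.
rmse = ((gbr.predict(Xval) - yval) ** 2).mean() ** 0.5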

3.4 Random Forests

Several aspects of the Random Forest technique were explored. The fundamental idea behind Random Forests is to take a model that overfits the data and then use feature and data bagging to bring the complexity down so that it fits the data better. The usual model used in a Random Forest is a high-depth regression tree. We also tried to explore other models that overfitted the data.

The first option was to consider simple linear regression with feature transformation. The data X was transformed into X and X^2 features, and linear regression was done on that. Significantly better results were obtained with this transformation (a test error of 0.4322 compared to 0.4181), but performance significantly worsened with the addition of X^3 features to the feature list. This was used as the regressor for the Random Forests, but the results were better for a tree regressor. The major takeaway from this analysis was the use of the X^2 features in the feature list for tree regression. Several other regressors, such as a kNN regressor, were also tried, but the tree regressor came out on top.

Since decision tree regression was significantly better than linear regression in the Random Forest, we decided to proceed with it, with the X^2 features also in place (a total of 182 features). nFeatures was chosen as 150, and depths of 13, 14, 15, 16, and 17 were tried, of which a maxDepth of 14 obtained optimal performance. 150 decision trees were learned, and the optimal results were obtained with 90 learners.
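A sketch of this configuration, assuming scikit-learn's RandomForestRegressor (where the report's nFeatures and maxDepth settings map to max_features and max_depth), is shown below; Xtr, ytr, and Xte are placeholder arrays for the training and test splits.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Augment the 91 raw features with their squares: 182 features in total.
Xtr_aug = np.hstack([Xtr, Xtr ** 2])
Xte_aug = np.hstack([Xte, Xte ** 2])

# 150 trees, 150 candidate features per split, depth capped at 14.
rf = RandomForestRegressor(n_estimators=150, max_features=150,
                           max_depth=14, random_state=0, n_jobs=-1)
rf.fit(Xtr_aug, ytr)
yhat = rf.predict(Xte_aug)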

Learner                              Training Error (MSE)    Testing Error (MSE)
Linear Regressor                     0.4068                  0.4243
Linear Regressor with X^2 features   0.3996                  0.4140
Tree Regressor                       0.1951                  0.3822

Table 2: MSE Error rates for Random Forests

4 Ensemble of all Learners

At the end, since we had trained a lot of learners separately, some of which were ensembles themselves, we thought of aggregating the results of the learners to improve our prediction. We also analyzed the variance between the results of our learners, and an average variance of 0.0204 was obtained. Since the variance was noticeable, a weighted-average aggregation of the results seemed the best approach. We chose the model parameters for the best-performing models from each category to get a consolidated result. Figure 6 shows the architecture of our ensemble. Initially, we chose a very simple approach of assigning all models the same weight to get a prediction. We got some improvement, with an RMSE of 0.5908. We saw that this was performing just below our best individual prediction model, so we decided to increase the weight of our best learner in the ensemble. This helped improve our aggregated prediction, providing an RMSE of 0.5878.
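The aggregation itself reduces to a weighted average of the individual models' test predictions; a minimal sketch with illustrative weights (not our exact ones) is given below.

import numpy as np

# Test-set predictions from the individual models (assumed, one array per model):
# mixture of experts, neural network, gradient boosting, and random forest.
predictions = [pred_moe, pred_nn, pred_gb, pred_rf]

# Start with uniform weights, then bump up the best individual learner.
weights = np.array([1.0, 1.0, 1.0, 2.0])
weights = weights / weights.sum()  # normalize so the weights sum to one

y_ensemble = sum(w * p for w, p in zip(weights, predictions))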

Figure 6: Ensemble of Learners

5 Conclusion

This project gave us a glimpse of how machine learning techniques are applied to real-world problems. We applied a variety of techniques including neural networks, decision trees, random forests, gradient boosting, k-means clustering, and PCA. Testing out various parameters of the different learner types helped us identify where each of the models under-fitted and over-fitted the data. Finally, while modifying the parameters of each model helped us reduce the variance of the individual models, we used a final weighted ensemble of various learners to reduce the bias of the individual learners.
