House Price Estimation as a Function Fitting Problem Using an ANN Approach
TRANSCRIPT
Machine Learning and Pattern Recognition
Term Project
Lecturer: Asst. Prof. Dr. Cemal Okan ŞAKAR
Student: Yusuf Ziya UZUN
Course: CMP5130 - Machine Learning and Pattern Recognition (Fall 2014)
Subject: House Price Estimation as a Function Fitting Problem Using an ANN Approach
Introduction
Since the first homework already covered how to implement cross-validation and k-fold approaches on our datasets, my aim here is to show how to use these techniques with different machine learning algorithms rather than how to implement them. So in this project I want to introduce the Neural Network Toolbox that MATLAB provides us.
For this purpose I picked the Artificial Neural Network (ANN) algorithm with the UCI Housing dataset. As a brief introduction, I would like to give some links below:
- Neural Network Toolbox for Matlab: http://www.mathworks.com/products/neural-network/
- UCI Housing Dataset: https://archive.ics.uci.edu/ml/datasets/Housing
After a small explanation of the tool, I want to describe our dataset and run some experiments on it using the Neural Network Toolbox components. I am happy to say that we will be able to visualize many things in our solutions.
So basically our roadmap is this:
1- Dataset Information
2- Toolbox Information
3- Experiments
4- Comparisons
5- Conclusion
Also, I want to mention that my main aim is to give a basic idea of the NNT and of using it with the ANN algorithm, so this term project only covers the topics that we are going to use for our Housing dataset. I will code in MATLAB and also give some details about that code.
1- Dataset Information
We have housing-value data with 13 properties that give us different information about the houses and their environments. The first 13 attributes are our input data (features) and the 14th will be our target (output). We have 506 items and no missing values.
1. Per capita crime rate per town
2. Proportion of residential land zoned for lots over 25,000 sq. ft.
3. Proportion of non-retail business acres per town
4. 1 if tract bounds Charles river, 0 otherwise
5. Nitric oxides concentration (parts per 10 million)
6. Average number of rooms per dwelling
7. Proportion of owner-occupied units built prior to 1940
8. Weighted distances to five Boston employment centres
9. Index of accessibility to radial highways
10. Full-value property-tax rate per $10,000
11. Pupil-teacher ratio by town
12. 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
13. Percent lower status of the population
14. Median value of owner-occupied homes in $1000's (output)
As you can see, we have a regression problem with several different kinds of input parameters. As a result, we need to estimate the median house price for given inputs. Neural networks are basically a good fit for non-linear problems, because you can tune the number of neurons to fit a given non-linear problem set better.
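Before moving to the toolbox, the dataset can be loaded from a script. This is a sketch assuming the Neural Network Toolbox's bundled copy of the Boston housing data (the house_dataset example); the raw UCI file could equally be read into a 14 x 506 matrix and split the same way.

```matlab
% Load the Boston housing example data that ships with the
% Neural Network Toolbox.
[x, t] = house_dataset;   % x: 13 x 506 inputs, t: 1 x 506 targets

% Each column is one house, each row is one of the 13 features;
% t holds the median value (attribute 14) for each house.
size(x)
```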
2- Toolbox Information
First we start by opening the NNT window with this command: nnstart
After you run this command, a start window should appear. We are going to use the fitting app, which I highlighted in yellow. You can also use the nftool command to open it directly.
Here we select our dataset, but we need to present each data item as a column and each property as a row in step 5, as shown. Now we need to select our cross-validation set sizes in the next step:
Here the training set changes as the validation and test sets change. For this introduction I leave the defaults. Next we are going to define our network architecture:
As shown above, we have two layers: one is the hidden layer, and the other is the linear output layer, which gives us the predicted value.
After everything is set, we can choose the training algorithm and train on the divided dataset. We can also train the dataset multiple times with randomly picked cross-validation splits, so each run will have different results.
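The wizard steps above can also be reproduced in a short script. This is a sketch assuming the x and t variables from the house_dataset example; fitnet, train, and perform are the toolbox functions behind the GUI.

```matlab
% Build a fitting network with 10 hidden neurons, trained with
% Levenberg-Marquardt (these are the nftool defaults).
net = fitnet(10, 'trainlm');

% Random data division, matching the wizard's default ratios.
net.divideParam.trainRatio = 0.70;
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;

% Train; tr records the random division and the per-epoch errors.
[net, tr] = train(net, x, t);

% Predictions and overall mean squared error.
y    = net(x);
perf = perform(net, t, y)
```

Because the division is random, re-running this script gives slightly different performance values each time, just as the GUI's retrain button does.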
After we click the Train button, you will see a window that shows a summary and results as below:
Here we can see some very useful information about our run, such as the performance, training state, error histogram, and regression plots.
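The same four plots can be opened from a script. This is a sketch assuming a trained network with training record tr, targets t, and outputs y as in the earlier snippet; the plot functions named here are the toolbox's own.

```matlab
plotperform(tr)        % performance (MSE) per epoch, per data set
plottrainstate(tr)     % gradient, mu and validation checks per epoch
ploterrhist(t - y)     % histogram of prediction errors
plotregression(t, y)   % regression of outputs against targets
```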
There will also be some information about the error and correlation on the main window:
Performance Plot Example:
Training State Plot:
Error Histogram Plot:
Regression Plot:
In the next steps we can test our network again if we think it is not a good fit. We can go back to previous steps and change the data division or increase/decrease our network size.
After that step we can also see our algorithm visually as a Simulink diagram. To do this, click the Simulink Diagram button in the Deploy Solution window.
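The same diagram can also be generated from the command line. This is a sketch assuming a trained network object net in the workspace; gensim is the toolbox function behind the Deploy Solution button.

```matlab
% Build a Simulink model of the trained network; the model's
% input and output blocks correspond to our 13 features and the
% predicted median price.
gensim(net)
```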
Now a Simulink window should appear, and you should see this basic visualization of our network:
Now click the down arrow in the Function Fitting Neural Network box, which I highlighted in yellow. You will see this:
When you double-click Layer 1, you will get more information:
Here is how our hidden layer looks visually:
And our sigmoid transfer function:
Here is the Layer 2 visually:
Simulink also gives us the chance to debug our implementation step by step and lets us see a simulation of our algorithm. Here I have only introduced the visually generated implementation of our neural network algorithm.
Now let us start with some experiments and use some of these tools to get better outputs.
3- Experiments
Here we will inspect how changing the cross-validation set sizes, training algorithms, and hidden-layer neuron counts affects our accuracy. Then we will use these results for our comparisons. Last, we will try to draw some conclusions from our outputs. We will use randomly divided data sets in each experiment; the reason for this is explained in the conclusion section.
3.1- Changing Training Size, Validation Size, Test Size
Now let's change our data set sizes (as percentages) and get some results. In these experiments we will keep the training algorithm and neuron count the same: the training algorithm will be Levenberg-Marquardt and the neuron count will be the default of 10.
3.1.1- Training Size: 50%, Validation Size: 25%, Test Size: 25%
Training Performance: 13.1571
Validation Performance: 16.9545
Test Performance: 34.1554
3.1.2- Training Size: 60%, Validation Size: 20%, Test Size: 20%
Training Performance: 4.887432
Validation Performance: 24.091745
Test Performance: 27.686587
3.1.3- Training Size: 80%, Validation Size: 10%, Test Size: 10%
Training Performance: 5.254130
Validation Performance: 6.768036
Test Performance: 19.687765
3.2- Changing Training Algorithm
After trying different data sizes, we will change the training function to see its effect on our dataset. In this case, we need to keep the data division constant to compare the functions. So let us take 80% for training, 10% for validation, and 10% for test, and keep the number of neurons at 10.
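In a script, the training function is simply the second argument of fitnet, so the three experiments in this section only change that one string. This sketch assumes the x and t variables from the house_dataset example; best_tperf is the training record's best test-set performance.

```matlab
% Compare the three training functions under a fixed 80/10/10
% division with 10 hidden neurons.
for fcn = {'trainlm', 'trainscg', 'traingdx'}
    net = fitnet(10, fcn{1});
    net.divideParam.trainRatio = 0.80;
    net.divideParam.valRatio   = 0.10;
    net.divideParam.testRatio  = 0.10;
    [net, tr] = train(net, x, t);
    fprintf('%-8s  test MSE: %f\n', fcn{1}, tr.best_tperf);
end
```

Note that each call to train redraws the random division, so the printed numbers will vary from run to run.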
3.2.1- Levenberg-Marquardt (trainlm) Function
We already tried this method as the default training function in the data-size comparisons, so we intentionally skip it here. The performance results are repeated as a reminder:
Training Performance: 5.254130
Validation Performance: 6.768036
Test Performance: 19.687765
3.2.2- Scaled conjugate gradient back propagation (trainscg) Function
Training Performance: 21.853845
Validation Performance: 50.703653
Test Performance: 20.041648
3.2.3- Gradient Descent with Momentum and Adaptive Learning Rate back propagation (traingdx) Function
Training Performance: 20.875673
Validation Performance: 19.159098
Test Performance: 10.541039
3.3- Changing Hidden Layer Neuron Size
After experimenting with data divisions and training functions, we are going to change the number of neurons in our hidden layer, so we can see the relation between accuracy and neuron count. Now let's again pick 80% for training and 10% each for validation and test, and let's pick the Adaptive Gradient Descent with Momentum back propagation (traingdx) function as the training algorithm, as default parameters.
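The neuron-count sweep can be sketched the same way, again assuming x and t from the house_dataset example; best_perf, best_vperf, and best_tperf are the training record's best training, validation, and test errors.

```matlab
% Sweep the hidden-layer size with traingdx and an 80/10/10 division.
for h = [5 10 15]
    net = fitnet(h, 'traingdx');
    net.divideParam.trainRatio = 0.80;
    net.divideParam.valRatio   = 0.10;
    net.divideParam.testRatio  = 0.10;
    [net, tr] = train(net, x, t);
    fprintf('%2d neurons  train/val/test MSE: %f / %f / %f\n', ...
            h, tr.best_perf, tr.best_vperf, tr.best_tperf);
end
```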
3.3.1- Selecting 5 Neurons
Training Performance: 59.678844
Validation Performance: 50.404263
Test Performance: 100.167308
3.3.2- Selecting 10 Neurons
We already did this experiment in section 3.2.3 with the same parameters, so this experiment is intentionally skipped. The same performance results are repeated as a reminder:
Training Performance: 20.875673
Validation Performance: 19.159098
Test Performance: 10.541039
3.3.3- Selecting 15 Neurons
Training Performance: 13.784050
Validation Performance: 29.685621
Test Performance: 16.404043
4- Comparisons
4.1- Comparison of Data Sizes
As you can see in the table below, as the training set increases, our total performance value decreases. Performance is at its best point when the error reaches zero, so performance improves as this value decreases toward zero. As we know from cross-validation techniques, more training data makes our algorithm more accurate, but it may also cause overfitting problems. Therefore, we should keep the validation set at a suitable proportion.
For performance calculations we always used the Mean Squared Error (MSE) as the cost function: MSE = (1/N) * sum_i (t_i - y_i)^2, where t_i is the target value and y_i the network's prediction.
In the first data division we see that our training and validation sets are OK, but on the test set there is a big performance gap. That is because we could not give a large enough proportion of the data to train our network.
Also, in the second row we have a nicely improved training-set performance, but our validation and test sets are still far from the training accuracy. So this is still a problem for test accuracy.
In the third, the performances of the training and validation sets are very close to each other, and the test-set performance also improved well. Looking at the training and validation sets, we can say that our algorithm learned well with these proportions of data, so its effect on the test set is obviously positive.
We can also check the regression plots of each data division and see how well it fits the target. We see that the best fit is the 80-10-10 division.
Data Sets     | Training  | Validation | Test      | Total Perf.
50 – 25 – 25  | 13.1571   | 16.9545    | 34.1554   | 19.356025
60 – 20 – 20  | 4.887432  | 24.091745  | 27.686587 | 13.2881256
80 – 10 – 10  | 5.254130  | 6.768036   | 19.687765 | 6.8489
4.2- Comparison of Training Functions
The Levenberg-Marquardt algorithm (LMA) interpolates between the Gauss-Newton algorithm and the method of gradient descent (GD). Generally LMA is much faster than GD, because it converges faster algorithmically. However, in our results the LM algorithm achieves lower precision in predictive performance compared with the GD algorithms. An interesting observation is that LMA, despite the lower MSE value on the training set, does not give better test-set prediction than adaptive GD.
The gradient descent algorithm converges slowly by design. For this reason a momentum term is added to it, which reduces the risk of getting stuck in a local minimum and converges faster with less zig-zag in the cost function. traingdx additionally adapts the learning rate so that it fits the problem better.
In the table above, LMA performs best on the training and validation sets by far. But it is also seen that LMA's test performance is worse than the Adaptive GD with momentum method. So LMA looks best overall but generalizes less well to unseen data than the GD method, which got better accuracy on the test dataset.
On the other hand, we see that the Scaled Conjugate Gradient (SCG) method does not have good performance results. This is because the validation checks are the same for each algorithm: I intentionally left the maximum number of consecutive validation failures at the default of 6. You can also try bigger numbers by setting its value (net.trainParam.max_fail), and you will see it performs well.
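The max_fail change mentioned above can be sketched as follows, again assuming the x and t variables from the house_dataset example; 20 is just an illustrative value.

```matlab
% Allow more consecutive validation failures before early stopping,
% which gives trainscg more epochs to converge.
net = fitnet(10, 'trainscg');
net.trainParam.max_fail = 20;   % toolbox default is 6
[net, tr] = train(net, x, t);
```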
Function  | Training  | Validation | Test      | Total Perf.
trainlm   | 5.254130  | 6.768036   | 19.687765 | 6.8489
trainscg  | 21.853845 | 50.703653  | 20.041648 | 24.5576
traingdx  | 20.875673 | 19.159098  | 10.541039 | 19.6706
4.3- Comparison of Hidden Layer Neuron Size
Finding a good number of hidden-layer neurons is one of the important ANN problems. A small number of neurons might give you faster results but worse accuracy. On the other hand, increasing the number of neurons can give you better accuracy but more time and space complexity, and it also makes the model more complex. A small number of neurons can be responsible for underfitting, while more neurons than necessary can cause overfitting. So here we have to find an optimal number of neurons in the hidden layer, balancing this tradeoff carefully.
As we can see in the performance table, 5 neurons are too few for this dataset. The training and validation performances are quite bad compared to the other neuron counts. Obviously the network cannot pass enough information forward: our Adaptive GD algorithm tries to minimize the cost function, but the weight parameters do not have enough capacity to carry it. This causes an underfitting problem.
When we compare the 10- and 15-neuron performances, we see that 15 neurons gets better results on the training set, but not on the test set. We also see that the training and validation performances for 15 neurons are very different. The reason is that too many weight parameters cause overtraining: we have 13 input parameters in our dataset, but we defined 15 neurons in the hidden layer.
Neuron Size | Training  | Validation | Test       | Total Perf.
5 Neurons   | 59.678844 | 50.404263  | 100.167308 | 62.8002
10 Neurons  | 20.875673 | 19.159098  | 10.541039  | 19.6706
15 Neurons  | 13.784050 | 29.685621  | 16.404043  | 15.6362
5- Conclusion
As we have seen in the visualized plots for different parameters and algorithms, there is no single best choice. We can also say that different dataset divisions may (rarely) cause totally different results. For sure, we could change a lot of parameters in these algorithms and try to cross-validate all of them, but for the sake of this project we only analyzed most of them with default parameters. As you know, the project experiments and results are tightly coupled to the given dataset, so we can simply recall the no-free-lunch theorem.
In most neural-network fitting problems, whether the dataset is simple or complex, we have to resolve some cross-validation problems to get good fitting results. Some of them are:
- Hidden-layer neuron count
- A good number of iterations, to prevent underfitting and overfitting
- The time/space/accuracy tradeoff
- Algorithm-specific predetermined values (learning rate, bias values, etc.)
For our house-price estimation dataset we tried different data divisions, neuron counts, and algorithms. Instead of taking one predefined data division, we used randomly divided data for every experiment. So we are now able to draw conclusions from each experiment separately, and we saw that this randomness does not affect the predicted results that much. We tried to find the best fit for our target results, and as far as our experiments went, we observed that the algorithms have different accuracies on the training, validation, and test sets.
It is seen that a larger training set gives us more accuracy, but the data also needs to be divided in good proportions; otherwise it will lead to an overtraining problem.
Another important criterion is the hidden layer's neuron count. Using too few neurons clearly shows a dramatic decrease in performance, meaning the neurons cannot carry enough information to generate good results.
From the training functions' point of view, we can easily say that one can perform better at one thing and another at something else. For example, Adaptive GD is good at test-set performance but very slow compared to LMA. There are tradeoffs (like time vs. accuracy) in the choice of algorithm.
For this dataset, I would go with 10 hidden-layer neurons, an 80% training set, a 10% validation set, a 10% test set, and the Adaptive GD with momentum method. Because the dataset is not so big, I would pick accuracy rather than speed.
Of course, it would be very good to make other experiments with different parameters, or to try some dimensionality-reduction methods before applying the ANN (like PCA, LDA, or some feature selection). We could also have tried different resampling methodologies such as leave-one-out or bootstrapping. All of these would give us very good information. For now, we are aware of these methods but were not able to apply them.
References:
https://archive.ics.uci.edu/ml/datasets/Housing
http://www.mathworks.com/help/nnet/examples/house-price-estimation.html
http://www.mathworks.com/help/nnet/gs/fit-data-with-a-neural-network.html
http://www.mathworks.com/help/nnet/ug/choose-a-multilayer-neural-network-training-function.html
http://www.mathworks.com/help/nnet/ref/traingdx.html
http://www.mathworks.com/help/nnet/ref/trainlm.html
http://www.mathworks.com/help/nnet/ref/trainscg.html
http://radio.feld.cvut.cz/matlab/toolbox/nnet/trainlm.html
http://radio.feld.cvut.cz/matlab/toolbox/nnet/traingdx.html
http://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm
http://en.wikipedia.org/wiki/Gradient_descent
http://alumni.cs.ucr.edu/~vladimir/cs171/nn_summary.pdf
http://aix1.uottawa.ca/~isoltani/ANN.pdf