House Price Estimation as a Function Fitting Problem Using an ANN Approach
TRANSCRIPT
Machine Learning and Pattern Recognition
Term Project
Lecturer: Asst. Prof. Dr. Cemal Okan ŞAKAR
Student: Yusuf Ziya UZUN
Course: CMP5130 - Machine Learning and Pattern Recognition (Fall 2014)
Subject: House Price Estimation as a Function Fitting Problem Using an ANN Approach
Introduction
Since the first homework already covered how to implement cross-validation and k-fold approaches on our datasets, my aim here is to show how to use these techniques with different machine learning algorithms rather than how to implement them. So in this project I want to introduce the Neural Network Toolbox that MATLAB provides us.
For this purpose I picked the Artificial Neural Network (ANN) algorithm with the UCI Housing dataset. As a brief introduction, I would like to give some links below:
- Neural Network Toolbox for Matlab: http://www.mathworks.com/products/neural-network/
- UCI Housing Dataset: https://archive.ics.uci.edu/ml/datasets/Housing
After a small explanation of the tool, I want to describe our dataset and run some experiments on it using the Neural Network Toolbox components. I am happy to say that we will be able to visualize many things in our solutions.
So basically our roadmap is this:
1- Dataset Information
2- Toolbox Information
3- Experiments
4- Comparisons
5- Conclusion
Also, I want to mention that my main aim is to give a basic idea of the NNT and of using it with the ANN algorithm, so this term project only covers the topics that we are going to use for our Housing dataset. I will code in MATLAB and also give some details about that code.
1- Dataset Information
We have housing-value data with 13 properties that give us different information about the houses and their environments. The first 13 attributes are our input data (features) and the 14th will be our target (output). We have 506 items and no missing values.
1. Per capita crime rate per town
2. Proportion of residential land zoned for lots over 25,000 sq. ft.
3. Proportion of non-retail business acres per town
4. 1 if tract bounds Charles river, 0 otherwise
5. Nitric oxides concentration (parts per 10 million)
6. Average number of rooms per dwelling
7. Proportion of owner-occupied units built prior to 1940
8. Weighted distances to five Boston employment centres
9. Index of accessibility to radial highways
10. Full-value property-tax rate per $10,000
11. Pupil-teacher ratio by town
12. 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
13. Percent lower status of the population
14. Median value of owner-occupied homes in $1000's (output)
As you can see, we have a regression problem with several different kinds of input parameters. As a result, we need to estimate the median house price for given inputs. Neural networks are basically a good fit for non-linear problems, because you can tune the number of neurons to fit a given non-linear problem set better.
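Before moving to the toolbox, the dataset can be loaded from a script. This is a sketch assuming the Neural Network Toolbox's bundled copy of the Boston housing data (the house_dataset example); the raw UCI file could equally be read into a 14 x 506 matrix and split the same way.

```matlab
% Load the Boston housing example data that ships with the
% Neural Network Toolbox.
[x, t] = house_dataset;   % x: 13 x 506 inputs, t: 1 x 506 targets

% Each column is one house, each row is one of the 13 features;
% t holds the median value (attribute 14) for each house.
size(x)
```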
2- Toolbox Information
First we start by opening the NNT window with this command: nnstart
After you run this command, a start window should appear. We are going to use the fitting app, which I highlighted in yellow. You can also use the nftool command to open it directly.
Here we select our dataset, but we need to present each data item as a column and each property as a row in step 5, as shown. Now we need to select our cross-validation set sizes in the next step:
Here the training set changes as the validation and test sets change. For this introduction I leave the defaults. Next we are going to define our network architecture:
As shown above, we have two layers: one is the hidden layer, and the other is the linear output layer, which gives us the predicted value.
After everything is set, we can choose the training algorithm and train on the divided dataset. We can also train the dataset multiple times with randomly picked cross-validation splits, so each run will have different results.
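The wizard steps above can also be reproduced in a short script. This is a sketch assuming the x and t variables from the house_dataset example; fitnet, train, and perform are the toolbox functions behind the GUI.

```matlab
% Build a fitting network with 10 hidden neurons, trained with
% Levenberg-Marquardt (these are the nftool defaults).
net = fitnet(10, 'trainlm');

% Random data division, matching the wizard's default ratios.
net.divideParam.trainRatio = 0.70;
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;

% Train; tr records the random division and the per-epoch errors.
[net, tr] = train(net, x, t);

% Predictions and overall mean squared error.
y    = net(x);
perf = perform(net, t, y)
```

Because the division is random, re-running this script gives slightly different performance values each time, just as the GUI's retrain button does.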
After we click the Train button, you will see a window that shows a summary and results as below:
Here we can see some very useful information about our run, such as the performance, training state, error histogram, and regression plots.
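The same four plots can be opened from a script. This is a sketch assuming a trained network with training record tr, targets t, and outputs y as in the earlier snippet; the plot functions named here are the toolbox's own.

```matlab
plotperform(tr)        % performance (MSE) per epoch, per data set
plottrainstate(tr)     % gradient, mu and validation checks per epoch
ploterrhist(t - y)     % histogram of prediction errors
plotregression(t, y)   % regression of outputs against targets
```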
There will also be some information about the error and correlation on the main window:
Performance Plot Example:
Training State Plot:
Error Histogram Plot:
Regression Plot:
In the next steps we can test our network again if we think it is not a good fit. We can go back to previous steps and change the data division or increase/decrease our network size.
After that step we can also see our algorithm visually as a Simulink diagram. To do this, click the Simulink Diagram button in the Deploy Solution window.
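The same diagram can also be generated from the command line. This is a sketch assuming a trained network object net in the workspace; gensim is the toolbox function behind the Deploy Solution button.

```matlab
% Build a Simulink model of the trained network; the model's
% input and output blocks correspond to our 13 features and the
% predicted median price.
gensim(net)
```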
Now a Simulink window should appear, and you should see this basic visualization of our network:
Now click the down arrow in the Function Fitting Neural Network box, which I highlighted in yellow. You will see this:
When you double-click Layer 1, you will get more information:
Here is how our hidden layer looks visually:
And our sigmoid transfer function:
Here is the Layer 2 visually:
Simulink also gives us the chance to debug our implementation step by step and lets us see a simulation of our algorithm. Here I have only introduced the visually generated implementation of our neural network algorithm.
Now let us start with some experiments and use some of these tools to get better outputs.
3- Experiments
Here we will inspect how changing the cross-validation set sizes, training algorithms, and hidden-layer neuron counts affects our accuracy. Then we will use these results for our comparisons. Last, we will try to draw some conclusions from our outputs. We will use randomly divided data sets in each experiment; the reason for this is explained in the conclusion section.
3.1- Changing Training Size, Validation Size, Test Size
Now let's change our data set sizes (as percentages) and get some results. In these experiments we will keep the training algorithm and neuron count the same: the training algorithm will be Levenberg-Marquardt and the neuron count will be the default of 10.
3.1.1- Training Size: 50%, Validation Size: 25%, Test Size: 25%
Training Performance: 13.1571
Validation Performance: 16.9545
Test Performance: 34.1554
3.1.2- Training Size: 60%, Validation Size: 20%, Test Size: 20%
Training Performance: 4.887432
Validation Performance: 24.091745
Test Performance: 27.686587
3.1.3- Training Size: 80%, Validation Size: 10%, Test Size: 10%
Training Performance: 5.254130
Validation Performance: 6.768036
Test Performance: 19.687765
3.2- Changing Training Algorithm
After trying different data sizes, we will change the training function to see its effect on our dataset. In this case, we need to keep the data division constant to compare the functions. So let us take 80% for training, 10% for validation, and 10% for test, and keep the number of neurons at 10.
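In a script, the training function is simply the second argument of fitnet, so the three experiments in this section only change that one string. This sketch assumes the x and t variables from the house_dataset example; best_tperf is the training record's best test-set performance.

```matlab
% Compare the three training functions under a fixed 80/10/10
% division with 10 hidden neurons.
for fcn = {'trainlm', 'trainscg', 'traingdx'}
    net = fitnet(10, fcn{1});
    net.divideParam.trainRatio = 0.80;
    net.divideParam.valRatio   = 0.10;
    net.divideParam.testRatio  = 0.10;
    [net, tr] = train(net, x, t);
    fprintf('%-8s  test MSE: %f\n', fcn{1}, tr.best_tperf);
end
```

Note that each call to train redraws the random division, so the printed numbers will vary from run to run.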
3.2.1- Levenberg-Marquardt (trainlm) Function
We already tried this method as the default training function in the data-size comparisons, so we intentionally skip it here. The performance results are repeated as a reminder:
Training Performance: 5.254130
Validation Performance: 6.768036
Test Performance: 19.687765
3.2.2- Scaled conjugate gradient back propagation (trainscg) Function
Training Performance: 21.853845
Validation Performance: 50.703653
Test Performance: 20.041648
3.2.3- Gradient Descent with Momentum and Adaptive Learning Rate back propagation (traingdx) Function
Training Performance: 20.875673
Validation Performance: 19.159098
Test Performance: 10.541039
3.3- Changing Hidden Layer Neuron Size
After experimenting with data divisions and training functions, we are going to change the number of neurons in our hidden layer, so we can see the relation between accuracy and neuron count. Now let's again pick 80% for training and 10% each for validation and test, and let's pick the Adaptive Gradient Descent with Momentum back propagation (traingdx) function as the training algorithm, as default parameters.
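The neuron-count sweep can be sketched the same way, again assuming x and t from the house_dataset example; best_perf, best_vperf, and best_tperf are the training record's best training, validation, and test errors.

```matlab
% Sweep the hidden-layer size with traingdx and an 80/10/10 division.
for h = [5 10 15]
    net = fitnet(h, 'traingdx');
    net.divideParam.trainRatio = 0.80;
    net.divideParam.valRatio   = 0.10;
    net.divideParam.testRatio  = 0.10;
    [net, tr] = train(net, x, t);
    fprintf('%2d neurons  train/val/test MSE: %f / %f / %f\n', ...
            h, tr.best_perf, tr.best_vperf, tr.best_tperf);
end
```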
3.3.1- Selecting 5 Neurons
Training Performance: 59.678844
Validation Performance: 50.404263
Test Performance: 100.167308
3.3.2- Selecting 10 Neurons
We already did this experiment in section 3.2.3 with the same parameters, so this experiment is intentionally skipped. The same performance results are repeated as a reminder:
Training Performance: 20.875673
Validation Performance: 19.159098
Test Performance: 10.541039
3.3.3- Selecting 15 Neurons
Training Performance: 13.784050
Validation Performance: 29.685621
Test Performance: 16.404043
4- Comparisons
4.1- Comparison of Data Sizes
As you can see in the table below, as the training set increases, our total performance value decreases. Performance is at its best point when the error reaches zero, so performance improves as this value decreases toward zero. As we know from cross-validation techniques, more training data makes our algorithm more accurate, but it may also cause overfitting problems. Therefore, we should keep the validation set at a suitable proportion.
For performance calculations we always used the Mean Squared Error (MSE) as the cost function: MSE = (1/N) * sum_i (t_i - y_i)^2, where t_i is the target value and y_i the network's prediction.
In the first data division we see that our training and validation sets are OK, but on the test set there is a big performance gap. That is because we could not give a large enough proportion of the data to train our network.
Also, in the second row we have a nicely improved training-set performance, but our validation and test sets are still far from the training accuracy. So this is still a problem for test accuracy.
In the third, the performances of the training and validation sets are very close to each other, and the test-set performance also improved well. Looking at the training and validation sets, we can say that our algorithm learned well with these proportions of data, so its effect on the test set is obviously positive.
We can also check the regression plots of each data division and see how well it fits the target. We see that the best fit is the 80-10-10 division.
Data Sets     | Training  | Validation | Test      | Total Perf.
50 – 25 – 25  | 13.1571   | 16.9545    | 34.1554   | 19.356025
60 – 20 – 20  | 4.887432  | 24.091745  | 27.686587 | 13.2881256
80 – 10 – 10  | 5.254130  | 6.768036   | 19.687765 | 6.8489
4.2- Comparison of Training Functions
The Levenberg-Marquardt algorithm (LMA) interpolates between the Gauss-Newton algorithm and the method of gradient descent (GD). Generally LMA is much faster than GD, because it converges faster algorithmically. However, in our results the LM algorithm achieves lower precision in predictive performance compared with the GD algorithms. An interesting observation is that LMA, despite the lower MSE value on the training set, does not give better test-set prediction than adaptive GD.
The gradient descent algorithm converges slowly by design. For this reason a momentum term is added to it, which reduces the risk of getting stuck in a local minimum and converges faster with less zig-zag in the cost function. traingdx additionally adapts the learning rate so that it fits the problem better.
In the table above, LMA performs best on the training and validation sets by far. But it is also seen that LMA's test performance is worse than the Adaptive GD with momentum method. So LMA looks best overall but generalizes less well to unseen data than the GD method, which got better accuracy on the test dataset.
On the other hand, we see that the Scaled Conjugate Gradient (SCG) method does not have good performance results. This is because the validation checks are the same for each algorithm: I intentionally left the maximum number of consecutive validation failures at the default of 6. You can also try bigger numbers by setting its value (net.trainParam.max_fail), and you will see it performs well.
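The max_fail change mentioned above can be sketched as follows, again assuming the x and t variables from the house_dataset example; 20 is just an illustrative value.

```matlab
% Allow more consecutive validation failures before early stopping,
% which gives trainscg more epochs to converge.
net = fitnet(10, 'trainscg');
net.trainParam.max_fail = 20;   % toolbox default is 6
[net, tr] = train(net, x, t);
```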
Function  | Training  | Validation | Test      | Total Perf.
trainlm   | 5.254130  | 6.768036   | 19.687765 | 6.8489
trainscg  | 21.853845 | 50.703653  | 20.041648 | 24.5576
traingdx  | 20.875673 | 19.159098  | 10.541039 | 19.6706
4.3- Comparison of Hidden Layer Neuron Size
Finding a good number of hidden-layer neurons is one of the important ANN problems. A small number of neurons might give you faster results but worse accuracy. On the other hand, increasing the number of neurons can give you better accuracy but more time and space complexity, and it also makes the model more complex. A small number of neurons can be responsible for underfitting, while more neurons than necessary can cause overfitting. So here we have to find an optimal number of neurons in the hidden layer, balancing this tradeoff carefully.
As we can see in the performance table, 5 neurons are too few for this dataset. The training and validation performances are quite bad compared to the other neuron counts. Obviously the network cannot pass enough information forward: our Adaptive GD algorithm tries to minimize the cost function, but the weight parameters do not have enough capacity to carry it. This causes an underfitting problem.
When we compare the 10- and 15-neuron performances, we see that 15 neurons gets better results on the training set, but not on the test set. We also see that the training and validation performances for 15 neurons are very different. The reason is that too many weight parameters cause overtraining: we have 13 input parameters in our dataset, but we defined 15 neurons in the hidden layer.
Neuron Size | Training  | Validation | Test       | Total Perf.
5 Neurons   | 59.678844 | 50.404263  | 100.167308 | 62.8002
10 Neurons  | 20.875673 | 19.159098  | 10.541039  | 19.6706
15 Neurons  | 13.784050 | 29.685621  | 16.404043  | 15.6362
5- Conclusion
As we have seen in the visualized plots for different parameters and algorithms, there is no single best choice. We can also say that different dataset divisions may (rarely) cause totally different results. For sure, we could change a lot of parameters in these algorithms and try to cross-validate all of them, but for the sake of this project we only analyzed most of them with default parameters. As you know, the project experiments and results are tightly coupled to the given dataset, so we can simply recall the no-free-lunch theorem.
In most neural-network fitting problems, whether the dataset is simple or complex, we have to resolve some cross-validation problems to get good fitting results. Some of them are:
- Hidden-layer neuron count
- A good number of iterations, to prevent underfitting and overfitting
- The time/space/accuracy tradeoff
- Algorithm-specific predetermined values (learning rate, bias values, etc.)
For our house-price estimation dataset we tried different data divisions, neuron counts, and algorithms. Instead of taking one predefined data division, we used randomly divided data for every experiment. So we are now able to draw conclusions from each experiment separately, and we saw that this randomness does not affect the predicted results that much. We tried to find the best fit for our target results, and as far as our experiments went, we observed that the algorithms have different accuracies on the training, validation, and test sets.
It is seen that a larger training set gives us more accuracy, but the data also needs to be divided in good proportions; otherwise it will lead to an overtraining problem.
Another important criterion is the hidden layer's neuron count. Using too few neurons clearly shows a dramatic decrease in performance, meaning the neurons cannot carry enough information to generate good results.
From the training functions' point of view, we can easily say that one can perform better at one thing and another at something else. For example, Adaptive GD is good at test-set performance but very slow compared to LMA. There are tradeoffs (like time vs. accuracy) in the choice of algorithm.
For this dataset, I would go with 10 hidden-layer neurons, an 80% training set, a 10% validation set, a 10% test set, and the Adaptive GD with momentum method. Because the dataset is not so big, I would pick accuracy rather than speed.
Of course, it would be very good to make other experiments with different parameters, or to try some dimensionality-reduction methods before applying the ANN (like PCA, LDA, or some feature selection). We could also have tried different resampling methodologies such as leave-one-out or bootstrapping. All of these would give us very good information. For now, we are aware of these methods but were not able to apply them.
References:
https://archive.ics.uci.edu/ml/datasets/Housing
http://www.mathworks.com/help/nnet/examples/house-price-estimation.html
http://www.mathworks.com/help/nnet/gs/fit-data-with-a-neural-network.html
http://www.mathworks.com/help/nnet/ug/choose-a-multilayer-neural-network-training-function.html
http://www.mathworks.com/help/nnet/ref/traingdx.html
http://www.mathworks.com/help/nnet/ref/trainlm.html
http://www.mathworks.com/help/nnet/ref/trainscg.html
http://radio.feld.cvut.cz/matlab/toolbox/nnet/trainlm.html
http://radio.feld.cvut.cz/matlab/toolbox/nnet/traingdx.html
http://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm
http://en.wikipedia.org/wiki/Gradient_descent
http://alumni.cs.ucr.edu/~vladimir/cs171/nn_summary.pdf
http://aix1.uottawa.ca/~isoltani/ANN.pdf