
[email protected] U37074009

Analysis of the Boston Housing Data from the 1970 census: Diverse Tests and Model Selection Processes regarding the Variables in Boston Housing Data

Shuai Yuan

December 8, 2016

Abstract

In this project, we study the Boston Housing data introduced by Harrison and Rubinfeld (1978). The data contain 14 variables related to housing in 506 census tracts of Boston from the 1970 census and are included in the R package mlbench. Using the data and the R software, we first study the scatterplot matrix and the correlations of the variables to get a first view of their relationships. Then, we test various null hypotheses to examine the properties of different models. Finally, we perform model selection using several methods, namely the forward algorithm, the backward algorithm, and the AIC and BIC criteria, to find and analyze the best-fitting model for our data set. At the same time, we also compute the SSPE for a subset of the data.


Contents

1 Introduction

2 Analysis

2.1 Analysis of the linearity between variables

2.1.1 Scatterplot matrix for variables

2.1.2 Explanation of Correlation between two variables

2.2 The statistical tests for the Null Hypotheses of the fitted model

2.3 Model selection by using the forward algorithm

2.4 Model selection by using the backward algorithm

2.5 Model selection by using the AIC and BIC criterion

2.6 Analysis of the related statistics

2.6.1 Fit the model by using the subset of the data

2.6.2 Compute and analyze the SSPE for subset of the data

3 Conclusion

4 Appendix


1 Introduction

The Boston Housing data from the 1970 census are used in this project. The dataset contains 14 variables with 506 observations and is included in the R package mlbench.

In this project, we use various tools to analyze the Boston Housing data, the most frequently used being linear regression. We also use hypothesis testing (t-tests and F-tests) as well as model selection to analyze the properties of the data. Using the data and the R software, we first study the scatterplot matrix and the correlations of the variables to get a first view of their relationships. Then, we test various null hypotheses to examine the properties of different models. Finally, we perform model selection using the forward algorithm, the backward algorithm, and the AIC and BIC criteria to find and analyze the best-fitting model for our data set. At the same time, we also compute the SSPE for a subset of the data.

The outline for the remainder of the paper is as follows. In Section 2, we provide the main results and analysis of the multiple aspects of our topic. Section 3 concludes. In the Appendix, we provide our R code together with the related output. To be specific, Section 2.1.1 answers Question 1, Section 2.1.2 Question 2, Section 2.2 Question 3, Section 2.3 Question 4, Section 2.4 Question 5, Section 2.5 Question 6, and Section 2.6 Question 7.


2 Analysis

To get a brief understanding of the relationships between the different variables, we first examine the scatterplot matrix of these variables and find non-linearity between them. Therefore, the correlation of these variables may not be appropriate for describing the relationships between them. We then compute different test statistics and test several hypotheses for the general model. Moreover, we perform variable selection using the forward algorithm, the backward algorithm, and the AIC and BIC criteria. We find that both criteria select the same model, and we explain why the selected model is the one we need. Finally, we fit the selected model on a subset of the data and compute and compare the SSPE of the selected models.

2.1 Analysis of the linearity between variables

2.1.1 Scatterplot matrix for variables

First, according to the description of the R package mlbench, the four variables below have the following meanings; their scatterplot matrix is shown in Plot 1.

nox: nitric oxides concentration (parts per 10 million).

indus: proportion of non-retail business acres per town.

dis: weighted distances to five Boston employment centres.

tax: full-value property-tax rate per USD 10,000.

Plot 1: Scatterplot matrix for the variables nox, indus, dis, and tax.
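A minimal R sketch to reproduce this scatterplot matrix (the full session is in the Appendix):

# Load the Boston Housing data from the mlbench package
library(mlbench)
data("BostonHousing")

# Scatterplot matrix of the four variables of interest
pairs(~ nox + indus + dis + tax, data = BostonHousing,
      main = "Scatterplot for nox, indus, dis, tax")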


According to the scatterplot matrix, these four variables are all related in some pattern. Generally speaking, the variable nox is negatively related to the variable dis, and the variable indus is also negatively related to the variable dis. The relationships between the other pairs of variables tend to be positive at low values, while at high values they become vague and weak.

We can also find possible explanations from the meanings of these variables. The variable nox, the nitric oxides concentration (parts per 10 million), represents the degree of air pollution in an area. The variable dis, the weighted distance to five Boston employment centres, represents how far an area lies from downtown. The variable indus, the proportion of non-retail business acres per town, represents how industrial and commercial the area is, and such land use is concentrated near the employment centres. As we know, the air in areas far from downtown is better because there are more trees, so the level of pollution there is lower; it is therefore reasonable to see a negative relationship between nox and dis. Likewise, areas far from downtown contain a smaller proportion of non-retail business land than the downtown areas, so it is also reasonable to see a negative relationship between indus and dis.

2.1.2 Explanation of Correlation between two variables

We know that the correlation coefficient between two variables X and Y is defined as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{D(X)\,D(Y)}},$$

where D denotes the variance. According to the R output, the correlation between the variable nox and the variable dis is about -0.7692301, which suggests that these two variables are negatively correlated.


However, we should not forget that the correlation coefficient is designed to measure a linear relationship between two variables. From the scatterplot we can see that the relationship between nox and dis looks more like a nonlinear, roughly reciprocal or exponential decay than a straight line, which means it is not reasonable to use the correlation coefficient to examine the relationship between them.

We can also check this by fitting a model between them. Assuming such a nonlinear relation between nox and dis, we obtain a highly significant p-value for the fitted model. Therefore, according to the discussion above, we can safely conclude that the correlation between these two variables should not be used to quantify the strength of the relationship between nox and dis.
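A minimal R sketch of this check; note the I() wrapper, which makes / act as arithmetic division rather than the formula nesting operator:

# Linear correlation (about -0.769), and a reciprocal-in-dis model
cor(BostonHousing$nox, BostonHousing$dis)
recip <- lm(nox ~ I(1/dis), data = BostonHousing)
summary(recip)   # the 1/dis term is highly significant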


2.2 The statistical tests for the Null Hypotheses of the fitted model

For this question, the full model contains five explanatory variables and an intercept:

$$\mathrm{nox} = \beta_0 + \beta_1\,\mathrm{dis} + \beta_2\,\log(\mathrm{dis}) + \beta_3\,\mathrm{dis}^2 + \beta_4\,\mathrm{indus} + \beta_5\,\mathrm{tax} + \varepsilon.$$

Here β0 is the intercept, β1 measures the change of nox if dis increases by one unit, β2 measures the change of nox if log(dis) increases by one unit, β3 measures the change of nox if dis^2 increases by one unit, β4 measures the change of nox if indus increases by one unit, and β5 measures the change of nox if tax increases by one unit. Since three of these variables are already in the data set, we only need to create the remaining two, log(dis) and dis^2, which we store as the new variables logdis and dissquare.
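These transformations and the full-model fit mirror the Appendix code for Question 3:

# Create log(dis) and dis^2 as new columns, then fit the full model
data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing, logdis = log(dis), dissquare = dis^2)
u1 <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)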

For this section, since we want to decide whether specified parameters are equal to 0 or to each other, we use an F-test for all three sub-questions. The F statistic for comparing a reduced model (RM) against the full model (FM) is

$$F = \frac{(\mathrm{RSS}_{RM} - \mathrm{RSS}_{FM})/(df_{RM} - df_{FM})}{\mathrm{RSS}_{FM}/df_{FM}}.$$

The F statistics and corresponding p-values for the three questions are summarized below:

               Question a   Question b   Question c
F statistic    5.911        6.0524       42.80353
p-value        0.0154       0.002528     < 0.0001

Table 1: The F statistics and p-values of Questions a, b, and c.
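Each of the tests below compares a reduced model with the full model u1; in R this is a single anova() call on the two fits (Appendix, Question 3):

# Question a: reduced model without logdis, compared to the full model
u2 <- lm(nox ~ dis + dissquare + indus + tax, data = BostonHousing)
anova(u1, u2)   # F = 5.911, p = 0.0154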

Question a:

The null hypothesis here is that the coefficient of the variable log(dis) is equal to 0, i.e. H0: β2 = 0. Since log(dis) is the only variable of interest, we build a new regression model that omits log(dis) and compare it with the original model using an F-test. From the R output, the F statistic is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis depends on the significance level α. Setting α = 0.05, the p-value is smaller than 0.05, so we reject the null hypothesis and conclude that β2 is not equal to 0 at the 5% level. However, if we test at the 1% level, α = 0.01; since the p-value is larger than 0.01, we cannot reject the null hypothesis at that level.

Question b:

For part b, we want to determine whether the coefficients of dis and dis^2 are both equal to 0. Here the null hypothesis is that β1 = β3 = 0, and the alternative hypothesis is that at least one of them is not equal to 0. As in part a, we build a reduced regression model that contains only the intercept and the remaining three variables, excluding dis and dis^2, and compare it with the original full model using an F-test. The F statistic is 6.0524 and the corresponding p-value is 0.002528. Setting α = 0.05 again, the p-value is smaller than 0.05, so we reject the null hypothesis and conclude that at least one of β1 and β3 is not equal to 0.

Question c:

The situation for part c is different: we want to test whether β2 = β3 = 0 and β4 = β5 simultaneously. Instead of comparing nested models as above, we express the hypothesis in matrix form as H0: Aβ = 0. We split the first part (β2 = β3 = 0) into β2 = 0 and β3 = 0. The first row of the matrix A has a 1 in the position of β2 and 0 everywhere else; the second row has a 1 in the position of β3 and 0 everywhere else, so the first two entries of Aβ are β2 and β3. To test whether β4 = β5, the third row of A has a 1 in the position of β4 and a -1 in the position of β5, so the third entry of Aβ is β4 - β5. We then test whether Aβ = 0 with an F-test. The F statistic is 42.80353 and the corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than α, so we reject the null hypothesis and conclude that at least one of β2 and β3 is not equal to 0, or that β4 is not equal to β5.
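In matrix form, the statistic computed in the Appendix code is the general linear hypothesis F statistic, using the estimated coefficient covariance matrix from vcov():

$$F \;=\; \frac{(A\hat{\beta})^{\top}\bigl[A\,\widehat{\mathrm{Var}}(\hat{\beta})\,A^{\top}\bigr]^{-1}(A\hat{\beta})}{q}, \qquad q = 3 \text{ constraints.}$$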


2.3 Model selection by using the forward algorithm

In this section, we use the forward algorithm to analyze the relationship between the response variable and the potential explanatory variables below. According to the question's requirements, the original variables are transformed as follows.

Response variable:

log(medv): the natural logarithm of the median value of owner-occupied homes in USD 1000's.

Potential explanatory variables:

rm^2: the square of the average number of rooms per dwelling.

log(dis): the natural logarithm of the weighted distances to five Boston employment centres.

age: the proportion of owner-occupied units built prior to 1940.

We performed variable selection using the forward algorithm with a significance level of 5%. For the forward algorithm, we first regressed the response on each candidate term separately. We named these models forward11 to forward14; the details are in the Appendix. The results of the regressions are summarized in the following table:

name        model                      variable    t value   Pr(>|t|)
forward11   log(medv) ~ 1              intercept   167       <2e-16
forward12   log(medv) ~ rm^2 - 1       rm^2        130       <2e-16
forward13   log(medv) ~ age - 1        age         44.84     <2e-16
forward14   log(medv) ~ log(dis) - 1   log(dis)    54.66     <2e-16

Table 2: Summary of the models forward11 to forward14.

We observe from the table that while all the terms are significant, the intercept has the largest t-value. Hence, we added the intercept to our model. Next, we regressed the response on the intercept together with each of the remaining three variables, in the models forward21 to forward23. The summarized results are shown in the table below:


name        model                   variable    t value   Pr(>|t|)
forward21   log(medv) ~ rm^2        rm^2        18.8      <2e-16
forward22   log(medv) ~ age         age         11.42     <2e-16
forward23   log(medv) ~ log(dis)    log(dis)    9.965     <2e-16

Table 3: Summary of the models forward21 to forward23.

As shown in the table, all the variables are significant, but the variable rm^2 has the largest t-value, so we added rm^2 to the model. Then, we tested adding each of the remaining variables, log(dis) and age, to the model containing rm^2 and the intercept, in the models forward31 and forward32. We obtained the following table:

name        model                          variable    t value   Pr(>|t|)
forward31   log(medv) ~ rm^2 + log(dis)    log(dis)    8.269     1.21e-15
forward32   log(medv) ~ rm^2 + age         age         -10.23    <2e-16

Table 4: Summary of the models forward31 and forward32.

From the results above, both candidate variables are significant, but the variable age has a smaller p-value (larger absolute t-value) than log(dis). Therefore, we added the variable age to our model. Finally, we regressed the response variable log(medv) on all of the variables in the model forward41.

name        model                                variable    t value   Pr(>|t|)
forward41   log(medv) ~ rm^2 + age + log(dis)    log(dis)    1.068     0.286

Table 5: Summary of the model forward41.

Based on the table above, the variable log(dis) is not significant in this model, so we do not add it. Therefore, after the forward selection, our final model is

$$\log(\mathrm{medv}) = \beta_0 + \beta_1\,\mathrm{rm}^2 + \beta_2\,\mathrm{age} + \varepsilon,$$

where ε is the error term.
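For reference, R's add1() can automate each forward step; a sketch assuming the transformed variables Medv, Rm, and Dis defined in the Appendix (Question 4):

# One forward step: which single term improves the current model most?
base <- lm(Medv ~ 1)
add1(base, scope = ~ Rm + age + Dis, test = "F")
# Refit with the chosen term and repeat until no addition is significant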


2.4 Model selection by using the backward algorithm

In this section, we use the backward algorithm to analyze the relationship between the response variable and the potential explanatory variables. According to the question's requirements, we use the transformed variables defined in the previous section. We performed variable selection using the backward algorithm with a significance level of 5%. For the backward algorithm, we start by regressing the response on all variables at once, in the model backward11 (details in the Appendix). The results of the regression are summarized as follows:

name         model                                variable    t value   Pr(>|t|)
backward11   log(medv) ~ rm^2 + age + log(dis)    intercept   21.224    <2e-16
                                                  rm^2        17.676    <2e-16
                                                  age         -5.758    1.48e-08
                                                  log(dis)    1.068     0.286

Table 6: Summary of the model backward11.

Based on these results, all the explanatory variables are significant except log(dis), whose t-value is 1.068 and p-value is 0.286. Thus, we removed the variable log(dis) and built a new model with the remaining variables, called backward21. Here are the results:

name         model                     variable    t value   Pr(>|t|)
backward21   log(medv) ~ rm^2 + age    intercept   32.12     <2e-16
                                       rm^2        17.85     <2e-16
                                       age         -10.23    <2e-16

Table 7: Summary of the model backward21.

After deleting the variable log(dis), the remaining variables are all significant, so we end with the model backward21. This is the same model as the one obtained by the forward algorithm,

$$\log(\mathrm{medv}) = \beta_0 + \beta_1\,\mathrm{rm}^2 + \beta_2\,\mathrm{age} + \varepsilon,$$

where ε is the error term.
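Analogously, drop1() can automate the backward steps; a sketch under the same variable-name assumptions as above:

# One backward step: which single term can be removed?
full <- lm(Medv ~ Rm + age + Dis)
drop1(full, test = "F")
# Remove the least significant term (here Dis) and repeat until all terms are significant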


2.5 Model selection by using the AIC and BIC criterion

First, we can do a preliminary analysis of the full model of interest, which contains rm^2, age, log(dis), and the intercept. In the full linear regression model, the t-value and p-value of each variable indicate whether it is significant. Setting α = 0.05, we can easily see that rm^2, age, and the intercept have p-values below 0.05 and are therefore significant, while log(dis) has a p-value of 0.286 and is not significant.

In this section, we perform variable selection using the AIC and BIC criteria. AIC measures the relative quality of statistical models for a given data set: given a collection of candidate models, AIC estimates the quality of each model relative to the others and thus provides a means for model selection. BIC is a criterion for model selection among a finite set of models; the model with the lowest BIC score is preferred. The formulas for AIC and BIC are

$$\mathrm{AIC}(m) = n \log\frac{\mathrm{RSS}(m)}{n} + 2\,m_v, \qquad \mathrm{BIC}(m) = n \log\frac{\mathrm{RSS}(m)}{n} + \log(n)\,m_v,$$

where m is the regression model, n is the sample size, and m_v denotes the number of estimated coefficients in model m. In this project, the sample size is 506, and we only need to fit all possible regression models in R and compute the corresponding AIC and BIC scores. The candidate models are summarized below:


Candidate model                          AIC score    BIC score
log(medv) ~ 1                            -904.371     -900.145
log(medv) ~ rm^2 - 1                     -659.289     -655.063
log(medv) ~ age - 1                      321.927      326.154
log(medv) ~ log(dis) - 1                 155.969      160.195
log(medv) ~ rm^2 + log(dis) - 1          -750.189     -741.736
log(medv) ~ log(dis) + age - 1           -533.471     -525.018
log(medv) ~ rm^2 + age - 1               -702.556     -694.102
log(medv) ~ age                          -1018.83     -1010.378
log(medv) ~ rm^2                         -1171.36     -1162.907
log(medv) ~ log(dis)                     -993.379     -984.926
log(medv) ~ rm^2 + log(dis)              -1233.86     -1221.175
log(medv) ~ rm^2 + age                   -1265.07     -1252.394
log(medv) ~ log(dis) + age               -1021.36     -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1    -940.149     -929.47
log(medv) ~ rm^2 + log(dis) + age        -1264.22     -1247.315
log(medv) ~ -1                           1132.453     1132.453

Table 8: The AIC and BIC scores of all candidate models.

From the table above, the regression model with the smallest AIC score contains the variables rm^2 and age together with the intercept, and the regression model with the smallest BIC score is the same one. Checking this selected model with α = 0.05, all of its variables are significant. So we select the model containing rm^2, age, and the intercept under both the AIC and the BIC criterion.
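For reference, these scores can be computed with one small helper instead of the per-model lines in the Appendix; a sketch assuming the columns logmedv and rmsq created there:

# AIC/BIC from the residual sum of squares, as defined above
score <- function(fit, n = 506) {
  rss <- sum(residuals(fit)^2)
  p <- length(coef(fit))   # number of estimated coefficients
  c(AIC = n * log(rss / n) + 2 * p, BIC = n * log(rss / n) + log(n) * p)
}
score(lm(logmedv ~ rmsq + age, data = BostonHousing))   # about -1265.07 and -1252.39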


2.6 Analysis of the related statistics

2.6.1 Fit the model by using the subset of the data

According to the results above, we finally choose the model m12, which has the minimum BIC value, as our fitted model. From Question 6, the model can be written as

$$\log(\mathrm{medv}) = \beta_0 + \beta_1\,\mathrm{rm}^2 + \beta_2\,\mathrm{age} + \varepsilon,$$

where ε is the error term. We now use the data from Group1 to fit this model. From the results generated by R, the fitted model is

$$\widehat{\log(\mathrm{medv})} = 2.3360 + 0.0256\,\mathrm{rm}^2 - 0.0048\,\mathrm{age}.$$

Moreover, the p-values of all the explanatory variables are significant at all conventional levels.

2.6.2 Compute and analyze the SSPE for subset of the data

On the other hand, we can apply another method, cross-validation, to further assess the model selection. The procedure is as follows. First, we split the data into two subsets according to a user-defined criterion, Group1 and Group2, also called the training data and the validation data. Second, we fit the model using the data from Group1. Third, based on the data from Group2, we predict the response variable log(medv). Finally, we compute the SSPE, the Sum of Squared Prediction Errors.

Therefore, according to the question, we first divided the original data set BostonHousing into the two groups Group1 and Group2, and then computed the SSPE over Group2 according to its definition,

$$\mathrm{SSPE} = \sum_{i=1}^{n_2} \bigl(\log(\mathrm{medv})_i - \widehat{\log(\mathrm{medv})}_i\bigr)^2,$$

where log(medv)_i denotes the observed response for tract i in Group2, the hat denotes the predicted value computed with R's prediction function, and n_2 is the size of Group2. The resulting SSPE for Group2 is 0.02835043.
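A minimal sketch of this computation, using the same split on zn == 55.0 as the Appendix (Question 7):

# Training/validation split, fit on Group1, predict and score on Group2
Group1 <- subset(BostonHousing, zn != 55.0)
Group2 <- subset(BostonHousing, zn == 55.0)
fit <- lm(logmedv ~ rmsq + age, data = Group1)
pred <- predict(fit, newdata = Group2)
sum((Group2$logmedv - pred)^2)   # SSPE = 0.02835043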

At the same time, the model obtained in Section 2.4 (Question 5),

$$\log(\mathrm{medv}) = \beta_0 + \beta_1\,\mathrm{rm}^2 + \beta_2\,\mathrm{age} + \varepsilon,$$

where ε is the error term, is the same as the model obtained in Section 2.5 (Question 6). Therefore, the two selection procedures give the same result.


3 Conclusion

In this project, we first obtained the scatterplot matrix of four variables: nox, indus, dis, and tax. According to the scatterplot matrix, these four variables are all related in some pattern. Generally speaking, nox is negatively related to dis, and indus is also negatively related to dis; the relationships between the other pairs of variables tend to be positive at low values and become vague and weak at high values, and the meanings of the variables suggest possible explanations for these patterns. We also found that the relationship between nox and dis is non-linear. Therefore, we cannot use the correlation between these two variables to quantify the strength of the relationship between nox and dis.

Second, we tested several null hypotheses for the fitted model. Using F-tests and the related p-values, we found that the p-values for the null hypotheses β2 = 0, β1 = β3 = 0, and (β2 = β3 = 0 together with β4 = β5) are all smaller than 0.05, which means we reject all these null hypotheses at the 5% level.

Third, we used the forward algorithm to find the best model for the regression problem. We added candidate variables to the model step by step, using the p-values of the variables to test whether each one is significant, and found that the final model includes the variable rm^2, the variable age, and the intercept. At the same time, we also used the backward algorithm for model selection: we first fitted the model with all the variables and then removed the insignificant variables one by one according to their p-values. The model found by the backward algorithm is the same as the one found by the forward algorithm.


At the same time, we also used both the AIC and the BIC criterion for model selection. The regression model with the smallest AIC score contains the variables rm^2, age, and the intercept, and the regression model with the smallest BIC score is the same one. Setting α = 0.05, all the variables in this model are significant, so we select the model containing rm^2, age, and the intercept under both the AIC and the BIC criterion.

Finally, we applied another method, cross-validation, to further assess the model selection, and we computed the sum of squared prediction errors (SSPE) on Group2. The model obtained in Section 2.4 (Question 5) is the same as the one obtained in Section 2.5 (Question 6), so the two procedures give the same result.


4 Appendix

The following are the R codes used for this project, together with the corresponding R output. Lines beginning with > are the commands entered; the other lines are output.

R codes:

# Question 1:
# (assumes the data are loaded: library(mlbench); data("BostonHousing"))
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax, main="Scatterplot for nox,indus,dis,tax")

# Question 2:
> cor(nox,dis)
[1] -0.7692301
> model <- lm(nox ~ I(1/dis))   # I() so that / is arithmetic division, not the formula operator
> summary(model)

# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis + dissquare + indus + tax, BostonHousing)
> anova(u1,u2)
Analysis of Variance Table

Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1    500 1.6897
2    501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(b)
> u3 <- lm(nox ~ logdis + indus + tax, BostonHousing)
> anova(u1,u3)
Analysis of Variance Table

Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1    500 1.6897


2    502 1.7306 -2 -0.040907 6.0524 0.002528 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c)
> A = matrix(c(0,0,1,0,0,0, 0,0,0,1,0,0, 0,0,0,0,1,-1), nrow=3, byrow=TRUE)
> Model <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> variance <- (A %*% vcov(Model) %*% t(A))
> E <- eigen(variance, TRUE)
> Evalues <- E$values
> Evectors <- E$vectors
> sqrtvariance <- Evectors %*% diag(1/sqrt(Evalues)) %*% t(Evectors)   # variance^(-1/2)
> Z <- sqrtvariance %*% A %*% coef(Model)
> F <- sum(Z^2)/3   # F statistic with q = 3 constraints
> F
[1] 42.80353

# Question 4:
# (assumes attach(BostonHousing), so that medv, rm, dis, age are visible)
> Medv <- log(medv)
> Rm <- (rm)^2
> Dis <- log(dis)
> forward11 <- lm(Medv~1)
> summary(forward11)

Call:
lm(formula = Medv ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.42507 -0.19983  0.01949  0.18436  0.87751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.03451    0.01817     167   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4088 on 505 degrees of freedom

> forward12 <- lm(Medv~Rm-1)
> summary(forward12)

Call:
lm(formula = Medv ~ Rm - 1)


Residuals:
    Min      1Q  Median      3Q     Max
-2.5860 -0.1694  0.1560  0.4042  2.3811

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Rm 0.0735845  0.0005646   130.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5208 on 505 degrees of freedom
Multiple R-squared: 0.9711, Adjusted R-squared: 0.9711
F-statistic: 1.699e+04 on 1 and 505 DF, p-value: < 2.2e-16

> forward13 <- lm(Medv~age-1)
> summary(forward13)

Call:
lm(formula = Medv ~ age - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.0839 -0.5927  0.3357  1.5142  3.4463

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
age 0.0369330  0.0008236   44.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.373 on 505 degrees of freedom
Multiple R-squared: 0.7993, Adjusted R-squared: 0.7989
F-statistic: 2011 on 1 and 505 DF, p-value: < 2.2e-16

> forward14 <- lm(Medv~Dis-1)
> summary(forward14)

Call:
lm(formula = Medv ~ Dis - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2240 -0.4238  0.6628  1.2085  3.6475

Coefficients:
    Estimate Std. Error t value Pr(>|t|)


Dis  2.17068    0.03972   54.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.165 on 505 degrees of freedom
Multiple R-squared: 0.8554, Adjusted R-squared: 0.8551
F-statistic: 2987 on 1 and 505 DF, p-value: < 2.2e-16

> forward21 <- lm(Medv~Rm)
> summary(forward21)

Call:
lm(formula = Medv ~ Rm)

Residuals:
     Min       1Q   Median       3Q      Max
-1.20269 -0.10530  0.06992  0.17255  1.31948

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.878478   0.063036    29.8   <2e-16 ***
Rm          0.028909   0.001537    18.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3137 on 504 degrees of freedom
Multiple R-squared: 0.4123, Adjusted R-squared: 0.4112
F-statistic: 353.6 on 1 and 504 DF, p-value: < 2.2e-16

> forward22 <- lm(Medv~age)
> summary(forward22)

Call:
lm(formula = Medv ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-1.21816 -0.20280 -0.01733  0.16722  1.08442

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.4860274  0.0427295   81.58   <2e-16 ***
age         -0.0065843  0.0005765  -11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.3647 on 504 degrees of freedom
Multiple R-squared: 0.2056, Adjusted R-squared: 0.204
F-statistic: 130.4 on 1 and 504 DF, p-value: < 2.2e-16

> forward23 <- lm(Medv~Dis)
> summary(forward23)

Call:
lm(formula = Medv ~ Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.18240 -0.21227 -0.02365  0.16558  1.20522

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.66935    0.04024  66.338   <2e-16 ***
Dis          0.30737    0.03084   9.965   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.374 on 504 degrees of freedom
Multiple R-squared: 0.1646, Adjusted R-squared: 0.163
F-statistic: 99.31 on 1 and 504 DF, p-value: < 2.2e-16

> forward31 <- lm(Medv~Rm+Dis)
> summary(forward31)

Call:
lm(formula = Medv ~ Rm + Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.05461 -0.12689  0.03383  0.16131  1.46235

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746011   0.061332  28.468  < 2e-16 ***
Rm          0.026088   0.001484  17.585  < 2e-16 ***
Dis         0.206437   0.024965   8.269 1.21e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2946 on 503 degrees of freedom
Multiple R-squared: 0.4827, Adjusted R-squared: 0.4806
F-statistic: 234.6 on 2 and 503 DF, p-value: < 2.2e-16


> forward32 <- lm(Medv~Rm+age)
> summary(forward32)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16

> forward41 <- lm(Medv~Rm+Dis+age)
> summary(forward41)

Call:
lm(formula = Medv ~ Rm + Dis + age)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
Dis          0.0402145  0.0376701   1.068    0.286
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16


> forward <- lm(Medv~Rm+age)
> summary(forward)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16

# Question 5:
> backward11 <- lm(Medv ~ Rm + age + Dis)
> summary(backward11)

Call:
lm(formula = Medv ~ Rm + age + Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
Dis          0.0402145  0.0376701   1.068    0.286
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16


> backward21 <- lm(Medv ~ Rm + age)
> summary(backward21)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16

# Question 6:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
> n <- 506
> m1 <- lm(logmedv~1)
> m2 <- lm(logmedv~rmsq-1)
> m3 <- lm(logmedv~age-1)
> m4 <- lm(logmedv~logdis-1)
> m5 <- lm(logmedv~rmsq+logdis-1)
> m6 <- lm(logmedv~logdis+age-1)
> m7 <- lm(logmedv~rmsq+age-1)
> m8 <- lm(logmedv~age)
> m9 <- lm(logmedv~rmsq)
> m10 <- lm(logmedv~logdis)
> m11 <- lm(logmedv~rmsq+logdis)
> m12 <- lm(logmedv~rmsq+age)
> m13 <- lm(logmedv~logdis+age)
> m14 <- lm(logmedv~rmsq+logdis+age-1)
> m15 <- lm(logmedv~rmsq+logdis+age)


> m16 <- lm(logmedv~-1)
>
> AIC1 = n*log(sum(m1$residuals^2)/n)+2*1
> AIC2 = n*log(sum(m2$residuals^2)/n)+2*1
> AIC3 = n*log(sum(m3$residuals^2)/n)+2*1
> AIC4 = n*log(sum(m4$residuals^2)/n)+2*1
> AIC5 = n*log(sum(m5$residuals^2)/n)+2*2
> AIC6 = n*log(sum(m6$residuals^2)/n)+2*2
> AIC7 = n*log(sum(m7$residuals^2)/n)+2*2
> AIC8 = n*log(sum(m8$residuals^2)/n)+2*2
> AIC9 = n*log(sum(m9$residuals^2)/n)+2*2
> AIC10 = n*log(sum(m10$residuals^2)/n)+2*2
> AIC11 = n*log(sum(m11$residuals^2)/n)+2*3
> AIC12 = n*log(sum(m12$residuals^2)/n)+2*3
> AIC13 = n*log(sum(m13$residuals^2)/n)+2*3
> AIC14 = n*log(sum(m14$residuals^2)/n)+2*3
> AIC15 = n*log(sum(m15$residuals^2)/n)+2*4
> AIC16 = n*log(sum(m16$residuals^2)/n)+2*0
>
> BIC1 = n*log(sum(m1$residuals^2)/n)+log(n)*1
> BIC2 = n*log(sum(m2$residuals^2)/n)+log(n)*1
> BIC3 = n*log(sum(m3$residuals^2)/n)+log(n)*1
> BIC4 = n*log(sum(m4$residuals^2)/n)+log(n)*1
> BIC5 = n*log(sum(m5$residuals^2)/n)+log(n)*2
> BIC6 = n*log(sum(m6$residuals^2)/n)+log(n)*2
> BIC7 = n*log(sum(m7$residuals^2)/n)+log(n)*2
> BIC8 = n*log(sum(m8$residuals^2)/n)+log(n)*2
> BIC9 = n*log(sum(m9$residuals^2)/n)+log(n)*2
> BIC10 = n*log(sum(m10$residuals^2)/n)+log(n)*2
> BIC11 = n*log(sum(m11$residuals^2)/n)+log(n)*3
> BIC12 = n*log(sum(m12$residuals^2)/n)+log(n)*3
> BIC13 = n*log(sum(m13$residuals^2)/n)+log(n)*3
> BIC14 = n*log(sum(m14$residuals^2)/n)+log(n)*3
> BIC15 = n*log(sum(m15$residuals^2)/n)+log(n)*4
> BIC16 = n*log(sum(m16$residuals^2)/n)+log(n)*0
> min(AIC1,AIC2,AIC3,AIC4,AIC5,AIC6,AIC7,AIC8,AIC9,AIC10,AIC11,AIC12,AIC13,AIC14,AIC15,AIC16)
[1] -1265.073
> AIC12
[1] -1265.073
> min(BIC1,BIC2,BIC3,BIC4,BIC5,BIC6,BIC7,BIC8,BIC9,BIC10,BIC11,BIC12,BIC13,BIC14,BIC15,BIC16)
[1] -1252.394


> BIC12
[1] -1252.394

# Question 7:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
>
> Group1 <- subset(BostonHousing, BostonHousing$zn != 55.0)
> Group2 <- subset(BostonHousing, BostonHousing$zn == 55.0)
> fitmodel <- lm(logmedv~rmsq+age, data = Group1)
> summary(fitmodel)

Call:
lm(formula = logmedv ~ rmsq + age, data = Group1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.07887 -0.10964  0.03389  0.13020  1.41838

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3360096  0.0729281   32.03   <2e-16 ***
rmsq         0.0256286  0.0014406   17.79   <2e-16 ***
age         -0.0047542  0.0004661  -10.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2864 on 500 degrees of freedom
Multiple R-squared: 0.5129, Adjusted R-squared: 0.5109
F-statistic: 263.2 on 2 and 500 DF, p-value: < 2.2e-16

> p <- predict(fitmodel, newdata=Group2)
> SSPE <- sum((Group2$logmedv-p)^2)
> SSPE
[1] 0.02835043