An Example for Multiple Linear Regression

- by Ravindra Gokhale

Reference: Montgomery, D. C., and Runger, G. C., Applied Statistics and Probability for Engineers, Third Edition, John Wiley and Sons, Inc.

The table below presents data on taste-testing 38 brands of pinot noir wine (the data were first reported by Kwan, Kowalski, and Skogerboe in the Journal of Agricultural and Food Chemistry, Vol. 27, 1979, and the set also appears as one of the default data sets in the Minitab software). The response variable is y (that is, “quality”), and we wish to find the “best” regression equation that relates quality to the other five variables. Different aspects of multiple linear regression discussed in class are illustrated in this exercise, with some additional insights.

The data (38 observations):

x1        x2      x3     x4        x5        y
Clarity   Aroma   Body   Flavour   Oakiness  Quality

1 3.3 2.8 3.1 4.1 9.8

1 4.4 4.9 3.5 3.9 12.6

1 3.9 5.3 4.8 4.7 11.9

1 3.9 2.6 3.1 3.6 11.1

1 5.6 5.1 5.5 5.1 13.3

1 4.6 4.7 5 4.1 12.8

1 4.8 4.8 4.8 3.3 12.8

1 5.3 4.5 4.3 5.2 12

1 4.3 4.3 3.9 2.9 13.6

1 4.3 3.9 4.7 3.9 13.9

1 5.1 4.3 4.5 3.6 14.4

0.5 3.3 5.4 4.3 3.6 12.3

0.8 5.9 5.7 7 4.1 16.1

0.7 7.7 6.6 6.7 3.7 16.1

1 7.1 4.4 5.8 4.1 15.5

0.9 5.5 5.6 5.6 4.4 15.5

1 6.3 5.4 4.8 4.6 13.8

1 5 5.5 5.5 4.1 13.8

1 4.6 4.1 4.3 3.1 11.3

0.9 3.4 5 3.4 3.4 7.9

0.9 6.4 5.4 6.6 4.8 15.1

1 5.5 5.3 5.3 3.8 13.5

0.7 4.7 4.1 5 3.7 10.8

0.7 4.1 4 4.1 4 9.5

1 6 5.4 5.7 4.7 12.7

1 4.3 4.6 4.7 4.9 11.6

1 3.9 4 5.1 5.1 11.7

1 5.1 4.9 5 5.1 11.9

1 3.9 4.4 5 4.4 10.8

1 4.5 3.7 2.9 3.9 8.5

1 5.2 4.3 5 6 10.7

0.8 4.2 3.8 3 4.7 9.1

1 3.3 3.5 4.3 4.5 12.1

1 6.8 5 6 5.2 14.9

0.8 5 5.7 5.5 4.8 13.5

0.8 3.5 4.7 4.2 3.3 12.2

0.8 4.3 5.5 3.5 5.8 10.3

0.8 5.2 4.8 5.7 3.5 13.2
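For anyone who wants to reproduce the SPSS results that follow, the sketch below loads the same 38 observations into a pandas DataFrame and fits the full five-variable model (called “MODEL 1” below). This is a Python/statsmodels sketch, not part of the original SPSS exercise; the later sketches in this note reuse the `wine` DataFrame and `model1` defined here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# The 38 observations from the table above.
wine = pd.DataFrame(
    [
        (1.0, 3.3, 2.8, 3.1, 4.1, 9.8),
        (1.0, 4.4, 4.9, 3.5, 3.9, 12.6),
        (1.0, 3.9, 5.3, 4.8, 4.7, 11.9),
        (1.0, 3.9, 2.6, 3.1, 3.6, 11.1),
        (1.0, 5.6, 5.1, 5.5, 5.1, 13.3),
        (1.0, 4.6, 4.7, 5.0, 4.1, 12.8),
        (1.0, 4.8, 4.8, 4.8, 3.3, 12.8),
        (1.0, 5.3, 4.5, 4.3, 5.2, 12.0),
        (1.0, 4.3, 4.3, 3.9, 2.9, 13.6),
        (1.0, 4.3, 3.9, 4.7, 3.9, 13.9),
        (1.0, 5.1, 4.3, 4.5, 3.6, 14.4),
        (0.5, 3.3, 5.4, 4.3, 3.6, 12.3),
        (0.8, 5.9, 5.7, 7.0, 4.1, 16.1),
        (0.7, 7.7, 6.6, 6.7, 3.7, 16.1),
        (1.0, 7.1, 4.4, 5.8, 4.1, 15.5),
        (0.9, 5.5, 5.6, 5.6, 4.4, 15.5),
        (1.0, 6.3, 5.4, 4.8, 4.6, 13.8),
        (1.0, 5.0, 5.5, 5.5, 4.1, 13.8),
        (1.0, 4.6, 4.1, 4.3, 3.1, 11.3),
        (0.9, 3.4, 5.0, 3.4, 3.4, 7.9),
        (0.9, 6.4, 5.4, 6.6, 4.8, 15.1),
        (1.0, 5.5, 5.3, 5.3, 3.8, 13.5),
        (0.7, 4.7, 4.1, 5.0, 3.7, 10.8),
        (0.7, 4.1, 4.0, 4.1, 4.0, 9.5),
        (1.0, 6.0, 5.4, 5.7, 4.7, 12.7),
        (1.0, 4.3, 4.6, 4.7, 4.9, 11.6),
        (1.0, 3.9, 4.0, 5.1, 5.1, 11.7),
        (1.0, 5.1, 4.9, 5.0, 5.1, 11.9),
        (1.0, 3.9, 4.4, 5.0, 4.4, 10.8),
        (1.0, 4.5, 3.7, 2.9, 3.9, 8.5),
        (1.0, 5.2, 4.3, 5.0, 6.0, 10.7),
        (0.8, 4.2, 3.8, 3.0, 4.7, 9.1),
        (1.0, 3.3, 3.5, 4.3, 4.5, 12.1),
        (1.0, 6.8, 5.0, 6.0, 5.2, 14.9),
        (0.8, 5.0, 5.7, 5.5, 4.8, 13.5),
        (0.8, 3.5, 4.7, 4.2, 3.3, 12.2),
        (0.8, 4.3, 5.5, 3.5, 5.8, 10.3),
        (0.8, 5.2, 4.8, 5.7, 3.5, 13.2),
    ],
    columns=["Clarity", "Aroma", "Body", "Flavour", "Oakiness", "Quality"],
)

# MODEL 1: regress quality on all five explanatory variables.
model1 = smf.ols("Quality ~ Clarity + Aroma + Body + Flavour + Oakiness", data=wine).fit()
print(model1.summary())  # should match the SPSS MODEL 1 output shown below (R-square ~ .721)
```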

The figure below is the matrix of scatter plots for the wine quality data, as constructed by SPSS. We notice some indications of possible linear relationships between quality and the regressors, but there is no obvious visual impression of which regressors would be appropriate. (Of course, this is not relevant from the exam point of view, since it was not discussed in class.)

[Figure: matrix of scatter plots of Quality, Clarity, Aroma, Body, Flavour, and Oakiness]

We first fit a model with all five explanatory variables. Call this “MODEL 1”. The SPSS output is as follows:

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .849(a)  .721       .677                1.16254

a Predictors: (Constant), Oakiness, Body, Clarity, Aroma, Flavour

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   111.540          5    22.308        16.506   .000(a)
Residual     43.248           32   1.352
Total        154.788          37

a Predictors: (Constant), Oakiness, Body, Clarity, Aroma, Flavour
b Dependent Variable: Quality

Coefficients(a)

Model 1      B       Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   3.997   2.232                1.791    .083
Clarity      2.339   1.735        .142    1.349    .187   .790        1.266
Aroma        .483    .272         .255    1.771    .086   .420        2.381
Body         .273    .333         .110    .821     .418   .486        2.056
Flavour      1.168   .304         .587    3.837    .001   .373        2.682
Oakiness     -.684   .271         -.247   -2.522   .017   .912        1.097

a Dependent Variable: Quality
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.)

Points to be noted:

R-square = 0.721 = 72.1%
Adjusted R-square = 0.677 = 67.7%
SSR = 111.540, SSE = 43.248, SST = 154.788
Degrees of freedom: for SSR = k = 5; for SSE = n - k - 1 = 38 - 5 - 1 = 32; for SST = n - 1 = 37.
Also note MSR, MSE, and the F-value.
For a manual test, the F-ratio is compared with the critical F value with numerator df = 5 and denominator df = 32.
Note that the p-value is 0.000 (that is, almost negligible). The model is useful.
Verify the value of R-square = SSR/SST = 111.540 / 154.788 = 0.721.
Also verify adjusted R-square = 1 - MSE/[SST/(n-1)] = 1 - 1.352 / [154.788 / 37] = 0.677.
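These two verifications are plain arithmetic; a short sketch using only the sums of squares printed in the ANOVA table:

```python
# Verify R-square and adjusted R-square for MODEL 1 from the ANOVA table.
SSR, SSE, SST = 111.540, 43.248, 154.788
n, k = 38, 5

r_square = SSR / SST                       # 0.7206... ~= 0.721
MSE = SSE / (n - k - 1)                    # 1.3515   ~= 1.352
adj_r_square = 1 - MSE / (SST / (n - 1))   # 0.6769... ~= 0.677
print(round(r_square, 3), round(adj_r_square, 3))
```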

Although the model is useful and the R-square (and adjusted R-square) is good, the third table indicates that not all coefficients are significant. For example, “clarity”, “aroma”, and “body” are not significant (p-value > 0.05). Can they be eliminated from the model to form a simpler model, without compromising much on the R-square?

Also note that the problem of multicollinearity is absent. [An indicator of this is that all the VIF values in the third table are far below 10.] This is good.
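The VIF column can also be reproduced directly. A sketch using statsmodels’ variance_inflation_factor, reusing the `wine` DataFrame from the first sketch (the constant column must be included in the design matrix, but its own VIF is not of interest):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with an intercept column, as in the fitted model.
X = sm.add_constant(wine[["Clarity", "Aroma", "Body", "Flavour", "Oakiness"]])

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 3))
# Expected: all values well below 10 (e.g. Flavour ~ 2.68), so no multicollinearity problem.
```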

Let us have a clean slate again and start including variables one by one. Going by the absolute values of the “Standardized Coefficients” in the third table, the highest is for “flavour”. So, let us start by including just this one variable (this is simple linear regression). Call this “MODEL 2”. The SPSS output is as follows:

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .790(a)  .624       .614                1.27119

a Predictors: (Constant), Flavour

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   96.615           1    96.615        59.789   .000(a)
Residual     58.173           36   1.616
Total        154.788          37

a Predictors: (Constant), Flavour
b Dependent Variable: Quality

Coefficients(a)

Model 1      B       Std. Error   Beta   t       Sig.   Tolerance   VIF
(Constant)   4.941   .991                4.986   .000
Flavour      1.572   .203         .790   7.732   .000   1.000       1.000

a Dependent Variable: Quality

Points to be noted:

R-square = 0.624 = 62.4%
Adjusted R-square = 0.614 = 61.4%
SSR = 96.615, SSE = 58.173, SST = 154.788
(Important: SST has not changed from the previous model. A little thought about the formula for SST shows why this is obvious: SST depends only on the observed y values, not on the fitted model.)
Degrees of freedom: for SSR = k = 1; for SSE = n - k - 1 = 38 - 1 - 1 = 36; for SST = n - 1 = 37.
Also note MSR, MSE, and the F-value.
For a manual test, the F-ratio is compared with the critical F value with numerator df = 1 and denominator df = 36.
Note that the p-value is 0.000 (that is, almost negligible). The model is useful.
Verify the value of R-square = SSR/SST = 96.615 / 154.788 = 0.624.
Also verify adjusted R-square = 1 - MSE/[SST/(n-1)] = 1 - 1.616 / [154.788 / 37] = 0.614.

The third table shows the significance of the coefficients. As this is simple linear regression, this t-test is equivalent to the F-test of the second table (as discussed in class). Thus, the t-test shows significance for the coefficient of “flavour”.

Note that the test statistic in the t-test is 7.732 and the test statistic in the F-test is 59.789, and we verify that 7.732 × 7.732 = 59.784 ≈ 59.789 (since the tests are equivalent, t² = F).

The regression equation can be written as:

quality = 4.941 + 1.572 × flavour

(that is, given a value of “flavour”, we predict the value of “quality” using this equation).
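A sketch of MODEL 2 in the same Python setup, including the t² = F check and a prediction from the fitted equation (the flavour score of 5.0 is just an illustrative input, not from the exercise):

```python
import statsmodels.formula.api as smf

# MODEL 2: simple linear regression of quality on flavour alone.
model2 = smf.ols("Quality ~ Flavour", data=wine).fit()
print(model2.params)  # intercept ~ 4.941, slope ~ 1.572

# In simple regression the slope's t-test and the overall F-test are equivalent: t^2 = F.
t = model2.tvalues["Flavour"]
print(round(t**2, 3), round(model2.fvalue, 3))  # ~59.78 vs ~59.79

# Predict quality for an (illustrative) flavour score of 5.0:
print(model2.predict({"Flavour": [5.0]}))  # ~ 4.941 + 1.572 * 5.0 = 12.80
```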

The above equation, although simple, has lower explanatory power, as indicated by its R-square. Can any other variable be added? Going by the same logic as for taking the “flavour” variable first, the next variable we can try is “aroma”. Note that in the presence of all the other variables, “aroma” was not significant (the p-value of the t-test for “aroma” in the first model was 0.086, which is greater than 0.05). However, let us try it.

The output of the model including “aroma” along with “flavour” (call this “MODEL 3”) is:

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .812(a)  .659       .639                1.22885

a Predictors: (Constant), Aroma, Flavour

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   101.936          2    50.968        33.752   .000(a)
Residual     52.852           35   1.510
Total        154.788          37

a Predictors: (Constant), Aroma, Flavour
b Dependent Variable: Quality

Coefficients(a)

Model 1      B       Std. Error   Beta   t       Sig.   Tolerance   VIF
(Constant)   4.346   1.009               4.307   .000
Flavour      1.170   .291         .588   4.027   .000   .457        2.186
Aroma        .518    .276         .274   1.877   .069   .457        2.186

a Dependent Variable: Quality

By now, you should be able to interpret the model yourself. The important feature of MODEL 3 over MODEL 2 is some improvement in R-square. By the t-test, “flavour” remains significant, while “aroma” is only marginally insignificant (p-value 0.069).

Going back to MODEL 2, a close contender for the next variable to include, going by our “standardized coefficients” logic (instead of “aroma”), is “oakiness”. Let us try that: fit a model with “flavour” and “oakiness” only. Call this MODEL 4. The SPSS output is:

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .813(a)  .661       .642                1.22423

a Predictors: (Constant), Oakiness, Flavour

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   102.332          2    51.166        34.139   .000(a)
Residual     52.456           35   1.499
Total        154.788          37

a Predictors: (Constant), Oakiness, Flavour
b Dependent Variable: Quality

Coefficients(a)

Model 1      B       Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   6.912   1.389                4.977    .000
Flavour      1.642   .199         .825    8.249    .000   .968        1.033
Oakiness     -.541   .277         -.195   -1.953   .059   .968        1.033

a Dependent Variable: Quality

Note that MODEL 4 is very similar to MODEL 3 (in fact, slightly better with respect to R-square and adjusted R-square). The coefficient of “oakiness” is marginally insignificant (the p-value of the t-test is 0.059, slightly greater than 0.05), but that is acceptable. Notice that the VIF values for MODEL 4 are lower than those of MODEL 3; in that sense also, MODEL 4 is comparatively better. This may also be a reason for the comparatively lower performance of MODEL 3: even though the “standardized coefficient” magnitude (the criterion we use for selecting variables, taken from MODEL 1) for “aroma” was higher than that for “oakiness”, “aroma” is more strongly correlated with “flavour” than “oakiness” is.
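This can be checked from the printed output itself: with exactly two regressors, each regressor’s tolerance equals 1 − r², where r is the correlation between the two. MODEL 3’s tolerance of .457 therefore implies |r| ≈ 0.74 between “aroma” and “flavour”, while MODEL 4’s .968 implies |r| ≈ 0.18 between “oakiness” and “flavour”. The sketch below (again reusing `wine`) refits both models and checks the correlations directly:

```python
import statsmodels.formula.api as smf

m3 = smf.ols("Quality ~ Flavour + Aroma", data=wine).fit()
m4 = smf.ols("Quality ~ Flavour + Oakiness", data=wine).fit()
print(round(m3.rsquared_adj, 3), round(m4.rsquared_adj, 3))  # expected ~ .639 vs ~ .642

# Why MODEL 4 edges out MODEL 3: Aroma is far more correlated with Flavour.
print(round(wine["Aroma"].corr(wine["Flavour"]), 3))     # expected |r| ~ 0.74
print(round(wine["Oakiness"].corr(wine["Flavour"]), 3))  # expected |r| ~ 0.18
```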

[This is why I mentioned in class that the “standardized coefficients” – the ones we used from MODEL 1 to determine which variables to include – are just guidelines and may not always work. However, they are good guidelines, as revealed in building MODEL 2, and then in the transitions from MODEL 2 to MODEL 3 and from MODEL 2 to MODEL 4.]

As a final exercise, we again take a clean slate and build MODEL 5 by taking the explanatory variable “body” alone. This variable is chosen because it had the lowest “standardized coefficient” in MODEL 1. Also, in MODEL 1, this variable was highly insignificant (p-value of 0.418) in the presence of the remaining variables. The output of MODEL 5 is as follows:

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .549(a)  .301       .282                1.73354

a Predictors: (Constant), Body

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   46.603           1    46.603        15.508   .000(a)
Residual     108.186          36   3.005
Total        154.788          37

a Predictors: (Constant), Body
b Dependent Variable: Quality

Coefficients(a)

Model 1      B       Std. Error   Beta   t       Sig.   Tolerance   VIF
(Constant)   6.058   1.644               3.685   .001
Body         1.362   .346         .549   3.938   .000   1.000       1.000

a Dependent Variable: Quality

The output reveals that although “body” is significant when considered alone (by the t-test and the equivalent F-test), it is a poor choice, as revealed by the low values of R-square (0.301) and adjusted R-square (0.282). Thus, as mentioned in class, statistical significance does not necessarily imply usefulness. This also confirms that the “standardized coefficients” are a good guideline for selecting variables.
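The “standardized coefficients” (SPSS’s Beta column) can themselves be computed from the unstandardized ones: Beta_j = b_j × s_xj / s_y, i.e. the coefficient one would obtain after z-scoring every variable. A sketch, again reusing `wine` and `model1` from the first sketch:

```python
# Standardized coefficients (SPSS "Beta") recovered from MODEL 1.
s_y = wine["Quality"].std()  # sample standard deviation of the response

for name in ["Clarity", "Aroma", "Body", "Flavour", "Oakiness"]:
    beta = model1.params[name] * wine[name].std() / s_y
    print(name, round(beta, 3))
# Expected to match the Beta column: .142, .255, .110, .587, -.247
```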

However, in real life, when building models with good software, the software will select the best variables for you. As mentioned in class, choose a model that is simple (that is, with as few variables as possible) but does not compromise much on R-square (and adjusted R-square). Thus, one can try many different combinations and finally choose the one that gives the best tradeoff between simplicity and R-square. Moreover, practical aspects, such as which variables are easy to collect data for and the cost of data collection, will also play an important role in real life.
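One way to “try many different combinations” systematically is an exhaustive subset search, scoring each candidate by adjusted R-square. A sketch under the same setup (real software offers stepwise and best-subset procedures that do this more cleverly; the simplicity-versus-fit tradeoff is still weighed by hand):

```python
from itertools import combinations
import statsmodels.formula.api as smf

regressors = ["Clarity", "Aroma", "Body", "Flavour", "Oakiness"]
results = []

# Fit every non-empty subset of the five regressors (2^5 - 1 = 31 models).
for k in range(1, len(regressors) + 1):
    for subset in combinations(regressors, k):
        fit = smf.ols("Quality ~ " + " + ".join(subset), data=wine).fit()
        results.append((fit.rsquared_adj, k, subset))

# Report the best model of each size; then choose the simplest model
# whose adjusted R-square is not much worse than the overall best.
for size in range(1, len(regressors) + 1):
    best = max(r for r in results if r[1] == size)
    print(size, round(best[0], 3), best[2])
```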