
Page 1: Simple and multiple linear regression What should be the

MBA Statistics 51-651-00

COURSE #4

Simple and multiple linear regression

What should be the sales of ice cream?

Page 2: Simple and multiple linear regression What should be the

2

Example:

Before building a movie theater, one must estimate the daily number of people who will enter the building.

How can we estimate it? There are 2 million individuals in the city.

Page 3: Simple and multiple linear regression What should be the

3

Possible solutions:

One could carry out a local market study. However, it is often imprecise, especially for new projects.

One could get data from similar projects in other cities.

Page 4: Simple and multiple linear regression What should be the

4

City                 1    2    3    4    5    6    7    8    9   10
Attendance (x1000)  10   12    8   10   14   20   30   16    4   12

What do you think?

Can we do better?

Page 5: Simple and multiple linear regression What should be the

5

Probably, taking into account the size of the city:

City                 1     2     3     4     5     6     7     8     9     10
Attendance (x1000)  10    12     8    10    14    20    30    16     4     12
Size (millions)     0.70  0.90  0.50  0.75  1.40  1.50  2.30  1.40  0.25  0.95

[Scatter plot: attendance (x1000) versus city size (millions)]
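For readers who want to reproduce this kind of estimate outside of Excel, here is a minimal Python sketch (not part of the original slides) that fits a straight line to the ten (size, attendance) pairs above and predicts attendance for a city of 2 million inhabitants; the variable names are ours.

```python
import numpy as np
import statsmodels.api as sm

# Data from the table above: city size (millions) and attendance (x1000)
size = np.array([0.70, 0.90, 0.50, 0.75, 1.40, 1.50, 2.30, 1.40, 0.25, 0.95])
attendance = np.array([10, 12, 8, 10, 14, 20, 30, 16, 4, 12])

# Fit attendance = b0 + b1 * size by ordinary least squares
fit = sm.OLS(attendance, sm.add_constant(size)).fit()
print(fit.params)              # estimated intercept b0 and slope b1

# Point prediction for a city of 2 million inhabitants
print(fit.predict([1.0, 2.0]))
```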

Page 6: Simple and multiple linear regression What should be the

6

Case study: Ice Cream Sales

The file icecream.xls contains pairs of data representing ice cream sales and the temperature recorded that day, for 30 days.
Is there a relation between temperature and sales?
Can temperature be used to predict ice cream sales?
If so, what is the prediction when the temperature is 25?

Page 7: Simple and multiple linear regression What should be the

7

Introduction

One of the principal objectives of statistics is to explain the variability that we observe in data. Linear regression (or linear models) is a widely used statistical tool to study the presence of a linear relation between a dependent variable Y (quantitative and continuous) and one or more variables X1, X2, …, Xp (qualitative and/or quantitative), called independent or explanatory variables.

Page 8: Simple and multiple linear regression What should be the

8

For example, a manager could be interested in seeing whether he could explain a good part of the variability that he observes in the sales of his different branches (dependent variable Y) over the last 12 months by the floor area, the number of employees, the number of paid overtime hours, the quality of customer service, the number of promotions, etc. (independent or explanatory variables).

Page 9: Simple and multiple linear regression What should be the

9

A regression model can be used to meet one of the following three objectives:

Describe data coming from non-experimental studies, i.e. we observe reality as it is.
Test hypotheses (data coming from controlled experimental studies).
Predict (if we like to take risks!!).

Page 10: Simple and multiple linear regression What should be the

10

Example:

• We are interested in knowing what the important factors are that influence or determine the value of a property, and we want to build a model that would help us evaluate this value using certain factors.
• To do this, we have obtained the total value for a sample of 79 properties in a given region. The following variables have also been collected for each property:

Page 11: Simple and multiple linear regression What should be the

11

Brief glimpse of the data file: house.xls

OBS   total value   land value   # of acres   # of sq.feet first floor   outdoor condition   heating type
1     199657        63247        1.63         1726                       Good                NatGas
2     78482         38091        0.495        1184                       Good                NatGas
3     119962        37665        0.375        1014                       Good                Electric
4     116492        54062        0.981        1260                       Average             Electric
5     131263        61546        1.14         1314                       Average             NatGas
...
78    253480        57948        0.862        1720                       Good                Electric
79    257037        57489        0.95         2004                       Excellnt            Electric

OBS   # of rooms   # of bedrooms   # of completed bathrooms   # of non-completed bathrooms   # of fire-places   GARAGE
1     8            4               2                          1                              2                  Garage
2     6            2               1                          0                              0                  NoGarage
3     7            3               2                          0                              1                  Garage
4     6            3               2                          0                              1                  Garage
5     8            4               2                          1                              2                  NoGarage
...
78    10           5               5                          1                              1                  Garage
79    9            4               2                          2                              2                  Garage

Page 12: Simple and multiple linear regression What should be the

12

Is there a link between the total value and the different factors?

[Scatter plot: Total value versus Land value]

Page 13: Simple and multiple linear regression What should be the

13

[Scatter plots: Total value versus Sq.Feet, Acre, Bedroom, and Rooms]

Page 14: Simple and multiple linear regression What should be the

14

[Scatter plots: Total value versus Bathrooms, Completed Bathrooms, Fire-places, and Garage (Garage / NoGarage)]

Page 15: Simple and multiple linear regression What should be the

15

The Pearson correlation coefficient r is used to measure the intensity of the linear relation between two quantitative variables.

The correlation coefficient r takes its values between -1 and 1. If a perfect linear relation exists between X and Y, then r = ±1 (r = 1 if X and Y vary in the same direction and r = -1 if X varies in the opposite direction of Y). If r = 0, there is no linear link between X and Y. The farther r is from 0 and the closer it gets to ±1, the stronger the linear link between X and Y.
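As an illustration (not part of the original slides), r can be computed directly from data; here is a minimal Python sketch using the attendance and city-size data of page 5.

```python
import numpy as np
from scipy import stats

# City size (millions) and attendance (x1000) from page 5
size = np.array([0.70, 0.90, 0.50, 0.75, 1.40, 1.50, 2.30, 1.40, 0.25, 0.95])
attendance = np.array([10, 12, 8, 10, 14, 20, 30, 16, 4, 12])

# Pearson correlation coefficient r, with the p-value of the test of zero correlation
r, p_value = stats.pearsonr(size, attendance)
print(r)   # close to +1: a strong increasing linear relation
```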

Page 16: Simple and multiple linear regression What should be the

16

[Three scatter plots of Y versus X: one with r = 0.035 (no linear link), one with r = 1 (perfect increasing linear relation), and one with r = -1 (perfect decreasing linear relation)]

Page 17: Simple and multiple linear regression What should be the

17

Descriptive statistics

Variable   N    Mean     Median   St.Deviation   Minimum   Maximum
Total      79   187253   156761   84401          74365     453744
Land       79   65899    59861    22987          35353     131224
Acre       79   1.579    1.040    1.324          0.290     5.880
Sq.Feet    79   1678     1628     635            672       3501
Rooms      79   8.519    8.000    2.401          5         18
Bedrooms   79   3.987    4.000    1.266          2         8
C.Bathro   79   2.241    2.000    1.283          1         7
Bathro     79   0.7215   1.000    0.715          0         3
Fire-pl.   79   1.975    2.000    1.368          0         7

Pearson Correlation Coefficients
           Total   Land    Acre    Sq.Feet   Rooms   Bedroom   C.Bathro   Bathro
Land       0.815
Acre       0.608   0.918
Sq.Feet    0.767   0.516   0.301
Rooms      0.626   0.518   0.373   0.563
Bedrooms   0.582   0.497   0.382   0.431     0.791
C.Bathro   0.626   0.506   0.376   0.457     0.479   0.586
Bathro     0.436   0.236   0.074   0.354     0.489   0.166     0.172
Fire-pl.   0.548   0.497   0.391   0.365     0.394   0.400     0.486      0.386

Page 18: Simple and multiple linear regression What should be the

18

BE CAREFUL!! It is important to interpret the correlation coefficient together with the graph.

r = 0.816 in all cases below

[Four scatter plots: Y1, Y2, Y3, and Y4 versus X, each with r = 0.816 but showing very different patterns]

Page 19: Simple and multiple linear regression What should be the

19

Simple linear regression
To describe a linear relation between two quantitative variables or to be able to predict Y for a given value of X, we use a regression line:

Y = β0 + β1X + ε
Since any statistical model is only an approximation (we hope the best possible!!) and because the linear link is never perfect, there is always an error term in the model, noted ε. If there were a perfect linear relation between Y and X, the error term would always be equal to 0, and all the variability of Y would be explained by the independent variable X.

Page 20: Simple and multiple linear regression What should be the

20

So, for a given value of X, we would like to estimate Y. Thus, with the help of the sample data, we will estimate the regression model parameters β0 and β1 in order to minimize the residual (error) sum of squares. The squared correlation coefficient is called the coefficient of determination and represents the percentage of the variability of Y explained by X:

R² = 1 - [(n-2)/(n-1)] (Se/Sy)², where Se is the standard deviation of the errors and Sy is the standard deviation of Y.

Page 21: Simple and multiple linear regression What should be the

21

We can also use the adjusted coefficient of determination to indicate the percentage of the variability of Y explained by X:

R² adjusted = 1 - (Se/Sy)².
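Both formulas can be checked numerically. The sketch below is our own illustration on simulated data; it compares them with the R² and adjusted R² reported by a standard least-squares routine.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)     # simulated (X, Y) pairs

fit = sm.OLS(y, sm.add_constant(x)).fit()

s_e = np.sqrt(fit.ssr / (n - 2))        # Se: standard deviation of the errors
s_y = np.std(y, ddof=1)                 # Sy: standard deviation of Y

print(1 - (n - 2) / (n - 1) * (s_e / s_y) ** 2, fit.rsquared)      # R-squared
print(1 - (s_e / s_y) ** 2, fit.rsquared_adj)                      # adjusted R-squared
```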

Page 22: Simple and multiple linear regression What should be the

22

Simple linear regression example: MODEL 1.

Regression Analysis
The regression equation is

Total = 16209 + 102 Sq.Feet

Predictor   Coef      StDev    T       P
Constant    16209     17447    0.93    0.356
Sq.Feet     101.939   9.734    10.47   0.000

S = 54556   R-Sq = 58.8%   R-Sq(adj) = 58.2%

Analysis of Variance

Source           DF   SS            MS            F        P
Regression       1    3.26460E+11   3.26460E+11   109.68   0.000
Residual Error   77   2.29181E+11   2976374177
Total            78   5.55641E+11
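Output like MODEL 1 can be reproduced outside Minitab or Excel; the hedged sketch below assumes that house.xls has been exported to a CSV file with columns named Total and SqFeet (our names), and should give coefficients close to the ones above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: house.xls exported to house.csv with columns 'Total' and 'SqFeet'
house = pd.read_csv("house.csv")

model1 = smf.ols("Total ~ SqFeet", data=house).fit()
print(model1.summary())   # coefficients, t tests, R-Sq, and the ANOVA F test
```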

Page 23: Simple and multiple linear regression What should be the

23

MODEL 2.
The regression equation is: Total = -347 + 22021 Rooms

Predictor   Coef    StDev   T       P
Constant    -347    27621   -0.01   0.990
Rooms       22021   3122    7.05    0.000

S = 66210   R-Sq = 39.3%   R-Sq(adj) = 38.5%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       1    2.18090E+11   2.18090E+11   49.75   0.000
Residual Error   77   3.37551E+11   4383775699
Total            78   5.55641E+11

MODEL 3.
The regression equation is: Total = 32428 + 38829 Bedrooms

Predictor   Coef    StDev   T      P
Constant    32428   25826   1.26   0.213
Bedrooms    38829   6177    6.29   0.000

S = 69056   R-Sq = 33.9%   R-Sq(adj) = 33.1%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       1    1.88445E+11   1.88445E+11   39.52   0.000
Residual Error   77   3.67196E+11   4768775127
Total            78   5.55641E+11

Page 24: Simple and multiple linear regression What should be the

24

Model 1:
– total value = 16209 + 102*(# of square feet).
– R² = 58.8%. Thus 58.8% of the variability of the total value is explained by the # of square feet.
Model 2:
– total value = -347 + 22021*(# of rooms).
– R² = 39.3%. Thus 39.3% of the variability of the total value is explained by the # of rooms.
Model 3:
– total value = 32428 + 38829*(# of bedrooms).
– R² = 33.9%. Thus 33.9% of the variability of the total value is explained by the # of bedrooms.

Page 25: Simple and multiple linear regression What should be the

25

Which one of the 3 previous models would you choose and why?

Model 1, because it has the largest value of R².

Page 26: Simple and multiple linear regression What should be the

26

1-α confidence interval for the mean of the values of Y for a specific value of X:

For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
– est. total value = 16 209 + 101.939*1 500 ≈ 169 117$

– 95% confidence interval for the mean of the total value for properties of 1500 sq.ft:

[156 418, 181 817]

as calculated by CI-regression.xls

Page 27: Simple and multiple linear regression What should be the

27

1-α confidence interval for a new value of Y (prediction) given a specific value of X:

For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
– est. total value = 16 209 + 101.939*1 500 ≈ 169 117$
– 95% confidence interval for a predicted total value when the area of the first floor is 1500 sq.ft: [59 742, 278 492]

The confidence interval for a predicted value is always wider than the one for the mean of the values of Y for a specific X.
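The two kinds of intervals on pages 26 and 27 can be obtained in a single call. The sketch below is only an illustration; it reuses the hypothetical house.csv export and column names assumed earlier.

```python
import pandas as pd
import statsmodels.formula.api as smf

house = pd.read_csv("house.csv")                       # assumed export of house.xls
model1 = smf.ols("Total ~ SqFeet", data=house).fit()

# Intervals at X = 1500 square feet
pred = model1.get_prediction(pd.DataFrame({"SqFeet": [1500]}))
frame = pred.summary_frame(alpha=0.05)

print(frame[["mean_ci_lower", "mean_ci_upper"]])   # CI for the mean total value
print(frame[["obs_ci_lower", "obs_ci_upper"]])     # wider interval for a single predicted value
```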

Page 28: Simple and multiple linear regression What should be the

28

Inference on regression model parameters:

If there is no linear link between Y and X, then β1 = 0. So, we want to test the following hypotheses:
– H0: β1 = 0 vs H1: β1 ≠ 0

We will reject H0 when the 'p-value' is too small. This test will be valid if
– the relation between X and Y is linear;
– the data are independent;
– the variance of Y is the same for every value of X;
– Y has a normal distribution for every value of X, or the sample size n is large.

Page 29: Simple and multiple linear regression What should be the

29

Multiple linear regression
It is quite likely that the variability of the dependent variable Y will be explained not by a single independent variable X, but rather by a linear combination of several independent variables X1, X2, …, Xp. In this case, the multiple regression model is given by:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Again, using the sample data, we will estimate the regression model parameters β0, β1, …, βp in order to minimize the residual (error) sum of squares.

Page 30: Simple and multiple linear regression What should be the

30

The multiple correlation coefficient R², also called the coefficient of determination, represents the percentage of the variability of Y explained by the independent variables X1, X2, …, Xp. When we add one or more independent variables to the model, R² increases. The question is whether R² increases to a significant degree. Note that we cannot have more independent variables in the model than there are observations in the sample (general rule: n ≥ 5p).

Page 31: Simple and multiple linear regression What should be the

31

Example: MODEL 1.
The regression equation is
Total = -89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms + 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl.

Predictor   Coef     StDev    T       P
Constant    -89131   18302    -4.87   0.000
Land        3.0518   0.5260   5.80    0.000
Acre        -20730   7907     -2.62   0.011
Sq.Feet     43.336   7.670    5.65    0.000
Rooms       -4352    3036     -1.43   0.156
Bedroom     10049    5307     1.89    0.062
C.Bathro    7606     3610     2.11    0.039
Bathro      18725    6585     2.84    0.006
Fire-pl.    882      3184     0.28    0.783

S = 29704   R-Sq = 88.9%   R-Sq(adj) = 87.6%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       8    4.93877E+11   61734659810   69.97   0.000
Residual Error   70   61763515565   882335937
Total            78   5.55641E+11
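A multiple regression such as this MODEL 1 is fitted the same way; the sketch below again assumes hypothetical column names for the house data (Land, Acre, SqFeet, Rooms, Bedroom, CBathro, Bathro, Firepl).

```python
import pandas as pd
import statsmodels.formula.api as smf

house = pd.read_csv("house.csv")   # assumed export of house.xls with the columns named above

formula = "Total ~ Land + Acre + SqFeet + Rooms + Bedroom + CBathro + Bathro + Firepl"
full_model = smf.ols(formula, data=house).fit()

print(full_model.summary())                              # coefficients, p-values, F test
print(full_model.rsquared, full_model.rsquared_adj)      # R-Sq and R-Sq(adj)
```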

Page 32: Simple and multiple linear regression What should be the

32

MODEL 2
Regression Analysis
The regression equation is
Total = -97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet + 4411 Bedroom + 8466 C.bathro + 14328 Bathro

Predictor   Coef     StDev    T       P
Constant    -97512   17466    -5.58   0.000
Land        3.1103   0.5236   5.94    0.000
Acre        -21880   7884     -2.78   0.007
Sq.Feet     40.195   7.384    5.44    0.000
Bedroom     4411     3469     1.27    0.208
C.bathro    8466     3488     2.43    0.018
Bathro      14328    5266     2.72    0.008

S = 29763   R-Sq = 88.5%   R-Sq(adj) = 87.6%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       6    4.91859E+11   81976430646   92.54   0.000
Residual Error   72   63782210167   885864030
Total            78   5.55641E+11

Page 33: Simple and multiple linear regression What should be the

33

MODEL 3
Regression Analysis
The regression equation is
Total = -90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet + 10234 C.bathro + 14183 Bathro

Predictor   Coef     StDev    T       P
Constant    -90408   16618    -5.44   0.000
Land        3.2045   0.5205   6.16    0.000
Acre        -22534   7901     -2.85   0.006
Sq.Feet     41.060   7.383    5.56    0.000
C.bathro    10234    3213     3.19    0.002
Bathro      14183    5287     2.68    0.009

S = 29889   R-Sq = 88.3%   R-Sq(adj) = 87.5%

Analysis of Variance
Source           DF   SS            MS            F        P
Regression       5    4.90426E+11   98085283380   109.80   0.000
Residual Error   73   65214377146   893347632
Total            78   5.55641E+11

Page 34: Simple and multiple linear regression What should be the

34

Model without the area of the land (# of acres), because of its multicollinearity with the land value.

MODEL 4
The regression equation is
Total = -55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.bathro + 18430 Bathro

Predictor   Coef     StDev    T       P
Constant    -55533   11783    -4.71   0.000
Land        1.8159   0.1929   9.42    0.000
Sq.Feet     49.833   7.028    7.09    0.000
C.bathro    11696    3321     3.52    0.001
Bathro      18430    5312     3.47    0.001

S = 31297   R-Sq = 87.0%   R-Sq(adj) = 86.3%

Analysis of Variance
Source           DF   SS            MS            F        P
Regression       4    4.83160E+11   1.20790E+11   123.32   0.000
Residual Error   74   72481137708   979474834
Total            78   5.55641E+11

Page 35: Simple and multiple linear regression What should be the

35

Which one of the 4 previous models would you choose and why?
Probably model 4, because all the independent variables are significant at the 5% level (i.e. for each β in the model, the p-value < 5%) and, although its R² is smaller, it is only marginally smaller. Moreover, all the model coefficients make « sense »!
In model 1, the variables '# of rooms' and '# of fire-places' are not statistically significant at the 5% level (p-value > 5%). The variable '# of bedrooms' is at the limit, with a p-value of 0.062.

Page 36: Simple and multiple linear regression What should be the

36

Which one of the 4 previous models would you choose and why? (continued)

In model 2, the variable '# of bedrooms' is not statistically significant at the 5% level.
In model 3 (and the previous models), the coefficient of the variable '# of acres' is negative, which is contrary to « common sense » and to what we observed in the scatter plot and in the positive Pearson correlation coefficient (r = 0.608).
In models 1 to 3, the negative coefficient for the variable '# of acres' is due to the fact that there is a strong linear relation between the value of the land and the area of the land (r = 0.918): a multicollinearity problem.

Page 37: Simple and multiple linear regression What should be the

37

Multicollinearity
If two or more explanatory variables are strongly correlated (> 0.85 in absolute value), one says that there is multicollinearity. It has an influence on the estimation of the parameters in the model. If two explanatory variables are highly correlated, one can get rid of one of them: because of the strong correlation, its contribution is no longer significant once the other variable is in the model. The correlation between several pairs of variables can be calculated in Excel using the Correlation tool in the Data Analysis toolbox.
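Outside Excel, the same pairwise correlations can be obtained with pandas; a minimal sketch, assuming the hypothetical house.csv export used in the earlier examples.

```python
import pandas as pd

house = pd.read_csv("house.csv")   # assumed export of house.xls

# Correlation matrix of the quantitative variables; off-diagonal values above
# 0.85 in absolute value signal potential multicollinearity.
corr = house.corr(numeric_only=True)
print(corr.round(3))

# Flag the strongly correlated pairs
print(corr.abs().gt(0.85) & corr.abs().lt(1.0))
```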

Page 38: Simple and multiple linear regression What should be the

38

How can we choose a particular linear regression model among all the possible ones?

There are several techniques:
Step-by-step selection, adding one variable at a time, starting with the most significant one (stepwise, forward).
Selection starting from the model in which all the variables are included and removing one variable at a time, starting with the least significant one (backward).
Constructing all possible models and choosing the best subset of variables according to certain specific criteria (e.g. adjusted R², Mallows' Cp).
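As an illustration of the backward approach (our own sketch, not the exact Minitab or Excel procedure), the loop below repeatedly drops the least significant variable until every remaining p-value is below 5%; the column names are assumed as before.

```python
import pandas as pd
import statsmodels.formula.api as smf

house = pd.read_csv("house.csv")   # assumed export of house.xls
candidates = ["Land", "Acre", "SqFeet", "Rooms", "Bedroom", "CBathro", "Bathro", "Firepl"]

# Backward elimination based on the coefficient p-values (5% threshold)
while True:
    fit = smf.ols("Total ~ " + " + ".join(candidates), data=house).fit()
    pvalues = fit.pvalues.drop("Intercept")        # ignore the constant
    worst = pvalues.idxmax()
    if pvalues[worst] <= 0.05 or len(candidates) == 1:
        break
    candidates.remove(worst)                       # drop the least significant variable

print(candidates)      # variables kept in the final model
print(fit.summary())
```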

Page 39: Simple and multiple linear regression What should be the

39

Example of selection among the best subsets:
Best Subsets Regression: Response is Total
Candidate predictors: Land, Acre, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro, Fire-pl.
(each X marks a variable included in that model)

Vars   R-Sq   Adj. R-Sq   C-p     s       Variables included
1      66.4   65.9        136.8   49262   X
1      58.8   58.2        184.7   54556   X
1      39.3   38.5        307.6   66210   X
2      82.7   82.2        35.9    35564   X X
2      78.8   78.3        60.3    39343   X X
2      74.4   73.7        88.1    43244   X X
3      85.6   85.0        19.5    32637   X X X
3      84.8   84.2        24.5    33521   X X X
3      84.8   84.2        24.9    33591   X X X
4      87.1   86.4        12.2    31115   X X X X
4      87.0   86.3        13.1    31297   X X X X
4      86.6   85.9        15.2    31682   X X X X
5      88.3   87.5        6.9     29889   X X X X X
5      87.6   86.7        11.2    30744   X X X X X
5      87.4   86.5        12.4    30979   X X X X X
6      88.5   87.6        7.3     29763   X X X X X X
6      88.3   87.3        8.6     30030   X X X X X X
6      88.3   87.3        8.9     30096   X X X X X X
7      88.9   87.8        7.1     29510   X X X X X X X
7      88.6   87.4        9.1     29924   X X X X X X X
7      88.3   87.2        10.6    30240   X X X X X X X
8      88.9   87.6        9.0     29704   X X X X X X X X

Page 40: Simple and multiple linear regression What should be the

40

Selection of the model without the variable # of acres

Best Subsets Regression: Response is Total
Candidate predictors: Land, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro, Fire-pl.
(each X marks a variable included in that model)

Vars   R-Sq   Adj. R-Sq   C-p     s       Variables included
1      66.4   65.9        120.6   49262   X
1      58.8   58.2        164.9   54556   X
1      39.3   38.5        278.3   66210   X
2      82.7   82.2        27.6    35564   X X
2      72.7   71.9        86.0    44704   X X
2      72.5   71.8        86.8    44813   X X
3      84.8   84.2        17.2    33521   X X X
3      84.8   84.2        17.6    33591   X X X
3      84.0   83.3        22.3    34467   X X X
4      87.0   86.3        6.9     31297   X X X X
4      86.1   85.3        12.1    32352   X X X X
4      85.3   84.5        16.5    33226   X X X X
5      87.3   86.4        6.9     31100   X X X X X
5      87.0   86.1        8.5     31439   X X X X X
5      87.0   86.1        8.9     31509   X X X X X
6      87.8   86.8        6.1     30707   X X X X X X
6      87.3   86.3        8.7     31264   X X X X X X
6      87.0   85.9        10.5    31656   X X X X X X
7      87.8   86.6        8.0     30908   X X X X X X X

Page 41: Simple and multiple linear regression What should be the

41

The selection of the best model is done by combining the following criteria:

The greatest value of R² adjusted for the number of variables in the model.
The smallest value of Cp.
For models with comparable adjusted R² and Cp, we will choose the model which makes the most « common sense » according to the experts in the field.
For models with comparable adjusted R² and Cp, the model whose independent variables are the easiest and least expensive to measure.
The validity of the model.

Page 42: Simple and multiple linear regression What should be the

42

1-α confidence interval for the mean of Y and for a new value of Y (prediction) given a specific combination of values for X1, X2, …, Xp.

For model 4 and a property with a land value of 65 000$, 1500 sq.ft, 2 completed bathrooms and 1 non-completed bathroom, we obtain the following point estimate:
– est. total value = -55 533 + 1.816*65 000 + 49.833*1 500 + 11 696*2 + 18 430*1 = 179 074$
– 95% confidence interval for the mean of the total value: [170 842, 187 306]
– 95% confidence interval for a predicted total value: [116 173, 241 974]

Page 43: Simple and multiple linear regression What should be the

43

Notes:
For a 1500 sq.ft property, the multiple regression model gives smaller 95% confidence intervals than the simple regression model. Therefore, adding several other variables to the model helped to better explain the variability of the total value and to improve our estimates.
If two or more independent variables are correlated, we say that there is multicollinearity. This can influence the value of the parameters in the model. Also, if two independent variables are strongly correlated, then only one of the two would be included in the model, the other one bringing very little additional information.
Certain conditions are required for the validity of the model and of the corresponding inference (similar to simple linear regression).

Page 44: Simple and multiple linear regression What should be the

44

Dummy variables

How can one take into account qualitative information in a regression?

Application: Test on two or more means

Page 45: Simple and multiple linear regression What should be the

45

Trick

If a qualitative variable takes two values, one defines a dummy variable taking the values 0 or 1. Examples:

Sex: 1 if male, 0 otherwise.
Garage: 1 if garage, 0 if not.

Page 46: Simple and multiple linear regression What should be the

46

Trick (continued)

More generally, if a qualitative variable can take m values, one defines (m-1) dummy variables, each taking the values 0 or 1.

Example: Sex and job category (executive, white-collar, blue-collar)
X1 = 1 if male, 0 otherwise.
X2 = 1 if executive, 0 otherwise.
X3 = 1 if white-collar, 0 otherwise.
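This trick is essentially what pd.get_dummies automates; here is a small sketch with made-up data showing the (m-1) 0/1 columns. In this case the alphabetical baseline chosen by drop_first happens to reproduce X1, X2 and X3 of the example.

```python
import pandas as pd

# Made-up illustration: sex (2 values) and job category (3 values)
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "job": ["executive", "white-collar", "blue-collar", "executive"],
})

# drop_first=True keeps (m-1) dummy variables per qualitative variable
dummies = pd.get_dummies(df, columns=["sex", "job"], drop_first=True, dtype=int)
print(dummies)   # columns: sex_male, job_executive, job_white-collar
```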

Page 47: Simple and multiple linear regression What should be the

47

Example
One wants to explain the salary of an employee (Y) with the following variables: sex, job category and experience.

X1 = 1 if male, 0 otherwise.
X2 = 1 if executive, 0 otherwise.
X3 = 1 if white-collar, 0 otherwise.
X4 = years of experience.

Page 48: Simple and multiple linear regression What should be the

48

Example (continued)

Regression model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Question: Interpret β0, β1, β2, β3, β4.

How do we know if women have a smaller salary?

Page 49: Simple and multiple linear regression What should be the

49

“P-value” for one-tailed tests in Excel.

In general, the p-value of a "one-tailed" alternative hypothesis H1 is not given; only the p-value of a "two-tailed" test is. For example, in regression, Excel calculates the "p-value" P corresponding to H0: βi = 0 vs H1: βi ≠ 0.

How can we calculate the p-value corresponding to a one-tailed hypothesis H1?

Page 50: Simple and multiple linear regression What should be the

50

Rules:
P: p-value for the two-tailed test.

If H1 is of the form βi > 0 and bi > 0, then the "p-value" of the right-tailed test is P/2. Otherwise it is 1 - P/2.

If H1 is of the form βi < 0 and bi < 0, then the "p-value" of the left-tailed test is P/2. Otherwise it is 1 - P/2.

In other words, the one-tailed p-value is half of the two-tailed p-value when the estimated coefficient has the sign stated in H1. Otherwise, it is 1 - P/2.
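The rule can be written as a tiny helper function; this is our own sketch, with the two-tailed P and the sign of the estimate taken from the regression output.

```python
def one_tailed_p(two_tailed_p, estimate, h1_sign):
    """Return the one-tailed p-value from a two-tailed one.

    h1_sign is +1 for H1: beta > 0 and -1 for H1: beta < 0 (our convention).
    """
    if estimate * h1_sign > 0:           # the estimate has the sign stated in H1
        return two_tailed_p / 2
    return 1 - two_tailed_p / 2

# Garage example from the next pages: P = 0.058 and b_garage = 22372 > 0
print(one_tailed_p(0.058, 22372, +1))    # 0.029
```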

Page 51: Simple and multiple linear regression What should be the

51

Question:

One wants to know if having a garage increases the total value of the property. The hypotheses to be tested should be:

H0: βgarage ≤ 0 vs H1: βgarage > 0

Since bgarage = 22372 > 0, the p-value corresponding to H1: βgarage > 0 is 0.058/2 = 0.029 < 0.05. The answer is yes, because we accept H1.
Does the decision depend on the coding?

Page 52: Simple and multiple linear regression What should be the

52

If the dummy is defined as 0 if there is a garage and 1 otherwise, we would have obtained:

Total = -72080 + 1.83 Land + 47.2 Sq.Feet + 11535 C.bathro + 18899 Bathro - 22372 Garage

Predictor   Coef     StDev    T       P
Constant    -72080   14175    -5.08   0.000
Land        1.8342   0.1892   9.69    0.000
Sq.Feet     47.175   7.013    6.73    0.000
C.bathro    11535    3256     3.54    0.001
Bathro      18899    5211     3.63    0.001
Garage      -22372   11116    -2.01   0.058

S = 30671   R-Sq = 87.6%   R-Sq(adj) = 86.8%

Page 53: Simple and multiple linear regression What should be the

53

In that case, the right choice for hypotheses would have been:

H0: βgarage ≥ 0 vs H1: βgarage < 0

The corresponding p-value stays 0.058/2 = 0.029, because bgarage = -22372 < 0 has the sign stated for βgarage in H1.

Page 54: Simple and multiple linear regression What should be the

54

Comparison of several means

Suppose one wants to compare the respective means of a quantitative variable Y for two groups: µ1 = mean of group 1, µ2 = mean of group 2.

One can use regression by defining X = 1 for group 1 and X = 0 for group 2.

In this case, the coefficient β of X is equal to µ1 – µ2 (and the intercept is equal to µ2).
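A quick numerical check of this equivalence, on our own simulated data: regressing Y on the 0/1 group indicator gives the same t statistic and p-value as the usual two-sample t test with equal variances.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(10.0, 2.0, 15)     # simulated group 1
group2 = rng.normal(8.5, 2.0, 15)      # simulated group 2

y = np.concatenate([group1, group2])
x = np.concatenate([np.ones(15), np.zeros(15)])    # X = 1 for group 1, 0 for group 2

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params[1])                   # slope = difference of the sample means
print(fit.tvalues[1], fit.pvalues[1])  # same t and p-value as the t test below

print(stats.ttest_ind(group1, group2, equal_var=True))
```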

Page 55: Simple and multiple linear regression What should be the

55

The hypothesis H1: µ1 > µ2 corresponds to H1: β > 0.

The hypothesis H1: µ1 < µ2 corresponds to H1: β < 0.

The hypothesis H1: µ1 ≠ µ2 corresponds to H1: β ≠ 0.

Page 56: Simple and multiple linear regression What should be the

56

Example
A manager has some doubts about the (positive) effects of a course intended to improve the speed at which a given task is performed by employees.

To confirm his belief, he asked a technician to choose 10 employees at random and to measure the time (in hours) needed to complete a task.

Then the same employees attended the course. After the course, the employees had to perform a similar task. The results are summarized in the following table: manager.xls

Page 57: Simple and multiple linear regression What should be the

57

Questions:

a) Should the company maintain the training program? Take α = 5%.
b) The technician in charge of the measurements forgot to identify the employees on the measurement form. What is the conclusion using that data set?
Unfortunately, case b) is based on a real case.

Page 58: Simple and multiple linear regression What should be the

58

Solution

For situation a), the data are paired and we have to check whether the differences « Before – After » are significantly positive. The p-value is 0.0003 < 0.05 = α. One accepts H1, and the manager concludes that the program should be maintained.

Page 59: Simple and multiple linear regression What should be the

59

In the second case, the data are not paired. One can use regression with Y = time of execution, and X = 1 for measurements taken before the course and X = 0 for measurements taken after the course.

In that case, the right choice for H1 is: H1: β > 0. The results are given by:

            Coefficients   Standard Error   t Stat       P-value
Intercept   5.217          0.129989316      40.1340677   4.5838E-19
X           0.244          0.183832653      1.32729412   0.20100167

Page 60: Simple and multiple linear regression What should be the

60

Since H1: β > 0 (which is equivalent to H1: µbefore > µafter), and b = 0.244 > 0, the p-value is 0.201/2 = 0.1005 > 0.05.

One accepts H0, so the training program shouldn't be maintained.

This is a very good example of the consequences of the greater variability of two independent samples compared to a paired sample.

Page 61: Simple and multiple linear regression What should be the

61

Remark: Comparing several means

If one needs to compare the means of k groups for some variable Y, one can also use regression.
For i = 1, 2, …, k-1, set:
Xi = 1 for group i, 0 otherwise.
Then β0 = mean of group k = µk and βi = µi - µk, 1 ≤ i ≤ k-1.

Page 62: Simple and multiple linear regression What should be the

62

Therefore, the regression test where H0 is given by

H0: β1 = β2 = ... = βk-1 = 0

is equivalent to a test where H0 is given by
H0: µ1 = µ2 = ... = µk

If H0 is rejected, then we conclude that at least two means are different.
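This last equivalence can also be checked numerically: the overall F test of the regression on the (k-1) dummy variables matches the one-way ANOVA F test. A minimal sketch with three simulated groups (our own data).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
groups = {"A": rng.normal(5.0, 1.0, 12),
          "B": rng.normal(5.5, 1.0, 12),
          "C": rng.normal(6.0, 1.0, 12)}

df = pd.DataFrame({"y": np.concatenate(list(groups.values())),
                   "group": np.repeat(list(groups.keys()), 12)})

# Regression on (k-1) dummies: overall F test of H0: beta1 = ... = beta_{k-1} = 0
fit = smf.ols("y ~ C(group)", data=df).fit()
print(fit.fvalue, fit.f_pvalue)

# One-way ANOVA of H0: mu1 = mu2 = mu3 gives the same F statistic and p-value
print(stats.f_oneway(groups["A"], groups["B"], groups["C"]))
```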