MBA Statistics 51-651-00
COURSE #4
Simple and multiple linear regression
What should be the sales of ice cream?
2
Example:
Before starting to build a movie theater, one must estimate the daily number of people entering the building.
How can we estimate it?
There are 2 million inhabitants in the city.
3
Possible solutions:
One could conduct a local market study. However, it is often imprecise, especially for new projects.
One could get data from similar projects in other cities.
4
City                  1    2    3    4    5    6    7    8    9   10
Attendance (x1000)   10   12    8   10   14   20   30   16    4   12
What do you think?
Can we do better?
5
Probably, by taking into account the size of the city.
City                  1     2     3     4     5     6     7     8     9    10
Attendance (x1000)   10    12     8    10    14    20    30    16     4    12
Size (millions)    0.70  0.90  0.50  0.75  1.40  1.50  2.30  1.40  0.25  0.95
[Figure: scatter plot of attendance (x1000, vertical axis from 0 to 35) against city size (millions, horizontal axis from 0.00 to 2.50)]
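As a minimal sketch of where this example is heading, the attendance of the ten cities can be regressed on city size by least squares and the fitted line evaluated at 2 million inhabitants. The helper name `fit_line` is hypothetical, and the computation assumes that a straight line is an adequate description of these ten points.

```python
# City data copied from the table of this example.
size = [0.70, 0.90, 0.50, 0.75, 1.40, 1.50, 2.30, 1.40, 0.25, 0.95]  # millions
attendance = [10, 12, 8, 10, 14, 20, 30, 16, 4, 12]                   # x1000

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line(size, attendance)
predicted = b0 + b1 * 2.0  # the city of interest has 2 million inhabitants

print(f"attendance (x1000) = {b0:.2f} + {b1:.2f} * size (millions)")
print(f"predicted attendance (x1000) for a city of 2 million: {predicted:.1f}")
```

Note that 2 million lies near the upper end of the observed sizes (the largest is 2.30), so the prediction relies on only a few comparable cities.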
6
Case study: Ice Cream Sales
The file icecream.xls contains pairs of data representing ice cream sales and the temperature recorded that day, for 30 days.
Is there a relation between temperature and sales?
Can temperature be used to predict ice cream sales?
If so, what is the prediction when the temperature is 25?
7
Introduction
One of the principal objectives of statistics is to explain the variability that we observe in data.
Linear regression (or linear models) is a widely used statistical tool for studying the presence of a linear relation between a dependent variable Y (quantitative and continuous) and one or more variables X1, X2, …, Xp (qualitative and/or quantitative), called independent or explanatory variables.
8
For example, a manager could be interested in seeing whether he can explain a good part of the variability that he observes in the sales of his different branches (dependent variable Y) over the last 12 months by the area, the number of employees, the number of paid overtime hours, the quality of customer service, the number of promotions, etc. (independent or explanatory variables).
9
A regression model can be used to meet one of the following three objectives:
• Describe (data coming from non-experimental studies, i.e. we observe reality as it is).
• Examine hypotheses (data coming from controlled experimental studies).
• Predict (if we like to take risks!).
10
Example:
• We are interested in knowing which important factors influence or determine the value of a property, and we want to build a model that would help us evaluate this value using certain factors.
• To do this, we have obtained the total value for a sample of 79 properties in a given region. The following variables have also been collected for each property:
11
Brief glimpse of the data file: house.xls

       total    land    # of    # of square feet   outdoor     heating
OBS    value   value   acres    (first floor)      condition   type
  1   199657   63247   1.63          1726          Good        NatGas
  2    78482   38091   0.495         1184          Good        NatGas
  3   119962   37665   0.375         1014          Good        Electric
  4   116492   54062   0.981         1260          Average     Electric
  5   131263   61546   1.14          1314          Average     NatGas
...
 78   253480   57948   0.862         1720          Good        Electric
 79   257037   57489   0.95          2004          Excellnt    Electric

       # of    # of       # of completed   # of non-completed   # of
OBS   rooms   bedrooms    bathrooms        bathrooms            fireplaces   GARAGE
  1      8       4             2                  1                  2       Garage
  2      6       2             1                  0                  0       NoGarage
  3      7       3             2                  0                  1       Garage
  4      6       3             2                  0                  1       Garage
  5      8       4             2                  1                  2       NoGarage
...
 78     10       5             5                  1                  1       Garage
 79      9       4             2                  2                  2       Garage
12
Is there a link between the total value and the different factors?
[Figure: scatter plot of Total value (50000 to 450000) against Land value (40000 to 140000)]
13
[Figures: scatter plots of Total value (50000 to 450000) against Sq.Feet, Acre, Bedroom and Rooms]
14
[Figures: plots of Total value (50000 to 450000) against Bathrooms, Completed Bathrooms, Fire-place and Garage (NoGarage / Garage)]
15
The Pearson correlation coefficient r measures the strength of the linear relation between two quantitative variables.
The correlation coefficient r takes its values between -1 and 1.
If a perfect linear relation exists between X and Y, then r = ±1 (r = 1 if X and Y vary in the same direction and r = -1 if X and Y vary in opposite directions).
If r = 0, there is no linear link between X and Y.
The further r moves away from 0 and the closer it gets to ±1, the stronger the linear link between X and Y.
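The definition can be illustrated with a short sketch that computes r from its usual formula, r = Sxy / sqrt(Sxx * Syy). The function name `pearson_r` is hypothetical; the data are the city sizes and attendances from the movie theater example earlier in the course.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Movie theater example: city size (millions) and attendance (x1000).
size = [0.70, 0.90, 0.50, 0.75, 1.40, 1.50, 2.30, 1.40, 0.25, 0.95]
attendance = [10, 12, 8, 10, 14, 20, 30, 16, 4, 12]

r = pearson_r(size, attendance)
r_up = pearson_r([1, 2, 3], [2, 4, 6])    # points exactly on an increasing line
r_down = pearson_r([1, 2, 3], [6, 4, 2])  # points exactly on a decreasing line

print(f"size vs attendance: r = {r:.3f}")
print(f"increasing line: r = {r_up:.1f}, decreasing line: r = {r_down:.1f}")
```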
16
[Figure: three scatter plots of Y against X (X from 4 to 14). First panel: r = 0.035, points scattered with no linear pattern. Second panel: r = 1, points exactly on an increasing line. Third panel: r = -1, points exactly on a decreasing line.]
17
Descriptive statistics
Variable    N      Mean    Median   St.Deviation   Minimum   Maximum
Total      79    187253    156761          84401     74365    453744
Land       79     65899     59861          22987     35353    131224
Acre       79     1.579     1.040          1.324     0.290     5.880
Sq.Feet    79      1678      1628            635       672      3501
Rooms      79     8.519     8.000          2.401         5        18
Bedrooms   79     3.987     4.000          1.266         2         8
C.Bathro   79     2.241     2.000          1.283         1         7
Bathro     79    0.7215     1.000          0.715         0         3
Fire-pl.   79     1.975     2.000          1.368         0         7

Pearson Correlation Coefficients
            Total    Land    Acre  Sq.Feet   Rooms  Bedroom  C.Bathro  Bathro
Land        0.815
Acre        0.608   0.918
Sq.Feet     0.767   0.516   0.301
Rooms       0.626   0.518   0.373    0.563
Bedrooms    0.582   0.497   0.382    0.431   0.791
C.Bathro    0.626   0.506   0.376    0.457   0.479    0.586
Bathro      0.436   0.236   0.074    0.354   0.489    0.166     0.172
Fire-pl.    0.548   0.497   0.391    0.365   0.394    0.400     0.486   0.386
18
Be careful! It is important to interpret the correlation coefficient together with the graph.
r = 0.816 in all cases below
[Figure: four scatter plots (Y1, Y2, Y3 and Y4 against X) that look very different from one another; the X axis runs from 4 to 14 in the first three panels and shows only the values 8 and 19 in the fourth]
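The four panels look like Anscombe's quartet, a classic data set built to make exactly this point. That identification is an assumption, since the slide does not name its source. Under that assumption, the sketch below checks that the four very different clouds share essentially the same correlation; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (assumed to be the data behind the four panels).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"Y1": (x123, y1), "Y2": (x123, y2), "Y3": (x123, y3), "Y4": (x4, y4)}
correlations = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

The numbers alone cannot distinguish a linear cloud from a curve or from a single influential point, which is why the graph has to be examined as well.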
19
Simple linear regression
To describe a linear relation between two quantitative variables, or to be able to predict Y for a given value of X, we use a regression line:
Y = β0 + β1X + ε
Since any statistical model is only an approximation (the best possible one, we hope!) and because the linear link is never perfect, the model always contains an error term, denoted ε.
If there were a perfect linear relation between Y and X, the error term would always be equal to 0, and all the variability of Y would be explained by the independent variable X.
20
So, for a given value of X, we would like to estimate Y. Using the sample data, we estimate the regression model parameters β0 and β1 so as to minimize the sum of squared residuals (errors).
The squared correlation coefficient is called the coefficient of determination; it is the percentage of the variability of Y explained by X:
R² = 1 - [(n-2)/(n-1)] (Se/Sy)²,
where Se is the standard deviation of the errors and Sy is the standard deviation of Y.
21
We can also use the adjusted coefficient of determination to indicate the percentage of the variability of Y explained by X:
adjusted R² = 1 - (Se/Sy)².
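As a quick numerical check of these two formulas, the sketch below plugs in values reported elsewhere in this course for the regression of Total on Sq.Feet: n = 79 properties, S = 54556 (taken here as Se) and the standard deviation of Total, 84401 (taken as Sy). The variable names are illustrative only.

```python
n = 79        # number of properties in the sample
s_e = 54556   # S from the regression output of Total on Sq.Feet
s_y = 84401   # standard deviation of Total in the descriptive statistics

ratio_sq = (s_e / s_y) ** 2

# Coefficient of determination and its adjusted version, as defined above.
r2 = 1 - (n - 2) / (n - 1) * ratio_sq
r2_adjusted = 1 - ratio_sq

print(f"R-Sq = {100 * r2:.1f}%   R-Sq(adj) = {100 * r2_adjusted:.1f}%")
```

Both values agree, after rounding, with the R-Sq and R-Sq(adj) printed in the output of that regression.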
22
Simple linear regression example:
MODEL 1.
Regression Analysis
The regression equation is
Total = 16209 + 102 Sq.Feet

Predictor       Coef    StDev       T       P
Constant       16209    17447    0.93   0.356
Sq.Feet      101.939    9.734   10.47   0.000

S = 54556   R-Sq = 58.8%   R-Sq(adj) = 58.2%

Analysis of Variance
Source            DF            SS            MS        F       P
Regression         1   3.26460E+11   3.26460E+11   109.68   0.000
Residual Error    77   2.29181E+11    2976374177
Total             78   5.55641E+11
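The columns of this output are linked by simple arithmetic, which the following sketch reproduces from the printed numbers: T is the coefficient divided by its standard deviation, MS is SS divided by DF, F is the ratio of the two mean squares, S is the square root of the residual mean square, and R-Sq is the regression SS over the total SS. Small discrepancies are expected because the printed sums of squares are rounded.

```python
import math

# Values read from the MODEL 1 output (Total regressed on Sq.Feet).
coef, stdev = 101.939, 9.734
ss_reg, ss_err, ss_tot = 3.26460e11, 2.29181e11, 5.55641e11
df_reg, df_err = 1, 77

t_stat = coef / stdev       # T column for Sq.Feet
ms_reg = ss_reg / df_reg    # mean square, regression
ms_err = ss_err / df_err    # mean square, residual error
f_stat = ms_reg / ms_err    # F statistic
s = math.sqrt(ms_err)       # S, the standard deviation of the errors
r_sq = ss_reg / ss_tot      # R-Sq

print(f"T = {t_stat:.2f}, F = {f_stat:.2f}, S = {s:.0f}, R-Sq = {100 * r_sq:.1f}%")
```

With a single explanatory variable, F is the square of T, so the two tests give the same p-value.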
23
MODEL 2.
The regression equation is: Total = - 347 + 22021 Rooms

Predictor     Coef    StDev       T       P
Constant      -347    27621   -0.01   0.990
Rooms        22021     3122    7.05   0.000

S = 66210   R-Sq = 39.3%   R-Sq(adj) = 38.5%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         1   2.18090E+11   2.18090E+11   49.75   0.000
Residual Error    77   3.37551E+11    4383775699
Total             78   5.55641E+11

MODEL 3.
The regression equation is: Total = 32428 + 38829 Bedrooms

Predictor     Coef    StDev       T       P
Constant     32428    25826    1.26   0.213
Bedrooms     38829     6177    6.29   0.000

S = 69056   R-Sq = 33.9%   R-Sq(adj) = 33.1%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         1   1.88445E+11   1.88445E+11   39.52   0.000
Residual Error    77   3.67196E+11    4768775127
Total             78   5.55641E+11
24
Model 1:
– total value = 16209 + 102*(# of square feet).
– R² = 58.8%. Thus 58.8% of the variability of the total value is explained by the # of square feet.
Model 2:
– total value = -347 + 22021*(# of rooms).
– R² = 39.3%. Thus 39.3% of the variability of the total value is explained by the # of rooms.
Model 3:
– total value = 32428 + 38829*(# of bedrooms).
– R² = 33.9%. Thus 33.9% of the variability of the total value is explained by the # of bedrooms.
25
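Which one of the 3 previous models would you choose and why?
Model 1, because it has the largest value of R² (the three models each have a single explanatory variable, so their R² values are directly comparable).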
26
1-α confidence interval for the mean of the values of Y for a specific value of X:
For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
– est. total value = 16 209 + 101.939*1500 = 169 117$
– 95% confidence interval for the mean of the total value for properties of 1500 sq.ft:
[156 418, 181 817]
as calculated by CI-regression.xls
27
1-α confidence interval for a new value of Y (prediction), given a specific value of X:
For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
– est. total value = 16 209 + 101.939*1500 = 169 117$
– 95% confidence interval for a predicted total value when the area of the first floor is 1500 sq.ft: [59 742, 278 492]
The confidence interval for a predicted value is always wider than the one for the mean of the values of Y at the same X.
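The two intervals can be approximately reproduced from summary numbers alone, using the textbook formulas for simple regression: the standard error of the mean response is S*sqrt(1/n + (x0 - mean of X)²/Sxx) and the standard error of a prediction is S*sqrt(1 + 1/n + (x0 - mean of X)²/Sxx). This is only a sketch, not the content of CI-regression.xls: the critical value 1.991 (Student t, 97.5%, 77 degrees of freedom) is an assumed table value, and Sxx is rebuilt from the rounded standard deviation of Sq.Feet, so the bounds match the slide only up to rounding.

```python
import math

n = 79
b0, b1 = 16209, 101.939    # fitted coefficients of model 1
s = 54556                  # S from the model 1 output
x_mean, x_sd = 1678, 635   # mean and standard deviation of Sq.Feet
t_crit = 1.991             # assumed t quantile (97.5%, 77 degrees of freedom)

x0 = 1500
y_hat = b0 + b1 * x0

sxx = (n - 1) * x_sd ** 2                    # sum of squared deviations of X
leverage = 1 / n + (x0 - x_mean) ** 2 / sxx
se_mean = s * math.sqrt(leverage)            # standard error of the mean response
se_pred = s * math.sqrt(1 + leverage)        # standard error of a new observation

ci_mean = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
ci_pred = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print(f"point estimate: {y_hat:.0f}")
print(f"95% CI for the mean: [{ci_mean[0]:.0f}, {ci_mean[1]:.0f}]")
print(f"95% CI for a prediction: [{ci_pred[0]:.0f}, {ci_pred[1]:.0f}]")
```

The extra "1 +" under the square root is what makes the prediction interval wider: it accounts for the variability of an individual property around the mean.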
28
Inference on regression model parameters:
If there is no linear link between Y and X, then β1 = 0. So we want to test the following hypotheses:
– H0 : β1 = 0 vs H1 : β1 ≠ 0
We reject H0 when the p-value is too small (smaller than the chosen level α).
This test is valid if
– the relation between X and Y is linear;
– the data are independent;
– the variance of Y is the same for every value of X;
– Y has a normal distribution for every value of X, or the sample size n is large.
29
Multiple linear regression
It is more likely that the variability of the dependent variable Y will be explained not by a single independent variable X, but rather by a linear combination of several independent variables X1, X2, …, Xp. In this case, the multiple regression model is given by:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Again, using the sample data, we estimate the regression model parameters β0, β1, …, βp so as to minimize the sum of squared residuals (errors).
30
The coefficient of determination R² (the square of the multiple correlation coefficient) represents the percentage of the variability of Y explained by the independent variables X1, X2, …, Xp.
When we add one or more independent variables to the model, R² increases (it can never decrease).
The question is whether R² increases to a significant degree.
Note that we cannot have more independent variables in the model than there are observations in the sample (general rule: n ≥ 5p).
31
Example:
MODEL 1.
The regression equation is
Total = - 89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms
        + 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl.

Predictor      Coef    StDev       T       P
Constant     -89131    18302   -4.87   0.000
Land         3.0518   0.5260    5.80   0.000
Acre         -20730     7907   -2.62   0.011
Sq.Feet      43.336    7.670    5.65   0.000
Rooms         -4352     3036   -1.43   0.156
Bedroom       10049     5307    1.89   0.062
C.Bathro       7606     3610    2.11   0.039
Bathro        18725     6585    2.84   0.006
Fire-pl.        882     3184    0.28   0.783

S = 29704   R-Sq = 88.9%   R-Sq(adj) = 87.6%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         8   4.93877E+11   61734659810   69.97   0.000
Residual Error    70   61763515565     882335937
Total             78   5.55641E+11
32
MODEL 2
Regression Analysis
The regression equation is
Total = - 97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet
        + 4411 Bedroom + 8466 C.bathro + 14328 Bathro

Predictor      Coef    StDev       T       P
Constant     -97512    17466   -5.58   0.000
Land         3.1103   0.5236    5.94   0.000
Acre         -21880     7884   -2.78   0.007
Sq.Feet      40.195    7.384    5.44   0.000
Bedroom        4411     3469    1.27   0.208
C.bathro       8466     3488    2.43   0.018
Bathro        14328     5266    2.72   0.008

S = 29763   R-Sq = 88.5%   R-Sq(adj) = 87.6%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         6   4.91859E+11   81976430646   92.54   0.000
Residual Error    72   63782210167     885864030
Total             78   5.55641E+11
33
MODEL 3
Regression Analysis
The regression equation is
Total = - 90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet
        + 10234 C.bathro + 14183 Bathro

Predictor      Coef    StDev       T       P
Constant     -90408    16618   -5.44   0.000
Land         3.2045   0.5205    6.16   0.000
Acre         -22534     7901   -2.85   0.006
Sq.Feet      41.060    7.383    5.56   0.000
C.bathro      10234     3213    3.19   0.002
Bathro        14183     5287    2.68   0.009

S = 29889   R-Sq = 88.3%   R-Sq(adj) = 87.5%

Analysis of Variance
Source            DF            SS            MS        F       P
Regression         5   4.90426E+11   98085283380   109.80   0.000
Residual Error    73   65214377146     893347632
Total             78   5.55641E+11
34
Model without the area of the land (# of acres), because of its multicollinearity with the land value.
MODEL 4
The regression equation is
Total = - 55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.bathro
        + 18430 Bathro

Predictor      Coef    StDev       T       P
Constant     -55533    11783   -4.71   0.000
Land         1.8159   0.1929    9.42   0.000
Sq.Feet      49.833    7.028    7.09   0.000
C.bathro      11696     3321    3.52   0.001
Bathro        18430     5312    3.47   0.001

S = 31297   R-Sq = 87.0%   R-Sq(adj) = 86.3%

Analysis of Variance
Source            DF            SS            MS        F       P
Regression         4   4.83160E+11   1.20790E+11   123.32   0.000
Residual Error    74   72481137708     979474834
Total             78   5.55641E+11
35
Which one of the 4 previous models would you choose and why?
Probably model 4, because all the independent variables are significant at the 5% level (i.e. for each β in the model, p-value < 5%) and, although R² is smaller, it is only marginally smaller. Moreover, all the model coefficients make « sense »!
In model 1, the variables ‘# of rooms’ and ‘# of fire-places’ are not statistically significant at the 5% level (p-value > 5%). The variable ‘# of bedrooms’ is borderline, with a p-value = 0.0624.
36
Which one of the 4 previous models would you choose and why? (continued)
In model 2, the variable ‘# of bedrooms’ is not statistically significant at the 5% level.
In model 3 (and the previous models), the coefficient of the variable ‘# of acres’ is negative, which is contrary to « common sense » and to what we observed in the scatter plot and in the positive Pearson correlation coefficient (r = 0.608).
In models 1 to 3, the negative coefficient for the variable ‘# of acres’ is due to the strong linear relation between the value of the land and the area of the land (r = 0.918): a multicollinearity problem.
37
Multicollinearity
If two or more explanatory variables are strongly correlated (> 0.85 in absolute value), one says that there is multicollinearity. It influences the estimation of the parameters in the model.
If two explanatory variables are highly correlated, one of them can be dropped: because of the strong correlation, its additional contribution once the other variable is in the model is not significant.
The correlations between several pairs of variables can be calculated in Excel using Correlation in the Data Analysis toolbox.
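A minimal sketch of this screening rule, applied to the Pearson correlations among the explanatory variables of the house example (copied from the correlation table earlier in the course, leaving out the Total column). The 0.85 threshold is the one stated above; the list layout and variable names are illustrative choices.

```python
names = ["Land", "Acre", "Sq.Feet", "Rooms", "Bedrooms",
         "C.Bathro", "Bathro", "Fire-pl."]

# Lower triangle of the correlation matrix among explanatory variables:
# row i holds the correlations of names[i] with names[0], ..., names[i-1].
lower = [
    [],                                                   # Land
    [0.918],                                              # Acre
    [0.516, 0.301],                                       # Sq.Feet
    [0.518, 0.373, 0.563],                                # Rooms
    [0.497, 0.382, 0.431, 0.791],                         # Bedrooms
    [0.506, 0.376, 0.457, 0.479, 0.586],                  # C.Bathro
    [0.236, 0.074, 0.354, 0.489, 0.166, 0.172],           # Bathro
    [0.497, 0.391, 0.365, 0.394, 0.400, 0.486, 0.386],    # Fire-pl.
]

THRESHOLD = 0.85

# Collect every pair whose correlation exceeds the threshold in absolute value.
flagged = [
    (names[j], names[i], r)
    for i, row in enumerate(lower)
    for j, r in enumerate(row)
    if abs(r) > THRESHOLD
]

for first, second, r in flagged:
    print(f"{first} and {second} are strongly correlated (r = {r})")
```

Only the pair Land and Acre exceeds the threshold, which is consistent with the decision to drop # of acres in model 4.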
38
How can we choose a particular linear regression model among all the possible ones?
There are several techniques:
• Step-by-step selection, adding one variable at a time, starting with the most significant one (stepwise, forward).
• Selection starting from the model in which all the variables are included and removing one variable at a time, starting with the least significant (backward).
• Constructing all possible models and choosing the best subset of variables according to certain specific criteria (e.g. adjusted R², Mallows' Cp).
39
Example of selection among the best subsets:
Best Subsets Regression: Response is Total
Candidate variables: Land, Acre, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro, Fire-pl.
(In the software output, an X marks each variable included in a subset; the positions of the X marks are not reproduced here. The three best subsets of each size are listed.)

Vars   R-Sq   Adj. R-Sq     C-p       s
   1   66.4        65.9   136.8   49262
   1   58.8        58.2   184.7   54556
   1   39.3        38.5   307.6   66210
   2   82.7        82.2    35.9   35564
   2   78.8        78.3    60.3   39343
   2   74.4        73.7    88.1   43244
   3   85.6        85.0    19.5   32637
   3   84.8        84.2    24.5   33521
   3   84.8        84.2    24.9   33591
   4   87.1        86.4    12.2   31115
   4   87.0        86.3    13.1   31297
   4   86.6        85.9    15.2   31682
   5   88.3        87.5     6.9   29889
   5   87.6        86.7    11.2   30744
   5   87.4        86.5    12.4   30979
   6   88.5        87.6     7.3   29763
   6   88.3        87.3     8.6   30030
   6   88.3        87.3     8.9   30096
   7   88.9        87.8     7.1   29510
   7   88.6        87.4     9.1   29924
   7   88.3        87.2    10.6   30240
   8   88.9        87.6     9.0   29704
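The C-p column can be recomputed from the analysis of variance tables of the four multiple regression models shown earlier, using the usual definition of Mallows' Cp: Cp = SSE_p / MSE_full - (n - 2(p + 1)), where SSE_p is the residual sum of squares of a model with p variables and MSE_full is the residual mean square of the model containing all 8 variables. The sketch assumes this is the definition used by the software; the function name `mallows_cp` is hypothetical.

```python
n = 79
mse_full = 882335937   # residual MS of the model with all 8 variables

# (number of variables, residual sum of squares) from the ANOVA tables.
models = {
    "model 1": (8, 61763515565),
    "model 2": (6, 63782210167),
    "model 3": (5, 65214377146),
    "model 4": (4, 72481137708),
}

def mallows_cp(sse_p, p, n, mse_full):
    """Mallows' Cp for a model with p explanatory variables plus a constant."""
    return sse_p / mse_full - (n - 2 * (p + 1))

cp = {name: mallows_cp(sse, p, n, mse_full) for name, (p, sse) in models.items()}

for name, value in cp.items():
    print(f"{name}: Cp = {value:.1f}")
```

A model with little bias has Cp close to p + 1; the full model always has Cp equal to p + 1 by construction.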
40
Selection of the model without the variable # of acres
Best Subsets Regression: Response is Total
Candidate variables: Land, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro, Fire-pl.
(In the software output, an X marks each variable included in a subset; the positions of the X marks are not reproduced here. The three best subsets of each size are listed.)

Vars   R-Sq   Adj. R-Sq     C-p       s
   1   66.4        65.9   120.6   49262
   1   58.8        58.2   164.9   54556
   1   39.3        38.5   278.3   66210
   2   82.7        82.2    27.6   35564
   2   72.7        71.9    86.0   44704
   2   72.5        71.8    86.8   44813
   3   84.8        84.2    17.2   33521
   3   84.8        84.2    17.6   33591
   3   84.0        83.3    22.3   34467
   4   87.0        86.3     6.9   31297
   4   86.1        85.3    12.1   32352
   4   85.3        84.5    16.5   33226
   5   87.3        86.4     6.9   31100
   5   87.0        86.1     8.5   31439
   5   87.0        86.1     8.9   31509
   6   87.8        86.8     6.1   30707
   6   87.3        86.3     8.7   31264
   6   87.0        85.9    10.5   31656
   7   87.8        86.6     8.0   30908
41
The selection of the best model is based on a combination of:
• The greatest value of R² adjusted for the number of variables in the model.
• The smallest value of Cp.
• Among models with comparable adjusted R² and Cp, the model that makes the most « common sense » according to the experts in the field.
• Among models with comparable adjusted R² and Cp, the model whose independent variables are the easiest and least expensive to measure.
• The validity of the model.
42
1-α confidence interval for the mean of Y and for a new value of Y (prediction), given a specific combination of values for X1, X2, …, Xp.
For model 4 and a property with land value = 65 000$, sq.ft = 1500, 2 completed bathrooms and 1 non-completed bathroom, we obtain the following point estimate:
– est. total value = -55 533 + 1.816*65 000 + 49.833*1 500 + 11 696*2 + 18 430*1 = 179 074$
– 95% confidence interval for the mean of the total value: [170 842, 187 306]
– 95% confidence interval for a predicted total value: [116 173, 241 974]
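The point estimate is just the fitted equation evaluated at the chosen values. A small sketch, using the coefficients printed in the MODEL 4 output (the dictionary keys are illustrative names); the result differs from 179 074$ by a few dollars because the printed coefficients are rounded.

```python
# Coefficients from the MODEL 4 output.
coefficients = {
    "Constant": -55533,
    "Land": 1.8159,
    "Sq.Feet": 49.833,
    "C.bathro": 11696,
    "Bathro": 18430,
}

# Property of interest: land value 65 000$, 1500 sq.ft,
# 2 completed bathrooms and 1 non-completed bathroom.
prop = {"Land": 65000, "Sq.Feet": 1500, "C.bathro": 2, "Bathro": 1}

estimate = coefficients["Constant"] + sum(
    coefficients[name] * value for name, value in prop.items()
)

print(f"estimated total value: {estimate:.0f}$")
```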
43
Notes:
• For a 1500 sq.ft property, the multiple regression model gives narrower 95% confidence intervals than the simple regression model. The addition of several other variables to the model therefore helped to better explain the variability of the total value and to improve our estimates.
• If two or more independent variables are correlated, we say that there is multicollinearity. This can influence the values of the parameters in the model. Also, if two independent variables are strongly correlated, then only one of the two should be included in the model, the other one bringing very little additional information.
• Certain conditions are required for the validity of the model and of the corresponding inference (similar to those of simple linear regression).
44
Dummy variables
How can one take into account qualitative information in a regression?
Application: Test on two or more means
45
Trick
If a qualitative variable takes two values, one defines one dummy variable taking the values 0 or 1.
Examples:
Sex: 1 if male, 0 otherwise.
Garage: 1 if garage, 0 if not.
46
Trick (continued)
More generally, if a qualitative variable can take m values, one defines (m-1) dummy variables, all taking the values 0 or 1.
Example: Sex and job category (executive, white-collar, blue-collar)
X1 = 1 if male, 0 otherwise.
X2 = 1 if executive, 0 otherwise.
X3 = 1 if white-collar, 0 otherwise.
47
Example
One wants to explain the salary of an employee (Y) with the following variables: sex, job category and experience.
X1 = 1 if male, 0 otherwise.
X2 = 1 if executive, 0 otherwise.
X3 = 1 if white-collar, 0 otherwise.
X4 = years of experience.
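The coding above can be sketched as a small function. The name `encode_employee` and the category labels passed as strings are hypothetical; the point is that the job category, which has 3 values, needs only 2 dummy variables, the blue-collar group being the reference (X2 = X3 = 0).

```python
def encode_employee(sex, job, years_of_experience):
    """Return (X1, X2, X3, X4) for the salary regression example."""
    x1 = 1 if sex == "male" else 0
    x2 = 1 if job == "executive" else 0
    x3 = 1 if job == "white-collar" else 0
    x4 = years_of_experience
    return x1, x2, x3, x4

# A few illustrative employees (made-up values).
print(encode_employee("female", "blue-collar", 5))
print(encode_employee("male", "executive", 12))
print(encode_employee("female", "white-collar", 3))
```

With this coding, each dummy coefficient is read as a difference with respect to the reference category, all other variables being held fixed.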
48
Example (continued)
Regression model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
Question: Interpret β0, β1, β2, β3, β4.
How do we know if women have a smaller salary?
49
“P-value” for one-tailed tests in Excel.
In general, the p-value of a “one-tailed” test is not reported, only the p-value of the “two-tailed” test. For example, in regression, Excel calculates the p-value P corresponding to H0 : βi = 0 vs H1 : βi ≠ 0.
How can we calculate the p-value corresponding to a one-tailed hypothesis H1?
50
Rules:
P: p-value for the two-tailed test.
If H1 is of the form βi > 0 and bi > 0, then the p-value of the right-tailed test is P/2. Otherwise it is 1 - P/2.
If H1 is of the form βi < 0 and bi < 0, then the p-value of the left-tailed test is P/2. Otherwise it is 1 - P/2.
In other words, the one-tailed p-value is half of the two-tailed p-value when the estimated coefficient bi has the same sign as the coefficient in H1. Otherwise, it is 1 - P/2.
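These rules translate directly into a few lines of code. The function name `one_tailed_p` and the labels "greater" and "less" are hypothetical; the numerical values used in the calls (two-tailed P = 0.058, b = 22372) are those of the garage example of this course.

```python
def one_tailed_p(two_tailed_p, b, alternative):
    """One-tailed p-value from the two-tailed p-value P and the estimate b.

    alternative = "greater" for H1: beta > 0, "less" for H1: beta < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    # Does the estimated coefficient have the sign stated in H1?
    same_sign = b > 0 if alternative == "greater" else b < 0
    return two_tailed_p / 2 if same_sign else 1 - two_tailed_p / 2

p_right = one_tailed_p(0.058, 22372, "greater")   # estimate agrees with H1
p_left = one_tailed_p(0.058, -22372, "less")      # estimate agrees with H1
p_wrong = one_tailed_p(0.058, -22372, "greater")  # estimate contradicts H1

print(p_right, p_left, p_wrong)
```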
51
Question:
One wants to know if having a garage increases the total value of the property. The hypotheses to be tested should be:
H0: βgarage ≤ 0 vs H1: βgarage > 0
Since bgarage = 22372 > 0, the p-value corresponding to H1: βgarage > 0 is 0.058/2 = 0.029 < 0.05. The answer is yes, because we reject H0 in favor of H1.
Does the decision depend on coding?
52
If the dummy variable is instead defined as 0 if there is a garage and 1 otherwise, we would have obtained:
Total = - 72080 + 1.83 Land + 47.2 Sq.Feet + 11535 C.bathro + 18899 Bathro - 22372 Garage

Predictor      Coef    StDev       T       P
Constant     -72080    14175   -5.08   0.000
Land         1.8342   0.1892    9.69   0.000
Sq.Feet      47.175    7.013    6.73   0.000
C.bathro      11535     3256    3.54   0.001
Bathro        18899     5211    3.63   0.001
Garage       -22372    11116   -2.01   0.058

S = 30671   R-Sq = 87.6%   R-Sq(adj) = 86.8%
53
In that case, the right choice for hypotheses would have been:
H0: βgarage ≥ 0 vs H1: βgarage < 0
The corresponding p-value stays 0.029 = 0.058/2 because bgarage = -22372 < 0 has the same sign as βgarage in H1.
54
Comparison of several means
Suppose one wants to compare the respective means of a quantitative variable Y for two groups: µ1 = mean of group 1, µ2 = mean of group 2.
One can use regression by defining X = 1 for group 1 and X = 0 for group 2.
In this case, in the model Y = β0 + βX + ε, β0 = µ2 and β = µ1 – µ2.
55
Hypothesis H1 : µ1 > µ2 corresponds to H1 : β > 0.
Hypothesis H1 : µ1 < µ2 corresponds to H1 : β < 0.
Hypothesis H1 : µ1 ≠ µ2 corresponds to H1 : β ≠ 0.
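The link between the dummy regression and the two group means can be checked numerically. The measurements below are made up purely for illustration (they are not taken from any file of the course); the least squares slope on the 0/1 variable comes out equal to the difference of the sample means, and the intercept equal to the mean of the group coded 0.

```python
# Hypothetical measurements for two groups (illustrative values only).
group1 = [5.4, 5.6, 5.1, 5.8, 5.3]   # coded X = 1
group2 = [5.0, 5.2, 4.9, 5.5, 5.1]   # coded X = 0

x = [1] * len(group1) + [0] * len(group2)
y = group1 + group2

# Ordinary least squares for y = b0 + b*x.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
b0 = y_bar - b * x_bar

mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)

print(f"b0 = {b0:.3f}, mean of group 2 = {mean2:.3f}")
print(f"b  = {b:.3f}, mean1 - mean2   = {mean1 - mean2:.3f}")
```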
56
Example
A manager has some doubts about the (positive) effect of a course meant to improve the speed at which employees perform a given task.
To check his belief, he asked a technician to choose 10 employees at random and to measure the time (in hours) each one takes to complete a task.
Then the same employees attended the course. After the course, the employees had to perform a similar task. The results are summarized in the following table:
manager.xls
57
Questions:
a) Should the company maintain the training program? Take α = 5%.
b) The technician in charge of the measurements forgot to identify the employees on the measurement form. What is the conclusion using that data set?
Unfortunately, case b) is based on a real case.
58
Solution
For situation a), the data are paired and we have to check whether the differences « Before – After » are significantly positive. The p-value is 0.0003 < 0.05 = α. One rejects H0 in favor of H1, and the manager concludes that the program should be maintained.
59
In the second case, the data are not paired. One can use regression with Y = time of execution, X = 1 for measurements before the course and X = 0 for measurements after the course.
In that case, the right choice for H1 is:
H1: β > 0
Results are given by:

             Coefficients   Standard Error       t Stat       P-value
Intercept           5.217      0.129989316   40.1340677    4.5838E-19
X                   0.244      0.183832653   1.32729412    0.20100167
60
Since H1 : β > 0 (which is equivalent to H1 : µbefore > µafter) and b = 0.244 > 0, the p-value is 0.201/2 = 0.1005 > 0.05.
One does not reject H0, so with these data the training program should not be maintained.
This is a very good example of the consequence of the greater variability of two independent samples compared with a paired sample.
61
Remark: Comparing several means
If one needs to compare the means of k groups for some variable Y, one can also use regression.
For i = 1, 2, …, k-1, set:
Xi = 1 for group i, 0 otherwise.
Then β0 = mean of group k = µk and βi = µi - µk, 1 ≤ i ≤ k-1.
62
Therefore, the regression test where H0 is given by
H0: β1 = β2 = ... = βk-1 = 0
is equivalent to a test where H0 is given by
H0: µ1 = µ2 = ... = µk
If H0 is rejected, then we conclude that at least two means are different.