Multiple Regression Analysis
Multiple Regression Model
Sections 16.1 - 16.6
The Model and Assumptions
If we can predict the value of a variable on the basis of one explanatory variable, we might make a better prediction with two or more explanatory variables
Expect to reduce the chance component of our model
Hope to reduce the standard error of the estimate
Expect to eliminate bias that may result if we ignore a variable that substantially affects the dependent variable
The Model and Assumptions
The multiple regression model is
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$
where
$y_i$ is the dependent variable for the ith observation
$\beta_0$ is the Y intercept
$\beta_1, \dots, \beta_k$ are the population partial regression coefficients
$x_{1i}, x_{2i}, \dots, x_{ki}$ are the observed values of the independent variables, $X_1, X_2, \dots, X_k$
k = 1, 2, 3, ..., K explanatory variables
The Model and Assumptions
The assumptions of the model are the same as those discussed for simple regression
The expected value of Y for the given Xs is a linear function of the Xs
$E(y_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}$
The standard deviation of the Y terms for given X values is a constant, designated as $\sigma_{y|x}$
The observations, $y_i$, are statistically independent
The distribution of the Y values (error terms) is normal
Interpreting the Partial Regression Coefficients
For each X term there is a partial regression coefficient, $\beta_k$
This coefficient measures the change in E(Y) given a one unit change in the explanatory variable $X_k$,
holding the remaining explanatory variables constant
controlling for the remaining explanatory variables
ceteris paribus
Equivalent to a partial derivative in calculus
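The partial-derivative reading can be written out in the model's own notation. This is only a restatement of the bullet above, under the assumption that the X's are treated as continuous:

```latex
E(y_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}
\quad\Longrightarrow\quad
\frac{\partial E(y_i)}{\partial x_{ki}} = \beta_k
```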
Method of Least Squares - OLS
To estimate the population regression equation, we use the method of least squares
The model written in terms of the sample notation is
$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki} + e_i$
The sample regression equation is
$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$
Method of Least Squares - OLS
Goal is to minimize the distance between the predicted values of Y, the $\hat{y}_i$, and the observed values, $y_i$, that is, minimize the residual, $e_i$
Minimize
$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \dots - b_k x_{ki})^2$
Method of Least Squares - OLS
Take partial derivatives of SSE with respect to each of the partial regression coefficients and the intercept
Each equation is set equal to zero
This gives us k+1 equations in k+1 unknowns
The equations must be independent and non-homogeneous
Using matrix algebra or a computer, this system of equations can be solved
With a single explanatory variable, the fitted model is a straight line
With two explanatory variables, the model represents a plane in a three dimensional space
With three or more variables it becomes a hyperplane in higher dimensional space
The sample regression equation is correctly called a regression surface, but we will call it a regression line
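The k+1 equations described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slides' output: the function names `solve_linear_system` and `ols` are hypothetical, the data are made up (generated exactly from y = 1 + 2*x1 + 3*x2 with no error term), and the system is solved by plain Gaussian elimination rather than a statistics package.

```python
# Minimal OLS sketch: build the normal equations (X'X) b = X'y and solve them.
# The data are made up for illustration; they are not the CPS sample.

def solve_linear_system(a, rhs):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("the normal equations are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution


def ols(xs, ys):
    """Return [b0, b1, ..., bk] for rows of explanatory values xs and outcomes ys."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 for the intercept
    p = len(design[0])                          # k + 1 unknowns
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
coefficients = ols(xs, ys)
print([round(b, 6) for b in coefficients])
```

Because the made-up data contain no error term, the solved coefficients should recover the intercept and slopes used to generate them, up to floating-point error. The `ValueError` branch corresponds to the slide's requirement that the equations be independent.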
An Example: The Human Capital Model
Consider education as an investment in human capital
There should be a return on this investment in terms of higher future earnings
Most people accept that earnings tend to rise with schooling levels, but this knowledge by itself does not imply that individuals should go on for more schooling
More schooling is usually costly
Direct payments (tuition)
Indirect payments (foregone earnings)
Thus the actual magnitude of the increased earnings with additional years of schooling is important
Cannot simply calculate the average earnings for a sample of workers with different education levels
Have to consider the effects on earnings of other factors, for example, experience in the labor market, age, ability, race and sex
An Example: The Human Capital Model
Consider a first simple model
(1) Earnings = $\beta_0 + \beta_1$ education + $\varepsilon$
Expect that the coefficient on education will be positive, $\beta_1 > 0$
Realize that most people have higher earnings as they age, regardless of their education
If age and education are positively correlated, the estimated regression coefficient on education will overstate the marginal impact of education
A better model would account for the effect of age
(2) Earnings = $\beta_0 + \beta_1$ education + $\beta_2$ age + $\varepsilon$
A Conceptual Experiment
Multiple regression involves a conceptual experiment that we might not be able to carry out in practice
What we would like to do is to compare individuals with different education levels who are the same age
We would then be able to see the effects of education on average earnings, while controlling for age
Current Population Survey, White Males, March 1991
All workers are 40 years old
             n      Average Annual Earnings
Educ = 12    227    $27,970.59
Educ = 13    132    $31,523.24
What is the effect of an additional year of education?
$31,523.24 - $27,970.59 = $3,552.65
A Conceptual Experiment
Frequently we do not have large enough data sets to be able to ask this type of question
Multiple regression analysis allows us to perform the conceptual exercise of comparing individuals with the same age and different education levels, even if the sample contains no such pairs of individuals
Sample Data
Data were obtained from the March 1992 Current Population Survey
The CPS is the source of the official Government statistics on employment and unemployment
A very important secondary purpose is to collect information such as age, sex, race, education, income and previous work experience
The survey has been conducted monthly for over 50 years
About 57,000 households are interviewed monthly, containing approximately 114,500 persons 15 years and older; based on the civilian non-institutional population
For the multiple regression question, the sample consists of white male respondents 18-65 years old, who spent at least one week in the labor force in the preceding year and who provided information on wage earnings during the preceding year
Sample size is 30,040
Students download Multiple Regression Human Capital Hand-out
Sample Statistics
age earn educ
Mean 37.50 27561.92 13.02
Standard Error 0.070 119.610 0.017
Median 36 24000 13
Mode 35 30000 12
Standard Deviation 12.19 20730.89 2.92
Sample Variance 148.54 429769891.23 8.54
Minimum 18 2 0
Maximum 65 199998 20
Count 30040 30040 30040
In 1991, the average white male in the sample was 37.5 years old, had 13.0 years of education and earned $27,561.92.
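The "Standard Error" row of the table is the standard error of each sample mean, that is, the standard deviation divided by the square root of the count. The short check below reproduces that row from the other rows; the variable names are illustrative and the inputs are the rounded values printed in the table, so agreement is only to about three decimal places.

```python
import math

# Values copied from the Sample Statistics table.
n = 30040
std_dev = {"age": 12.19, "earn": 20730.89, "educ": 2.92}

# Standard error of the mean = standard deviation / sqrt(n).
std_error = {name: s / math.sqrt(n) for name, s in std_dev.items()}

for name, se in std_error.items():
    print(name, round(se, 3))
```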
Correlation Matrix
Second, consider the correlation matrix, which shows the simple correlation coefficients for all pairs of variables
There is a small, but positive correlation between education and age
A simple regression of earnings on education will overstate the effect of education because education is positively correlated with age and age has a strong positive effect on earnings
age earn educ
age 1
earn 0.365051 1
educ 0.072856 0.413496 1
Earnings = $\beta_0 + \beta_1$ education + $\varepsilon$
$\widehat{earn} = -10622.39 + 2933.78\,educ$
Regression Statistics
Multiple R           0.41349589
R Square             0.170978851
Adjusted R Square    0.170951252
Standard Error       18875.91561
Observations         30040
ANOVA
              df       SS          MS             F              Significance F
Regression    1        2.21E+12    2.21E+12       6195.092532    0
Residual      30038    1.07E+13    356300190.2
Total         30039    1.29E+13
              Coefficients     Standard Error    t Stat          P-value
Intercept     -10622.38757     497.2073501       -21.36410005    1.62E-100
educ          2933.783915      37.27384753       78.70891012     0
Labels on the output: $b_0$ = -10622.39, $b_1$ = 2933.78, $S_{b_0}$ = 497.21, $S_{b_1}$ = 37.27, $S_e$ = 18875.92
Is Education a Significant Explanatory Variable?
Use t-test
H0: $\beta_1 \le 0$ No relationship
H1: $\beta_1 > 0$ Positive relationship
t-test statistic = 78.709 and the p-value is 0.000
Reject the H0: $\beta_1 \le 0$
There is a significant positive relationship between education and earnings
Additional Information from the Analysis
For each additional year of schooling, average earnings increase by $2,933.78
The $R^2$ = .1710
Find that 17.1% of the variation in earnings across workers is explained by variation in education levels
The standard error of the estimate, $S_e$, equals $18,876
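The figures in the simple regression output are tied together by a few identities that hold whenever there is a single explanatory variable: the t statistic is the coefficient divided by its standard error, R Square is the square of Multiple R, and the F statistic is the square of the t statistic. The sketch below checks these against the printed output; the variable names are illustrative and the inputs are copied from the output.

```python
# Values copied from the simple regression output (earnings on education).
b1 = 2933.783915        # coefficient on educ
s_b1 = 37.27384753      # standard error of that coefficient
multiple_r = 0.41349589 # with one X this is the correlation between earn and educ

t_stat = b1 / s_b1              # t = (b1 - 0) / S_b1
r_squared = multiple_r ** 2     # R Square = (Multiple R)^2
f_stat = t_stat ** 2            # with one explanatory variable, F = t^2

print(round(t_stat, 3), round(r_squared, 4), round(f_stat, 2))
```

The F = t squared relationship applies only to the one-variable case; with two or more explanatory variables the F test and the individual t tests answer different questions.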
Earnings = $\beta_0 + \beta_1$ education + $\beta_2$ age + $\varepsilon$
$\widehat{earn} = -29834.04 + 2759.73\,educ + 572.74\,age$
Regression Statistics
Multiple R           0.532684023
R Square             0.283752268
Adjusted R Square    0.283704577
Standard Error       17545.43262
Observations         30040
ANOVA
              df       SS          MS            F              Significance F
Regression    2        3.66E+12    1.83E+12      5949.803748    0
Residual      30037    9.25E+12    307842206
Total         30039    1.29E+13
              Coefficients     Standard Error    t Stat          P-value
Intercept     -29834.04464     540.0326891       -55.24488654    0
educ          2759.729904      34.73889338       79.44207877     0
age           572.7382481      8.328296328       68.77015725     0
Labels on the output: $b_0$ = -29834.04, $b_1$ = 2759.73, $b_2$ = 572.74, $S_{b_0}$ = 540.03, $S_{b_1}$ = 34.74, $S_{b_2}$ = 8.33, $S_e$ = 17545.43
Interpret the Coefficients
In terms of this problem
For each additional year of schooling, average earnings increase by $2,759.73, controlling for age
For each additional year of age, average earnings increase by $572.74, controlling for schooling
$\widehat{earn} = -29834.04 + 2759.73\,educ + 572.74\,age$
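The drop in the education coefficient from $2,933.78 to $2,759.73 once age is controlled for can be accounted for numerically. A standard least squares identity, which is not stated on the slides, says that the simple-regression coefficient equals the multiple-regression coefficient plus the age coefficient times the slope from an auxiliary regression of age on education. The sketch below applies it to the published figures; the names are illustrative, and because the inputs are rounded the reconstruction matches only approximately.

```python
import math

# Values copied from the slides (sample statistics, correlation matrix, regression output).
var_age, var_educ = 148.54, 8.54   # sample variances
r_age_educ = 0.072856              # correlation between age and educ
b_educ_multiple = 2759.729904      # educ coefficient, controlling for age
b_age_multiple = 572.7382481       # age coefficient, controlling for educ
b_educ_simple = 2933.783915        # educ coefficient, earnings on educ only

# Slope from an auxiliary regression of age on educ: r * s_age / s_educ.
slope_age_on_educ = r_age_educ * math.sqrt(var_age) / math.sqrt(var_educ)

# Part of the simple-regression coefficient that is really an age effect.
age_share = b_age_multiple * slope_age_on_educ
reconstructed_simple = b_educ_multiple + age_share

print(round(slope_age_on_educ, 3), round(age_share, 2), round(reconstructed_simple, 2))
```

Read this way, roughly $174 of the simple-regression estimate reflects the fact that more educated workers in the sample are, on average, slightly older.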
Prediction
Predict the mean earnings for white male workers who are 30 years old and have a college degree (16 years of education)
$\widehat{earn} = -29834.04 + 2759.73(16) + 572.74(30) = 31{,}503.84$
The standard error of the estimate, $S_e$ = $17,545
$S_e = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}}$
where k = no. of explanatory variables
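The prediction and the standard error of the estimate can both be reproduced from the regression output. In the sketch below, `predict_earnings` is a hypothetical helper that uses the coefficients rounded as on the slide, and the residual mean square (SSE divided by n - k - 1) is taken from the MS column of the ANOVA table.

```python
import math

def predict_earnings(educ, age):
    """Predicted mean earnings from the fitted equation, coefficients rounded as on the slide."""
    return -29834.04 + 2759.73 * educ + 572.74 * age

# College degree taken as 16 years of education, age 30, as in the slide's calculation.
predicted = predict_earnings(educ=16, age=30)

# S_e = sqrt(SSE / (n - k - 1)); the ANOVA table reports SSE / (n - k - 1) as the Residual MS.
residual_mean_square = 307842206
s_e = math.sqrt(residual_mean_square)

print(round(predicted, 2), round(s_e, 2))
```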
Assessing the Regression as a Whole
Want to assess the performance of the model as a whole
H0: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$
The model has no worth
H1: At least one regression coefficient is not equal to zero
The model has worth
If all the b's are close to zero, then the SSR will approach zero
$\dfrac{SSR}{SSE} \to 0$
Assessing the Regression as a Whole
Test Statistic
$F_{k,\,n-k-1} = \dfrac{SSR/k}{SSE/(n-k-1)}$
where k = the number of explanatory variables
If the null hypothesis is true, the calculated test statistic will be close to zero; if the null hypothesis is false, the F test statistic will be "large"
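Assessing the Regression as a Whole
The calculated F test statistic is compared with the critical F to determine whether the null hypothesis should be rejected
If $F_{k,\,n-k-1} > F_{\alpha,\,k,\,n-k-1}$ (cv), reject the H0
[Figure: F distribution with the rejection region of area $\alpha$ to the right of the critical value (cv)]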
ANOVA Table in Regression
Regression Statistics
Multiple R           0.532684023
R Square             0.283752268
Adjusted R Square    0.283704577
Standard Error       17545.43262
Observations         30040
ANOVA
              df       SS          MS            F              Significance F
Regression    2        3.66E+12    1.83E+12      5949.803748    0
Residual      30037    9.25E+12    307842206
Total         30039    1.29E+13
              Coefficients     Standard Error    t Stat          P-value
Intercept     -29834.04464     540.0326891       -55.24488654    0
educ          2759.729904      34.73889338       79.44207877     0
age           572.7382481      8.328296328       68.77015725     0
$F_{k,\,n-k-1} = \dfrac{SSR/k}{SSE/(n-k-1)}$, where SSR is the Regression SS (3.66E+12) and SSE is the Residual SS (9.25E+12)
P-value: Finally, note the p-value, written as Significance F, which is reported as 0.0000. This tells us that the probability of observing a test statistic as large as 5,949.8 if the null hypothesis is true is essentially zero (too small to show at the displayed precision). The model has worth.
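The F statistic in the ANOVA table can be recomputed without the rounded sums of squares. Dividing both SSR and SSE by SST turns the formula into one that needs only R Square, n and k. The sketch below uses that equivalent form with the printed R Square; the variable names are illustrative.

```python
# Values copied from the multiple regression output.
n = 30040              # observations
k = 2                  # explanatory variables (educ and age)
r_squared = 0.283752268

# F = (SSR / k) / (SSE / (n - k - 1)); dividing SSR and SSE by SST gives
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).
f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

print(round(f_stat, 1))
```

The result agrees with the F value in the output to within rounding of the printed R Square.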
Inferences Concerning the Population Regression Coefficients
Which explanatory variables have coefficients significantly different from zero?
Perform a hypothesis test for each explanatory variable
Essentially the same t-test used for simple regression
Hypotheses
H0: $\beta_k = 0$
H1: $\beta_k \ne 0$
Inferences Concerning the Population Regression Coefficients
The test statistic is
$t_{n-K-1} = \dfrac{b_k - 0}{S_{b_k}}$
where K = number of independent variables
The denominator, $S_{b_k}$, is the standard error of the regression coefficient, $b_k$
Take the standard errors of the regression coefficients from the computer output
Inferences Concerning the Population Regression Coefficients
In our model, there are two explanatory variables
There will be two tests about population regression coefficients
Test whether Education is a significant variable
H0: $\beta_{educ} \le 0$
H1: $\beta_{educ} > 0$
Test whether Age is a significant variable
H0: $\beta_{age} \le 0$
H1: $\beta_{age} > 0$
Let $\alpha$ = 0.01
$t_{\alpha = .01}$ = 2.326 from the t tables
T-test
Test statistic: educ
$t = \dfrac{2759.73}{34.7389} = 79.442$, with n - K - 1 = 30037 degrees of freedom and $S_{b_{educ}} = 34.7389$
Reject the null hypothesis, one tail test, $\alpha$ = .01. Find that education is significantly and positively related to earnings.
Test statistic: age
$t = \dfrac{572.74}{8.328} = 68.77$, with n - K - 1 = 30037 degrees of freedom and $S_{b_{age}} = 8.328$
Again, we reject the null hypothesis and conclude that age is significantly and positively related to earnings.
              Coefficients     Standard Error    t Stat          P-value
Intercept     -29834.04464     540.0326891       -55.24488654    0
age           572.7382481      8.328296328       68.77015725     0
Educ          2759.729904      34.73889338       79.44207877     0
p-values < 0.01
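Both t tests follow the same recipe: divide the coefficient by its standard error and compare the result with the one-tail critical value. The sketch below repeats the two calculations using the figures from the output and the critical value of 2.326 quoted from the t tables; the dictionary names are illustrative.

```python
# One-tail critical value at alpha = 0.01, as quoted from the t tables.
critical_t = 2.326

# (coefficient, standard error) pairs copied from the regression output.
coefficients = {
    "educ": (2759.729904, 34.73889338),
    "age": (572.7382481, 8.328296328),
}

# t = (b_k - 0) / S_bk for each explanatory variable.
t_stats = {name: b / se for name, (b, se) in coefficients.items()}

# Reject H0: beta_k <= 0 when the test statistic exceeds the critical value.
rejected = {name: t > critical_t for name, t in t_stats.items()}

for name in coefficients:
    print(name, round(t_stats[name], 3), "reject H0" if rejected[name] else "do not reject H0")
```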
The Coefficient of Determination and the Adjusted R2
The $R^2$ value is still defined as the ratio of the SSR to the SST
We see that 28.38% of the variation in earnings is explained by variation in education and in age
The simple regression has an $R^2$ = 0.1710
Appears that adding the new explanatory variable improved the "goodness of fit"
This conclusion can be misleading
As we add new explanatory variables to our model, the $R^2$ never decreases and almost always increases, even when the new explanatory variables are not significant
The SSE never increases as more explanatory variables are added
This is a mathematical property and doesn't depend on the relevance of the additional variables
The Coefficient of Determination and the Adjusted R2
If we take into account the degrees of freedom, SSE/(n-k-1) can increase or decrease
Depending on whether the additional variables are significant explanatory variables or not
Adjust the $R^2$ statistic as follows:
$adj\,R^2 = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)} = 1 - \dfrac{SSE\,(n-1)}{SST\,(n-k-1)}$
Adjusted $R^2$ can increase if the additional explanatory variables are important
Can decrease if the additional explanatory variables are not significant
When comparing regression models with different numbers of explanatory variables, you should compare the adjusted $R^2$ to decide which is the best model
The adjusted $R^2 \le 1$, but can take on a value less than zero if the model is very poor
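Since SSE/SST equals 1 minus R Square, the adjustment can be computed from R Square, n and k alone. The sketch below applies it to both models in this example; `adjusted_r_squared` is a hypothetical helper name and the inputs are the R Square values printed in the two outputs.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1)), using SSE/SST = 1 - R^2."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

n = 30040
simple = adjusted_r_squared(0.170978851, n, k=1)    # earnings on education
multiple = adjusted_r_squared(0.283752268, n, k=2)  # earnings on education and age

print(round(simple, 6), round(multiple, 6))
```

With a sample this large the adjustment is tiny, so the adjusted values sit very close to the unadjusted ones. The adjusted value still rises from the one-variable model to the two-variable model, which supports keeping age in the model.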
Online Homework - Chapter 16 Multiple Regression CengageNOW sixteenth assignment