CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY
Session 3: Basic techniques for innovation data analysis. Part II: Introducing regression analysis
Taehyun Jung [email protected]
CIRCLE, Lund University
15.15-17.00 December 10 2012
For Survey of Quantitative Research, NORSI
CIRCLE, Lund University, Sweden
The simple linear regression model (or bivariate linear regression model, 2-variable linear regression model)

Y = \beta_0 + \beta_1 X + u

– Y = dependent variable, outcome variable, response variable, explained variable, predicted variable, regressand
– X = independent variable, explanatory variable, control variable, predictor variable, regressor, covariate
– u = error term, disturbance
– \beta_0 = intercept parameter
– \beta_1 = slope parameter
Bivariate Linear Regression Model
Y and X are observable (we have data/observations on them); \beta_0 and \beta_1 are unobserved but estimable under certain conditions; u is unobservable
– The model implies that u captures everything that determines Y except for X
– Omission of the influence of innumerable chance events
  Systematic influence: specification error
  Random influence: e.g. weather variations, etc. – the net influence of a large number of small and independent causes
– Measurement error
– Human indeterminacy: inherent randomness in human behavior
Bivariate Linear Regression Model
Parameters cannot be calculated but only estimated – because we do not know the actual values of the disturbances inherent in a sample of data
Estimator: the method of estimation; the formula or recipe by which the data are transformed into an actual estimate.
– Notation conventions for estimates: b_0 and b_1, or \hat{\beta}_0 and \hat{\beta}_1 (Ordinary Least Squares estimator)
What makes a “good” or “preferred” estimator?
– Computational cost
– Least squares
– Highest R2
– Unbiasedness
– Efficiency
“Preferred” or good estimators of \beta_0 and \beta_1
Minimize residuals
– Residuals (e) = actual values (Y) of the dependent variable – fitted values (\hat{Y}) of the dependent variable
– Minimize the sum of squared residuals
– Always met by the Ordinary Least Squares (OLS) estimator
Least squares
7
[Figure: y plotted against x, with residuals marked as the vertical distances between the observations and the fitted line]
How much variation in the dependent variable is explained by variation in the independent variables?
The OLS estimator minimizes SSR and, therefore, automatically maximizes R2
– Do not use R2 for determining the proper functional form or the appropriate independent variables
Highest R2
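As a sketch of how R2 is computed, here is a small pure-Python example with made-up numbers (not the lecture's data):

```python
# Illustrative sketch with hypothetical data: fit a line by OLS and
# compute R^2 = 1 - SSR/TSS.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# OLS slope and intercept (closed-form solution)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = ybar - b1 * xbar

ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
tss = sum((yi - ybar) ** 2 for yi in y)                        # total sum of squares
r2 = 1 - ssr / tss
print(round(r2, 4))  # → 0.9976
```

A high R2 here simply reflects that the made-up points lie nearly on a line; as the slide notes, it says nothing about whether the functional form is appropriate.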
Sampling distribution centered over the true population parameter
– Expected value of the estimated parameter, E(b_1), is equal to the true value of the parameter, \beta_1
– \beta_1 is the mean of the sampling distribution of b_1
– Only one of the good properties that the sampling distribution of an estimator can have
The OLS criterion (so far) has nothing to do with the sampling distribution
– We need further assumptions to make the OLS estimator unbiased
– How the disturbance is distributed is most important
Unbiasedness
Would you prefer to obtain your estimate by making a single random draw out of an unbiased sampling distribution with a small variance or out of an unbiased sampling distribution with a large variance?
The unbiased estimator with the smallest variance is called efficient (best)
BLUE: Best Linear Unbiased Estimator
Efficiency
Each value of Y thus has a non-random component, \beta_0 + \beta_1 X, and a random component, u. The first observation has been decomposed into these two components.
The discrepancies between the actual and fitted values of Y are known as the residuals.–Note that the values of the residuals are not the same as the values of the
disturbance term
Deriving linear regression coefficients
The OLS estimates minimize the residual sum of squares:

RSS = \sum_i e_i^2 = \sum_i (Y_i - b_0 - b_1 X_i)^2

Setting the partial derivatives with respect to b_0 and b_1 equal to zero gives the first-order conditions:

\frac{\partial RSS}{\partial b_0} = -2 \sum_i (Y_i - b_0 - b_1 X_i) = 0 \;\Rightarrow\; \sum_i Y_i - n b_0 - b_1 \sum_i X_i = 0 \;\Rightarrow\; b_0 = \bar{Y} - b_1 \bar{X}

\frac{\partial RSS}{\partial b_1} = -2 \sum_i X_i (Y_i - b_0 - b_1 X_i) = 0 \;\Rightarrow\; \sum_i X_i Y_i - b_0 \sum_i X_i - b_1 \sum_i X_i^2 = 0

Substituting b_0 = \bar{Y} - b_1 \bar{X} into the second condition and solving:

b_1 = \frac{\sum_i X_i Y_i - n \bar{X} \bar{Y}}{\sum_i X_i^2 - n \bar{X}^2}
Conditions for Minimizing RSS
Deriving linear regression coefficients (cont’d)
b_0 = \bar{Y} - b_1 \bar{X}

b_1 = \frac{\sum_i X_i Y_i - n \bar{X} \bar{Y}}{\sum_i X_i^2 - n \bar{X}^2} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
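The two algebraically equivalent forms of the slope estimator can be checked numerically; a minimal sketch with hypothetical data (not from the lecture):

```python
# Sketch: the deviation form and the raw-moment form of the OLS slope b1
# are algebraically identical; verify on a tiny hypothetical sample.
X = [2.0, 4.0, 6.0, 8.0]
Y = [3.0, 7.0, 8.0, 14.0]

n = len(X)
Xbar = sum(X) / n
Ybar = sum(Y) / n

# b1 = sum (Xi - Xbar)(Yi - Ybar) / sum (Xi - Xbar)^2
b1_dev = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / sum(
    (x - Xbar) ** 2 for x in X
)

# b1 = (sum Xi*Yi - n*Xbar*Ybar) / (sum Xi^2 - n*Xbar^2)
b1_raw = (sum(x * y for x, y in zip(X, Y)) - n * Xbar * Ybar) / (
    sum(x ** 2 for x in X) - n * Xbar ** 2
)

b0 = Ybar - b1_dev * Xbar  # intercept from the first-order condition
print(b1_dev, b1_raw, b0)  # → 1.7 1.7 -0.5
```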
We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b_0 and b_1.

True model: Y = \beta_0 + \beta_1 X + u

Fitted line: \hat{Y} = b_0 + b_1 X

b_0 = \bar{Y} - b_1 \bar{X}, \qquad b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
Hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth.

[Figure: scatter plot of hourly earnings ($), ranging from −20 to 120, against years of schooling, 0 to 20]
In this case there is only one explanatory variable, S, and its coefficient is 2.46. _cons, in Stata, refers to the constant. The estimate of the intercept is −13.93, so the fitted equation is \widehat{EARNINGS} = -13.93 + 2.46\,S.
Interpretation of a regression equation
. reg EARNINGS S
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------
Hourly earnings increase by $2.46 for each extra year of schooling.
Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work.
– Nonsense!
– The only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram
Interpretation of a regression equation
You can see that the t statistic for the coefficient of S is enormous. We would reject the null hypothesis that schooling does not affect earnings at the 1% significance level (critical value about 2.59).
Testing a hypothesis relating to a regression coefficient
. reg EARNINGS S   (output identical to that shown above)
– The critical value of t at the 5% significance level with 538 degrees of freedom is 1.965.
2.455 – 0.232 × 1.965 ≤ β1 ≤ 2.455 + 0.232 × 1.965
1.999 ≤ β1 ≤ 2.911
Testing a hypothesis relating to a regression coefficient: Confidence intervals
. reg EARNINGS S   (output identical to that shown above)
b_1 - \mathrm{s.e.}(b_1) \times t_{\mathrm{crit}} \;\le\; \beta_1 \;\le\; b_1 + \mathrm{s.e.}(b_1) \times t_{\mathrm{crit}}
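The interval reported by Stata can be reproduced from the coefficient and standard error (a Python sketch; the lecture itself uses Stata):

```python
# Sketch: reproduce the 95% confidence interval for the schooling coefficient
# from the Stata output (coefficient and s.e. as reported; t_crit from the slide).
b1 = 2.455321
se = 0.2318512
t_crit = 1.965   # approx. two-sided 5% critical value with 538 d.f.

lower = b1 - se * t_crit  # close to Stata's reported 1.999876
upper = b1 + se * t_crit  # close to Stata's reported 2.910765
```

The tiny discrepancy against Stata's interval comes from rounding the critical value to three decimal places.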
The null hypothesis that we are going to test is that the model has no explanatory power
k is the number of parameters in the regression equation, which at present is just 2.
n – k is, as with the t statistic, the number of degrees of freedom
F is a monotonically increasing function of R2– Why do we perform the test indirectly,
through F, instead of directly through R2? After all, it would be easy to compute the critical values of R2 from those for F
Hypotheses concerning goodness of fit are tested via the F statistic
Y = \beta_0 + \beta_1 X + u, \qquad H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0

R^2 = \frac{ESS}{TSS} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}

TSS = ESS + RSS: \qquad \sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2

F(k-1, n-k) = \frac{ESS/(k-1)}{RSS/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}
For simple regression analysis, the F statistic is the square of the t statistic.
For simple regression (k = 2), with s_u^2 = \frac{\sum_i e_i^2}{n-2}:

F(1, n-2) = \frac{ESS/1}{RSS/(n-2)} = \frac{b_1^2 \sum_i (X_i - \bar{X})^2}{s_u^2}

Note that ESS = \sum_i (\hat{Y}_i - \bar{Y})^2 = b_1^2 \sum_i (X_i - \bar{X})^2, because \hat{Y}_i - \bar{Y} = b_1 (X_i - \bar{X}).

Since \mathrm{s.e.}(b_1)^2 = \frac{s_u^2}{\sum_i (X_i - \bar{X})^2}, this gives

F = \left( \frac{b_1}{\mathrm{s.e.}(b_1)} \right)^2 = t^2
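This identity is easy to check with the figures from the regression output (a sketch; the coefficient and standard error are as Stata reports them):

```python
# Sketch: for simple regression, the F statistic equals the square of the
# t statistic on the slope (figures from the regression output above).
b1 = 2.455321
se = 0.2318512
t = b1 / se       # Stata reports t = 10.59
F = t ** 2        # Stata reports F(1, 538) = 112.15
print(round(t, 2), round(F, 2))  # → 10.59 112.15
```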
Calculation of F statistic
. reg EARNINGS S   (output identical to that shown above)
F(1, 540-2) = \frac{R^2/1}{(1-R^2)/(540-2)} = \frac{0.1725}{(1-0.1725)/538} = 112.15

F(1, 540-2) = \frac{ESS/1}{RSS/(540-2)} = \frac{19322}{92689/538} = \frac{19322}{172.28} = 112.15
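Both routes to the F statistic can be verified directly (a sketch using the sums of squares and R2 from the output above):

```python
# Sketch: compute F(1, 538) both from R^2 and from ESS/RSS, using the
# figures reported in the Stata output above.
n, k = 540, 2
R2 = 0.1725
ESS = 19321.5589
RSS = 92688.6722

F_from_r2 = (R2 / (k - 1)) / ((1 - R2) / (n - k))
F_from_ss = (ESS / (k - 1)) / (RSS / (n - k))
print(round(F_from_r2, 2), round(F_from_ss, 2))  # → 112.15 112.15
```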
A.1 The model is linear in parameters and correctly specified.
– Example of a model that is not linear in parameters: Y = \beta_0 X^{\beta_1} + u (the slope parameter enters as an exponent)
‘Linear in parameters’ means that each term on the right side includes a parameter as a simple multiplicative factor and there is no built-in relationship among the parameters.
Assumptions for OLS 1:
If we tried to regress Y on X when X is constant, we would find that we would not be able to compute the regression coefficients. Both the numerator and the denominator of the expression for b1 would be equal to zero, and without b1 we could not obtain b0 either.
If X_i = \bar{X} for all i.
A.2 There is some variation in the regressor in the sample.
b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
We assume that the expected value of the disturbance term in any observation should be zero. Sometimes the disturbance term will be positive, sometimes negative, but it should not have a systematic tendency in either direction.
E(u_i) = 0 for all i
Actually, if an intercept is included in the regression equation, it is usually reasonable to assume that this condition is satisfied automatically. The role of the intercept is to pick up any systematic but constant tendency in Y not accounted for by the regressor(s).
A.3 The disturbance term has zero expectation
We assume that the disturbance term is homoscedastic, meaning that its value in each observation is drawn from a distribution with constant population variance.
Once we have generated the sample, the disturbance term will turn out to be greater in some observations, and smaller in others, but there should not be any reason for it to be more erratic in some observations than in others.
A.4 The disturbance term is homoscedastic
\sigma_{u_i}^2 = \sigma_u^2 \quad \text{for all } i

Since E(u_i) = 0, this is equivalent to E(u_i^2) = \sigma_u^2 for all i.
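A hypothetical simulation (not from the lecture) illustrates the assumption: homoscedastic disturbances look equally erratic across the sample, heteroscedastic ones do not.

```python
# Hypothetical simulation: compare disturbances drawn with constant variance
# to disturbances whose variance grows across the sample.
import random

random.seed(0)
n = 10000
u_homo = [random.gauss(0, 2.0) for _ in range(n)]                  # constant sigma
u_hetero = [random.gauss(0, 0.5 + 3.0 * i / n) for i in range(n)]  # sigma grows with i

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Split each series in half and compare variances: roughly equal under
# homoscedasticity, very different under heteroscedasticity.
ratio_homo = variance(u_homo[n // 2:]) / variance(u_homo[:n // 2])
ratio_hetero = variance(u_hetero[n // 2:]) / variance(u_hetero[:n // 2])
print(round(ratio_homo, 2), round(ratio_hetero, 2))
```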
OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
This implies that if we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.
Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroskedasticity.
Consequences of Using OLS in the Presence of Heteroskedasticity
An earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP:

EARNINGS = \beta_0 + \beta_1 S + \beta_2 EXP + u
• Note that the interpretation of the model does not depend on whether S and EXP are correlated or not
• However we do assume that the effects of S and EXP on EARNINGS are additive. The impact of a difference in S on EARNINGS is not affected by the value of EXP, or vice versa.
The expression for b0, the intercept, is a straightforward extension of the expression for it in simple regression analysis.
However, the expressions for the slope coefficients are considerably more complex than the slope-coefficient expression in simple regression analysis.
Calculating regression coefficients
It indicates that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.
Intercept: taken literally, an individual with no schooling and no experience would earn −$26.49 per hour. Obviously, this is impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.
Interpretation of a regression equation
. reg EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
\widehat{EARNINGS} = -26.49 + 2.68\,S + 0.56\,EXP
A.1 The model is linear in parameters and correctly specified.
A.2 There does not exist an exact linear relationship among the regressors in the sample.
A.3 The disturbance term has zero expectation
A.4 The disturbance term is homoscedastic
A.5 The values of the disturbance term have independent distributions
A.6 The disturbance term has a normal distribution
Properties of the multiple regression coefficients. Only A.2 is different.
The inclusion of the new term has had a dramatic effect on the coefficient of EXP.
The high correlation causes the standard error of EXP to be larger than it would have been if EXP and EXPSQ had been less highly correlated, warning us that the point estimate is unreliable.
Multicollinearity
. reg EARNINGS S EXP EXPSQ
------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------
. reg EARNINGS S EXP   (output identical to that shown above)
EARNINGS = \beta_0 + \beta_1 S + \beta_2 EXP + \beta_3 EXPSQ + u
. cor EXP EXPSQ
(obs=540)

        |      EXP    EXPSQ
--------+------------------
    EXP |   1.0000
  EXPSQ |   0.9812   1.0000
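Why EXP and EXPSQ collide is easy to see with any positive-valued variable; a sketch with hypothetical experience values (not the NLSY data):

```python
# Sketch: over a positive range, a variable and its square move almost in
# lockstep, so their sample correlation is close to 1 (hypothetical values).
import math

exp_years = list(range(1, 41))       # stand-in for years of work experience
exp_sq = [v ** 2 for v in exp_years]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = corr(exp_years, exp_sq)
print(round(r, 3))
```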
When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
– The standard errors and t tests nevertheless remain valid.
Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship means there will be a high correlation, but this is not always the case when there are more than two.
Note that multicollinearity does not cause the regression coefficients to be biased.
Reduce the variance of the disturbance term by including further relevant variables in the model
Increase the number of observations
Increase MSD(X2) (the variation in the explanatory variables)
– For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
Reduce the correlation between the explanatory variables:
– Combine the correlated variables
– Drop some of the correlated variables
– However, this last approach to multicollinearity is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that the only reason their coefficients are insignificant is that there is a problem of multicollinearity
\sigma_{b_2}^2 = \frac{\sigma_u^2}{\sum_i (X_{2i} - \bar{X}_2)^2} \times \frac{1}{1 - r_{X_2,X_3}^2} = \frac{\sigma_u^2}{n \, \mathrm{MSD}(X_2)} \times \frac{1}{1 - r_{X_2,X_3}^2}

where r_{X_2,X_3} is the sample correlation between X_2 and X_3, and \mathrm{MSD}(X_2) = \frac{1}{n}\sum_i (X_{2i} - \bar{X}_2)^2.
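Plugging the observed EXP/EXPSQ correlation into the 1/(1 − r²) factor shows the inflation directly (a sketch):

```python
# Sketch: the multiplier 1/(1 - r^2) applied to the EXP/EXPSQ correlation
# reported by Stata above (r = 0.9812).
r = 0.9812
inflation = 1 / (1 - r ** 2)
print(round(inflation, 1))  # → 26.8
```

The variance of the slope estimate is thus inflated roughly 27-fold relative to uncorrelated regressors, so its standard error is about five times larger.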
1. Use common sense and economic theory
2. Avoid Type III errors
– Producing the right answer to the wrong question is called a Type III error
– Place relevance before mathematical elegance
3. Know the context
– Do not perform ignorant statistical analyses
4. Inspect the data
– Place data cleanliness ahead of econometric godliness
5. Keep it sensibly simple
– Do not talk Greek without knowing the English translation
6. Look long and hard at your results
– Apply the laugh test
7. Beware the costs of data mining
– E.g. tailoring one’s specification to the data, resulting in a specification that is misleading
8. Be prepared to compromise
– Should a proxy be used? Can sample attrition be ignored?
9. Do not confuse significance with substance
10. Report a sensitivity analysis
Kennedy’s 10 commandments of applied econometrics