CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY
Session 3: Basic techniques for innovation data analysis. Part II: Introducing regression analysis
Taehyun Jung [email protected]
CIRCLE, Lund University
15.15-17.00 December 10 2012
For Survey of Quantitative Research, NORSI
CIRCLE, Lund University, Sweden
The simple linear regression model (or bivariate linear regression model, 2-variable linear regression model)

Y = \beta_0 + \beta_1 X + u

– Y = dependent variable, outcome variable, response variable, explained variable, predicted variable, regressand
– X = independent variable, explanatory variable, control variable, predictor variable, regressor, covariate
– u = error term, disturbance
– \beta_0 = intercept parameter
– \beta_1 = slope parameter
Bivariate Linear Regression Model
Y and X are observable (we have data/observations on them); \beta_0 and \beta_1 are unobserved but estimable under certain conditions; u is unobservable
– The model implies that u captures everything that determines Y except for X
– Omission of the influence of innumerable chance events
  Systematic influence: specification error
  Random influence: e.g. weather variations, etc. – the net influence of a large number of small and independent causes
– Measurement error
– Human indeterminacy: inherent randomness in human behavior
Bivariate Linear Regression Model
Parameters cannot be calculated but only estimated – because we do not know the actual values of the disturbances inherent in a sample of data
Estimator: the method of estimation; the formula or recipe by which the data are transformed into an actual estimate.
– Notation conventions for estimates: b_0 and b_1, or \hat{\beta}_0 and \hat{\beta}_1 (Ordinary Least Squares estimator)
What makes a “good” or “preferred” estimator?
– Computational cost
– Least squares
– Highest R2
– Unbiasedness
– Efficiency
“Preferred” or good estimators of \beta_0 and \beta_1
Minimize residuals
– Residuals (e) = actual values (Y) of the dependent variable – fitted values (\hat{Y}) of the dependent variable
– Minimize the sum of squared residuals
– Always met by the Ordinary Least Squares (OLS) estimator
Least squares
7
[Figure: y plotted against x, with residuals marked as the vertical distances between the observations and the fitted line]
How much variation in the dependent variable is explained by variation in the independent variables?
The OLS estimator minimizes SSR and, therefore, automatically maximizes R2
– Do not use R2 for determining the proper functional form or the appropriate independent variables
Highest R2
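As a sketch of how R2 is computed, here is a small pure-Python example with made-up numbers (not the lecture's data):

```python
# Illustrative sketch with hypothetical data: fit a line by OLS and
# compute R^2 = 1 - SSR/TSS.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# OLS slope and intercept (closed-form solution)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = ybar - b1 * xbar

ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
tss = sum((yi - ybar) ** 2 for yi in y)                        # total sum of squares
r2 = 1 - ssr / tss
print(round(r2, 4))  # → 0.9976
```

A high R2 here simply reflects that the made-up points lie nearly on a line; as the slide notes, it says nothing about whether the functional form is appropriate.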
Sampling distribution centered over the true population parameter
– Expected value of the estimated parameter, E(b_1), is equal to the true value of the parameter, \beta_1
– \beta_1 is the mean of the sampling distribution of b_1
– Only one of the good properties that the sampling distribution of an estimator can have
The OLS criterion (so far) has nothing to do with the sampling distribution
– We need further assumptions to make the OLS estimator unbiased
– How the disturbance is distributed is most important
Unbiasedness
Would you prefer to obtain your estimate by making a single random draw out of an unbiased sampling distribution with a small variance or out of an unbiased sampling distribution with a large variance?
The unbiased estimator with the smallest variance is called efficient (best)
BLUE: Best Linear Unbiased Estimator
Efficiency
Each value of Y thus has a non-random component, \beta_0 + \beta_1 X, and a random component, u. The first observation has been decomposed into these two components.
The discrepancies between the actual and fitted values of Y are known as the residuals.–Note that the values of the residuals are not the same as the values of the
disturbance term
Deriving linear regression coefficients
The OLS estimates minimize the residual sum of squares:

RSS = \sum_i e_i^2 = \sum_i (Y_i - b_0 - b_1 X_i)^2

Setting the partial derivatives with respect to b_0 and b_1 equal to zero gives the first-order conditions:

\frac{\partial RSS}{\partial b_0} = -2 \sum_i (Y_i - b_0 - b_1 X_i) = 0 \;\Rightarrow\; \sum_i Y_i - n b_0 - b_1 \sum_i X_i = 0 \;\Rightarrow\; b_0 = \bar{Y} - b_1 \bar{X}

\frac{\partial RSS}{\partial b_1} = -2 \sum_i X_i (Y_i - b_0 - b_1 X_i) = 0 \;\Rightarrow\; \sum_i X_i Y_i - b_0 \sum_i X_i - b_1 \sum_i X_i^2 = 0

Substituting b_0 = \bar{Y} - b_1 \bar{X} into the second condition and solving:

b_1 = \frac{\sum_i X_i Y_i - n \bar{X} \bar{Y}}{\sum_i X_i^2 - n \bar{X}^2}
Conditions for Minimizing RSS
Deriving linear regression coefficients (cont’d)
b_0 = \bar{Y} - b_1 \bar{X}

b_1 = \frac{\sum_i X_i Y_i - n \bar{X} \bar{Y}}{\sum_i X_i^2 - n \bar{X}^2} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
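The two algebraically equivalent forms of the slope estimator can be checked numerically; a minimal sketch with hypothetical data (not from the lecture):

```python
# Sketch: the deviation form and the raw-moment form of the OLS slope b1
# are algebraically identical; verify on a tiny hypothetical sample.
X = [2.0, 4.0, 6.0, 8.0]
Y = [3.0, 7.0, 8.0, 14.0]

n = len(X)
Xbar = sum(X) / n
Ybar = sum(Y) / n

# b1 = sum (Xi - Xbar)(Yi - Ybar) / sum (Xi - Xbar)^2
b1_dev = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / sum(
    (x - Xbar) ** 2 for x in X
)

# b1 = (sum Xi*Yi - n*Xbar*Ybar) / (sum Xi^2 - n*Xbar^2)
b1_raw = (sum(x * y for x, y in zip(X, Y)) - n * Xbar * Ybar) / (
    sum(x ** 2 for x in X) - n * Xbar ** 2
)

b0 = Ybar - b1_dev * Xbar  # intercept from the first-order condition
print(b1_dev, b1_raw, b0)  # → 1.7 1.7 -0.5
```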
We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b_0 and b_1.

True model: Y = \beta_0 + \beta_1 X + u

Fitted line: \hat{Y} = b_0 + b_1 X

b_0 = \bar{Y} - b_1 \bar{X}, \qquad b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
Hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth.

[Figure: scatter plot of hourly earnings ($), ranging from −20 to 120, against years of schooling, 0 to 20]
In this case there is only one explanatory variable, S, and its coefficient is 2.46. _cons, in Stata, refers to the constant. The estimate of the intercept is −13.93, so the fitted equation is \widehat{EARNINGS} = -13.93 + 2.46\,S.
Interpretation of a regression equation
. reg EARNINGS S
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------
Hourly earnings increase by $2.46 for each extra year of schooling.
Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work.
– Nonsense!
– The only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram
Interpretation of a regression equation
You can see that the t statistic for the coefficient of S is enormous. We would reject the null hypothesis that schooling does not affect earnings at the 1% significance level (critical value about 2.59).
Testing a hypothesis relating to a regression coefficient
. reg EARNINGS S   (output identical to that shown above)
– The critical value of t at the 5% significance level with 538 degrees of freedom is 1.965.
2.455 – 0.232 × 1.965 ≤ β1 ≤ 2.455 + 0.232 × 1.965
1.999 ≤ β1 ≤ 2.911
Testing a hypothesis relating to a regression coefficient: Confidence intervals
. reg EARNINGS S   (output identical to that shown above)
b_1 - \mathrm{s.e.}(b_1) \times t_{\mathrm{crit}} \;\le\; \beta_1 \;\le\; b_1 + \mathrm{s.e.}(b_1) \times t_{\mathrm{crit}}
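The interval reported by Stata can be reproduced from the coefficient and standard error (a Python sketch; the lecture itself uses Stata):

```python
# Sketch: reproduce the 95% confidence interval for the schooling coefficient
# from the Stata output (coefficient and s.e. as reported; t_crit from the slide).
b1 = 2.455321
se = 0.2318512
t_crit = 1.965   # approx. two-sided 5% critical value with 538 d.f.

lower = b1 - se * t_crit  # close to Stata's reported 1.999876
upper = b1 + se * t_crit  # close to Stata's reported 2.910765
```

The tiny discrepancy against Stata's interval comes from rounding the critical value to three decimal places.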
The null hypothesis that we are going to test is that the model has no explanatory power
k is the number of parameters in the regression equation, which at present is just 2.
n – k is, as with the t statistic, the number of degrees of freedom
F is a monotonically increasing function of R2– Why do we perform the test indirectly,
through F, instead of directly through R2? After all, it would be easy to compute the critical values of R2 from those for F
Hypotheses concerning goodness of fit are tested via the F statistic
Y = \beta_0 + \beta_1 X + u, \qquad H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0

R^2 = \frac{ESS}{TSS} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}

TSS = ESS + RSS: \qquad \sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2

F(k-1, n-k) = \frac{ESS/(k-1)}{RSS/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}
For simple regression analysis, the F statistic is the square of the t statistic.
For simple regression (k = 2), with s_u^2 = \frac{\sum_i e_i^2}{n-2}:

F(1, n-2) = \frac{ESS/1}{RSS/(n-2)} = \frac{b_1^2 \sum_i (X_i - \bar{X})^2}{s_u^2}

Note that ESS = \sum_i (\hat{Y}_i - \bar{Y})^2 = b_1^2 \sum_i (X_i - \bar{X})^2, because \hat{Y}_i - \bar{Y} = b_1 (X_i - \bar{X}).

Since \mathrm{s.e.}(b_1)^2 = \frac{s_u^2}{\sum_i (X_i - \bar{X})^2}, this gives

F = \left( \frac{b_1}{\mathrm{s.e.}(b_1)} \right)^2 = t^2
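This identity is easy to check with the figures from the regression output (a sketch; the coefficient and standard error are as Stata reports them):

```python
# Sketch: for simple regression, the F statistic equals the square of the
# t statistic on the slope (figures from the regression output above).
b1 = 2.455321
se = 0.2318512
t = b1 / se       # Stata reports t = 10.59
F = t ** 2        # Stata reports F(1, 538) = 112.15
print(round(t, 2), round(F, 2))  # → 10.59 112.15
```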
Calculation of F statistic
. reg EARNINGS S   (output identical to that shown above)
F(1, 540-2) = \frac{R^2/1}{(1-R^2)/(540-2)} = \frac{0.1725}{(1-0.1725)/538} = 112.15

F(1, 540-2) = \frac{ESS/1}{RSS/(540-2)} = \frac{19322}{92689/538} = \frac{19322}{172.28} = 112.15
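Both routes to the F statistic can be verified directly (a sketch using the sums of squares and R2 from the output above):

```python
# Sketch: compute F(1, 538) both from R^2 and from ESS/RSS, using the
# figures reported in the Stata output above.
n, k = 540, 2
R2 = 0.1725
ESS = 19321.5589
RSS = 92688.6722

F_from_r2 = (R2 / (k - 1)) / ((1 - R2) / (n - k))
F_from_ss = (ESS / (k - 1)) / (RSS / (n - k))
print(round(F_from_r2, 2), round(F_from_ss, 2))  # → 112.15 112.15
```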
A.1 The model is linear in parameters and correctly specified.
– Example of a model that is not linear in parameters: Y = \beta_0 X^{\beta_1} + u (the slope parameter enters as an exponent)
‘Linear in parameters’ means that each term on the right side includes a parameter as a simple multiplicative factor and there is no built-in relationship among the parameters.
Assumptions for OLS 1:
If we tried to regress Y on X when X is constant, we would find that we would not be able to compute the regression coefficients. Both the numerator and the denominator of the expression for b1 would be equal to zero, and without b1 we could not obtain b0 either.
If X_i = \bar{X} for all i.
A.2 There is some variation in the regressor in the sample.
b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
We assume that the expected value of the disturbance term in any observation should be zero. Sometimes the disturbance term will be positive, sometimes negative, but it should not have a systematic tendency in either direction.
E(u_i) = 0 for all i
Actually, if an intercept is included in the regression equation, it is usually reasonable to assume that this condition is satisfied automatically. The role of the intercept is to pick up any systematic but constant tendency in Y not accounted for by the regressor(s).
A.3 The disturbance term has zero expectation
We assume that the disturbance term is homoscedastic, meaning that its value in each observation is drawn from a distribution with constant population variance.
Once we have generated the sample, the disturbance term will turn out to be greater in some observations, and smaller in others, but there should not be any reason for it to be more erratic in some observations than in others.
A.4 The disturbance term is homoscedastic
\sigma_{u_i}^2 = \sigma_u^2 \quad \text{for all } i

Since E(u_i) = 0, this is equivalent to E(u_i^2) = \sigma_u^2 for all i.
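A hypothetical simulation (not from the lecture) illustrates the assumption: homoscedastic disturbances look equally erratic across the sample, heteroscedastic ones do not.

```python
# Hypothetical simulation: compare disturbances drawn with constant variance
# to disturbances whose variance grows across the sample.
import random

random.seed(0)
n = 10000
u_homo = [random.gauss(0, 2.0) for _ in range(n)]                  # constant sigma
u_hetero = [random.gauss(0, 0.5 + 3.0 * i / n) for i in range(n)]  # sigma grows with i

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Split each series in half and compare variances: roughly equal under
# homoscedasticity, very different under heteroscedasticity.
ratio_homo = variance(u_homo[n // 2:]) / variance(u_homo[:n // 2])
ratio_hetero = variance(u_hetero[n // 2:]) / variance(u_hetero[:n // 2])
print(round(ratio_homo, 2), round(ratio_hetero, 2))
```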
OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
This implies that if we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.
Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroskedasticity.
Consequences of Using OLS in the Presence of Heteroskedasticity
An earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP:

EARNINGS = \beta_0 + \beta_1 S + \beta_2 EXP + u
• Note that the interpretation of the model does not depend on whether S and EXP are correlated or not
• However we do assume that the effects of S and EXP on EARNINGS are additive. The impact of a difference in S on EARNINGS is not affected by the value of EXP, or vice versa.
The expression for b0, the intercept, is a straightforward extension of the expression for it in simple regression analysis.
However, the expressions for the slope coefficients are considerably more complex than the slope-coefficient expression in simple regression analysis.
Calculating regression coefficients
It indicates that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.
Intercept: taken literally, an individual with no schooling and no experience would earn −$26.49 per hour. Obviously, this is impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.
Interpretation of a regression equation
. reg EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
\widehat{EARNINGS} = -26.49 + 2.68\,S + 0.56\,EXP
A.1 The model is linear in parameters and correctly specified.
A.2 There does not exist an exact linear relationship among the regressors in the sample.
A.3 The disturbance term has zero expectation
A.4 The disturbance term is homoscedastic
A.5 The values of the disturbance term have independent distributions
A.6 The disturbance term has a normal distribution
Properties of the multiple regression coefficients. Only A.2 is different.
The inclusion of the new term has had a dramatic effect on the coefficient of EXP.
The high correlation causes the standard error of EXP to be larger than it would have been if EXP and EXPSQ had been less highly correlated, warning us that the point estimate is unreliable.
Multicollinearity
. reg EARNINGS S EXP EXPSQ
------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------
. reg EARNINGS S EXP   (output identical to that shown above)
EARNINGS = \beta_0 + \beta_1 S + \beta_2 EXP + \beta_3 EXPSQ + u
. cor EXP EXPSQ
(obs=540)

        |      EXP    EXPSQ
--------+------------------
    EXP |   1.0000
  EXPSQ |   0.9812   1.0000
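Why EXP and EXPSQ collide is easy to see with any positive-valued variable; a sketch with hypothetical experience values (not the NLSY data):

```python
# Sketch: over a positive range, a variable and its square move almost in
# lockstep, so their sample correlation is close to 1 (hypothetical values).
import math

exp_years = list(range(1, 41))       # stand-in for years of work experience
exp_sq = [v ** 2 for v in exp_years]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = corr(exp_years, exp_sq)
print(round(r, 3))
```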
When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
– The standard errors and t tests nevertheless remain valid.
Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship means there will be a high correlation, but this is not always the case when there are more than two.
Note that multicollinearity does not cause the regression coefficients to be biased.
Reduce the variance of the disturbance term by including further relevant variables in the model
Increase the number of observations
Increase MSD(X2) (the variation in the explanatory variables)
– For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
Reduce the correlation between the explanatory variables:
– Combine the correlated variables
– Drop some of the correlated variables
– However, this last approach to multicollinearity is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that the only reason their coefficients are insignificant is that there is a problem of multicollinearity
\sigma_{b_2}^2 = \frac{\sigma_u^2}{\sum_i (X_{2i} - \bar{X}_2)^2} \times \frac{1}{1 - r_{X_2,X_3}^2} = \frac{\sigma_u^2}{n \, \mathrm{MSD}(X_2)} \times \frac{1}{1 - r_{X_2,X_3}^2}

where r_{X_2,X_3} is the sample correlation between X_2 and X_3, and \mathrm{MSD}(X_2) = \frac{1}{n}\sum_i (X_{2i} - \bar{X}_2)^2.
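Plugging the observed EXP/EXPSQ correlation into the 1/(1 − r²) factor shows the inflation directly (a sketch):

```python
# Sketch: the multiplier 1/(1 - r^2) applied to the EXP/EXPSQ correlation
# reported by Stata above (r = 0.9812).
r = 0.9812
inflation = 1 / (1 - r ** 2)
print(round(inflation, 1))  # → 26.8
```

The variance of the slope estimate is thus inflated roughly 27-fold relative to uncorrelated regressors, so its standard error is about five times larger.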
1. Use common sense and economic theory
2. Avoid Type III errors
– Producing the right answer to the wrong question is called a Type III error
– Place relevance before mathematical elegance
3. Know the context
– Do not perform ignorant statistical analyses
4. Inspect the data
– Place data cleanliness ahead of econometric godliness
5. Keep it sensibly simple
– Do not talk Greek without knowing the English translation
6. Look long and hard at your results
– Apply the laugh test
7. Beware the costs of data mining
– E.g. tailoring one’s specification to the data, resulting in a specification that is misleading
8. Be prepared to compromise
– Should a proxy be used? Can sample attrition be ignored?
9. Do not confuse significance with substance
10. Report a sensitivity analysis
Kennedy’s 10 commandments of applied econometrics