
CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY

Session 3: Basic techniques for innovation data

analysis. Part II: Introducing regression analysis

Taehyun Jung [email protected]

CIRCLE, Lund University

15.15-17.00 December 10 2012

For Survey of Quantitative Research, NORSI


Objectives of this session


Contents


The simple linear regression model (also called the bivariate linear regression model or 2-variable linear regression model):

y = β0 + β1x + u

– y = dependent variable, outcome variable, response variable, explained variable, predicted variable, regressand
– x = independent variable, explanatory variable, control variable, predictor variable, regressor, covariate
– u = error term, disturbance
– β0 = intercept parameter
– β1 = slope parameter


Bivariate Linear Regression Model


x and y are observable (we have data/observations on them); β0 and β1 are unobserved but estimable under certain conditions; u is unobservable.

– The model implies that u captures everything that determines y except for x
– Omission of the influence of innumerable chance events: systematic influence (specification error) and random influence (e.g. weather variations, i.e. the net influence of a large number of small and independent causes)
– Measurement error
– Human indeterminacy: inherent randomness in human behavior


Bivariate Linear Regression Model


Parameters cannot be calculated but only estimated, because we do not know the actual values of the disturbances inherent in a sample of data.

Estimator: the method of estimation. The formula or recipe by which the data are transformed into an actual estimate.
– Notation conventions for estimates: β̂0 and β̂1, or b0 and b1 (Ordinary Least Squares estimator)

What makes a "good" or "preferred" estimator?
– Computational cost
– Least squares
– Highest R2
– Unbiasedness
– Efficiency

"Preferred" or good estimators of β0 and β1


Minimize residuals
– Residuals (ei) = actual values (Yi) of the dependent variable − estimated values (Ŷi) of the dependent variable
– Minimizes the sum of squared residuals
– Always met by the Ordinary Least Squares (OLS) estimator

Least squares

[Figure: scatter of observations with a fitted line; the residuals are the vertical distances y − ŷ from each point to the line]


How much variation in the dependent variable is explained by variation in the independent variables?

The OLS estimator minimizes SSR and, therefore, automatically maximizes R2.
– Do not use R2 for determining the proper functional form or the appropriate independent variables

Highest R2


Sampling distribution centered over the true population parameter
– Expected value of the estimated parameter equals the true value of the parameter: E(β̂1) = β1
– β1 is the mean of the sampling distribution of β̂1
– Only one of the good properties that the sampling distribution of an estimator can have

The OLS criterion (so far) has nothing to do with the sampling distribution.
– We need further assumptions to make the OLS estimator unbiased
– How the disturbance is distributed is most important

Unbiasedness


Would you prefer to obtain your estimate by making a single random draw out of an unbiased sampling distribution with a small variance or out of an unbiased sampling distribution with a large variance?

The unbiased estimator with the smallest variance (the "best" unbiased estimator) is said to be efficient.

BLUE: Best linear unbiased estimator

Efficiency


Ordinary Least Squares


Each value of Y thus has a non-random component, β0 + β1Xi, and a random component, ui. The first observation has been decomposed into these two components.


The discrepancies between the actual and fitted values of Y are known as the residuals.
– Note that the values of the residuals are not the same as the values of the disturbance term


Deriving linear regression coefficients


RSS = Σ ei² = Σ (Yi − b0 − b1Xi)²

First-order conditions for minimizing RSS:

∂RSS/∂b0 = −2 Σ (Yi − b0 − b1Xi) = 0   →   b0 = Ȳ − b1X̄

∂RSS/∂b1 = −2 Σ Xi (Yi − b0 − b1Xi) = 0   →   b1 = (Σ XiYi − nX̄Ȳ) / (Σ Xi² − nX̄²)

Conditions for Minimizing RSS


Deriving linear regression coefficients (cont’d)


b0 = Ȳ − b1X̄

b1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²   =   (Σ XiYi − nX̄Ȳ) / (Σ Xi² − nX̄²)
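The closed-form expressions above are easy to verify numerically. The slides work in Stata; as a language-neutral illustration, here is a minimal Python sketch (the tiny data set is made up for demonstration, not from the lecture):

```python
# OLS estimates from the closed-form expressions derived above:
# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2),  b0 = Ybar - b1*Xbar.
# The data below are illustrative only.

def ols_bivariate(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
b0, b1 = ols_bivariate(x, y)
print(b0, b1)  # 0.0 1.9
```

Any statistics package will return the same two numbers for this data, since the OLS solution is unique whenever the Xi are not all equal.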


We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b0 and b1.


True model:  Y = β0 + β1X + u

Fitted line:  Ŷ = b0 + b1X

b0 = Ȳ − b1X̄        b1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

[Figure: observations (X1, Y1) … (Xn, Yn) plotted with the fitted line Ŷ = b0 + b1X; b0 is the intercept and b1 the slope]


Hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth.


[Scatter plot: hourly earnings ($) on the vertical axis against years of schooling (0–20) on the horizontal axis]


In this case there is only one variable, S, and its coefficient is 2.46. _cons, in Stata, refers to the constant. The estimate of the intercept is −13.93.

Interpretation of a regression equation


. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------
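The summary statistics Stata reports are internally consistent, and it is instructive to reproduce a couple of them by hand from the ANOVA table. A short Python check using the sums of squares in the output above:

```python
# Reproduce R-squared and Root MSE from the sums of squares that Stata
# reports in the ANOVA table of the EARNINGS-on-S regression.
model_ss    = 19321.5589   # explained sum of squares
residual_ss = 92688.6722   # residual sum of squares
total_ss    = 112010.231   # total sum of squares
n, k = 540, 2              # observations; parameters (intercept + slope)

r_squared = model_ss / total_ss
root_mse  = (residual_ss / (n - k)) ** 0.5   # sqrt of the residual mean square

print(round(r_squared, 4))  # 0.1725
print(round(root_mse, 3))   # 13.126
```

Both values match the figures in the right-hand column of the Stata header.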


Hourly earnings increase by $2.46 for each extra year of schooling.

Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work.
– Nonsense!
– The only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram
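Plugging values into the fitted line shows how slope and intercept combine in a prediction. A sketch using the rounded coefficients from the output (S = 12 is an arbitrary illustrative choice, not from the slides):

```python
# Predicted hourly earnings from the fitted bivariate model
# EARNINGS-hat = -13.93 + 2.46 * S (coefficients rounded from the Stata output).
b0, b1 = -13.93, 2.46

def predict_earnings(s):
    return b0 + b1 * s

print(round(predict_earnings(12), 2))  # 15.59 (12 years of schooling)
print(round(predict_earnings(0), 2))   # -13.93 (the nonsense intercept value)
```

The second line makes the slide's point concrete: the intercept is only a geometric anchor, not a meaningful prediction for S = 0.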

Interpretation of a regression equation


You can see that the t statistic for the coefficient of S is enormous. We would reject the null hypothesis that schooling does not affect earnings at the 1% significance level (critical value about 2.59).

Testing a hypothesis relating to a regression coefficient


. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------


The critical value of t at the 5% significance level with 538 degrees of freedom is 1.965, so the 95% confidence interval for the slope coefficient β1 is

2.455 − 0.232 × 1.965 ≤ β1 ≤ 2.455 + 0.232 × 1.965

1.999 ≤ β1 ≤ 2.911

Testing a hypothesis relating to a regression coefficient: Confidence intervals


. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------

b1 − s.e.(b1) × t_crit  ≤  β1  ≤  b1 + s.e.(b1) × t_crit
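The interval endpoints can be reproduced directly from the coefficient and its standard error. A Python sketch using the slide's critical value of 1.965 (Stata uses the exact t value internally, so its reported bounds differ in the fourth decimal):

```python
# 95% confidence interval for the slope: b1 +/- s.e.(b1) * t_crit.
b1, se_b1 = 2.455321, 0.2318512   # from the Stata output
t_crit = 1.965                    # t(538), 5% two-sided (rounded, as on the slide)

lower = b1 - se_b1 * t_crit
upper = b1 + se_b1 * t_crit
print(round(lower, 3), round(upper, 3))  # close to Stata's [1.999876, 2.910765]
```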


The null hypothesis that we are going to test is that the model has no explanatory power

k is the number of parameters in the regression equation, which at present is just 2.

n – k is, as with the t statistic, the number of degrees of freedom

F is a monotonically increasing function of R2
– Why do we perform the test indirectly, through F, instead of directly through R2? After all, it would be easy to compute the critical values of R2 from those for F

Hypotheses concerning goodness of fit are tested via the F statistic


Y = β0 + β1X + u        H0: β1 = 0,   H1: β1 ≠ 0

R² = ESS/TSS = Σ (Ŷi − Ȳ)² / Σ (Yi − Ȳ)²

TSS = ESS + RSS:   Σ (Yi − Ȳ)² = Σ (Ŷi − Ȳ)² + Σ ei²

F(k − 1, n − k) = [ESS/(k − 1)] / [RSS/(n − k)] = [R²/(k − 1)] / [(1 − R²)/(n − k)]


For simple regression analysis, the F statistic is the square of the t statistic.


F(1, n − 2) = ESS / [RSS/(n − 2)],   with s_u² = RSS/(n − 2) = Σ ei²/(n − 2)

ESS = Σ (Ŷi − Ȳ)² = Σ (b0 + b1Xi − b0 − b1X̄)² = b1² Σ (Xi − X̄)²

s.e.(b1)² = s_u² / Σ (Xi − X̄)²

F = b1² Σ (Xi − X̄)² / s_u² = [b1 / s.e.(b1)]² = t²


Calculation of F statistic


. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------

F(1, n − 2) = [R²/1] / [(1 − R²)/(n − 2)] = 0.1725 / [(1 − 0.1725)/(540 − 2)] = 112.15

F(1, n − 2) = ESS / [RSS/(n − 2)] = 19322 / [92689/(540 − 2)] = 19322/172.28 = 112.15
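Both routes to the F statistic, plus the F = t² identity from the previous slide, can be checked in a few lines of Python using the numbers in the Stata output:

```python
# F statistic computed two ways (from R-squared and from the sums of
# squares); in simple regression it also equals the squared t statistic.
r2, n = 0.1725, 540
ess, rss = 19321.5589, 92688.6722
t_slope = 10.59                       # t statistic on S from the output

f_from_r2 = (r2 / 1) / ((1 - r2) / (n - 2))
f_from_ss = ess / (rss / (n - 2))

print(round(f_from_r2, 2))     # 112.15
print(round(f_from_ss, 2))     # 112.15
print(round(t_slope ** 2, 2))  # 112.15
```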


OLS Assumptions


A.1 The model is linear in parameters and correctly specified.

–Examples of models that are not linear in parameters:

‘Linear in parameters’ means that each term on the right side includes a β as a simple factor and there is no built-in relationship among the βs.

Assumptions for OLS 1:


If we tried to regress Y on X when X is constant, we would find that we would not be able to compute the regression coefficients: both the numerator and the denominator of the expression for b1 would be equal to zero. We would not be able to obtain b0 either.

If Xi = X̄ for all i.

A.2 There is some variation in the regressor in the sample.


b1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²


We assume that the expected value of the disturbance term in any observation should be zero. Sometimes the disturbance term will be positive, sometimes negative, but it should not have a systematic tendency in either direction.

E(ui) = 0 for all i

Actually, if an intercept is included in the regression equation, it is usually reasonable to assume that this condition is satisfied automatically. The role of the intercept is to pick up any systematic but constant tendency in Y not accounted for by the regressor(s).

A.3 The disturbance term has zero expectation


We assume that the disturbance term is homoscedastic, meaning that its value in each observation is drawn from a distribution with constant population variance.

Once we have generated the sample, the disturbance term will turn out to be greater in some observations, and smaller in others, but there should not be any reason for it to be more erratic in some observations than in others.

A.4 The disturbance term is homoscedastic


σ²ui = σu² for all i

Since E(ui) = 0:   Var(ui) = E(ui²) − [E(ui)]² = E(ui²) = σu²


OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.

This implies that if we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.

Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroskedasticity.

Consequences of Using OLS in the Presence of Heteroskedasticity


Multiple Regression


An earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP.


• Note that the interpretation of the model does not depend on whether S and EXP are correlated or not

• However we do assume that the effects of S and EXP on EARNINGS are additive. The impact of a difference in S on EARNINGS is not affected by the value of EXP, or vice versa.


The expression for the intercept, b0, is a straightforward extension of the expression for it in simple regression analysis.

However, the expressions for the slope coefficients are considerably more complex than that for the slope coefficient in simple regression analysis.

Calculating regression coefficients


It indicates that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.

Intercept: Obviously, this is impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.

Interpretation of a regression equation


. reg EARNINGS S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------

Fitted equation:  ÊARNINGS = −26.49 + 2.68 S + 0.56 EXP
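With the rounded coefficients, the fitted plane gives predictions like the following (S = 12 and EXP = 5 are chosen purely for illustration and do not come from the slides):

```python
# Predicted hourly earnings from the two-regressor earnings function
# EARNINGS-hat = -26.49 + 2.68*S + 0.56*EXP (rounded slide coefficients).
def predict_earnings(s, exp_years):
    return -26.49 + 2.68 * s + 0.56 * exp_years

print(round(predict_earnings(12, 5), 2))  # 8.47
```

Because the effects are assumed additive, adding a year of schooling always shifts the prediction by $2.68 regardless of the level of EXP, and vice versa.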


A.1 The model is linear in parameters and correctly specified.
A.2 There does not exist an exact linear relationship among the regressors in the sample.
A.3 The disturbance term has zero expectation.
A.4 The disturbance term is homoscedastic.
A.5 The values of the disturbance term have independent distributions.
A.6 The disturbance term has a normal distribution.

Properties of the multiple regression coefficients. Only A.2 is different.


The inclusion of the new term has had a dramatic effect on the coefficient of EXP.

The high correlation causes the standard error of EXP to be larger than it would have been if EXP and EXPSQ had been less highly correlated, warning us that the point estimate is unreliable.

Multicollinearity


. reg EARNINGS S EXP EXPSQ

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------

. reg EARNINGS S EXP

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------

Model:  EARNINGS = β0 + β1 S + β2 EXP + β3 EXPSQ + u

. cor EXP EXPSQ
(obs=540)

        |    EXP  EXPSQ
--------+---------------
    EXP | 1.0000
  EXPSQ | 0.9812 1.0000
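The near-collinearity of EXP and EXPSQ is easy to reproduce: for any experience variable spanning a modest range of non-negative values, the correlation of x with x² is close to 1. A Python sketch with a made-up experience range of 0–20 years (illustrative only, not the NLSY data):

```python
# Correlation between x and x^2 for x = 0..20, and the resulting
# variance inflation factor 1 / (1 - r^2). Illustrative data only.
def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

exp_years = list(range(21))          # 0, 1, ..., 20 years of experience
expsq = [e ** 2 for e in exp_years]  # its square

r = corr(exp_years, expsq)
vif = 1 / (1 - r ** 2)               # factor by which the variance is inflated
print(round(r, 3))    # about 0.965: nearly collinear
print(round(vif, 1))  # about 14.7
```

With the 0.9812 correlation in the actual data, the inflation factor is far larger still, which is why the EXP coefficient's standard error balloons once EXPSQ enters.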


When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.– the standard errors and t tests remain valid.

Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only 2, an approximate linear relationship means there will be a high correlation, but this is not always the case when there are more than 2.

Note that multicollinearity does not cause the regression coefficients to be biased.


Reduce the variance of the disturbance term by including further relevant variables in the model.

Increase the number of observations.

Increase MSD(X2) (the variation in the explanatory variables).
– For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.

Reduce the correlation between the explanatory variables:
– Combine the correlated variables
– Drop some of the correlated variables
– However, this approach to multicollinearity is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that the only reason their coefficients are insignificant is because there is a problem of multicollinearity.


Variance of the slope estimator b2 with two correlated regressors X2 and X3:

σ²b2 = [σu² / (n · MSD(X2))] × [1 / (1 − r²X2,X3)]


Use common sense and economic theory.

Avoid Type III errors
– Producing the right answer to the wrong question is called a type III error
– Place relevance before mathematical elegance

Know the context
– Do not perform ignorant statistical analyses

Inspect the data
– Place data cleanliness ahead of econometric godliness

Keep it sensibly simple
– Do not talk Greek without knowing the English translation

Look long and hard at your results
– Apply the laugh test

Beware the costs of data mining
– E.g. tailoring one's specification to the data, resulting in a specification that is misleading

Be prepared to compromise
– Should a proxy be used? Can sample attrition be ignored?

Do not confuse significance with substance

Report a sensitivity analysis

Kennedy’s 10 commandments of applied econometrics
