Lecture 26 • Omitted Variable Bias formula revisited • Specially constructed variables – Interaction variables – Polynomial terms for curvature – Dummy variables for categorical variables

Post on 21-Dec-2015


Page 1: Lecture 26 Omitted Variable Bias formula revisited Specially constructed variables –Interaction variables –Polynomial terms for curvature –Dummy variables

Lecture 26

• Omitted Variable Bias formula revisited

• Specially constructed variables
  – Interaction variables
  – Polynomial terms for curvature
  – Dummy variables for categorical variables


Omitted Variable Bias Formula Revisited

• From paper “Re-examining Criminal Behavior: The Importance of Omitted Variable Bias,” by David Mustard, Review of Economics and Statistics, 2003.

• To what extent do changes in the arrest rate alter the willingness of individuals to engage in criminal activity?

• Becker’s economic theory of crime: Those involved in illegal activities respond to incentives in much the same way as those who engage in legal activities respond.


Regressions for economic theory of crime

• Goal: Figure out the causal effect of an increase in the arrest rate on the number of crimes committed.

• When Y = log number of crimes committed in a city is regressed on X = log arrest rate in a city, the coefficient on X is -.0119 for the murder rate, -.0020 for the assault rate and -.0117 for the burglary rate (the coefficient of -.0117 in the log-log regression is interpreted here as: a 1% increase in the arrest rate for burglaries is associated with a 1.17% decrease in the number of burglaries).

• Simple regression omits the confounding variable of the conviction rate. What is the direction of the omitted variables bias?


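Omitted Variables Bias Formula

• $x_1$ = the explanatory variable for which we want to find its causal effect on $y$. $x_2, \ldots, x_p$ = confounding variables we control for by including them in the regression. $x_{p+1}$ = omitted confounding variable.

$$E\{y \mid x_1, \ldots, x_p, x_{p+1}\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \beta_{p+1} x_{p+1}$$

$$E\{y \mid x_1, \ldots, x_p\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_p^* x_p$$

$$E\{x_{p+1} \mid x_1, \ldots, x_p\} = \delta_0 + \delta_1 x_1 + \cdots + \delta_p x_p$$

• Then $\beta_1^* = \beta_1 + \beta_{p+1}\delta_1$, or equivalently $\beta_1^* - \beta_1 = \beta_{p+1}\delta_1$.

• Formula tells us about direction and magnitude of bias from omitting a variable in estimating a causal effect.

• Formula also applies to least squares estimates, i.e., $\hat{\beta}_1^* = \hat{\beta}_1 + \hat{\beta}_{p+1}\hat{\delta}_1$.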
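The claim that the formula holds for least squares estimates can be checked numerically. The following is a minimal sketch on simulated data, assuming NumPy is available. The variable names, sample size and true coefficients are made up for illustration and have nothing to do with the crime data.

```python
import numpy as np

# Simulated data (hypothetical): x1 is the variable of interest,
# x2 an included confounder, x3 the omitted confounder.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = -0.8 * x1 + 0.3 * x2 + rng.normal(size=n)
y = 1.0 - 2.0 * x1 + 0.5 * x2 - 1.5 * x3 + rng.normal(size=n)


def ols(columns, response):
    """Least squares coefficients; element 0 is the intercept."""
    design = np.column_stack([np.ones(len(response))] + columns)
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


long_fit = ols([x1, x2, x3], y)   # regression including the omitted variable
short_fit = ols([x1, x2], y)      # regression that omits x3
aux_fit = ols([x1, x2], x3)       # omitted variable regressed on included ones

lhs = short_fit[1]                            # estimate of beta_1^*
rhs = long_fit[1] + long_fit[3] * aux_fit[1]  # beta_1 + beta_{p+1} * delta_1
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error, because the identity is an algebraic property of least squares rather than a large-sample approximation.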


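Application of OVB formula

• $\beta_1^* - \beta_1 = \beta_{p+1}\delta_1$

• y = crime rate

$$E\{y \mid \text{arrest\_rate}, x_2, \ldots, x_p, \text{conviction\_rate}\} = \beta_0 + \beta_1\,\text{arrest\_rate} + \beta_2 x_2 + \cdots + \beta_p x_p + \beta_{p+1}\,\text{conviction\_rate}$$

$$E\{y \mid \text{arrest\_rate}, x_2, \ldots, x_p\} = \beta_0^* + \beta_1^*\,\text{arrest\_rate} + \beta_2^* x_2 + \cdots + \beta_p^* x_p$$

$$E\{\text{conviction\_rate} \mid \text{arrest\_rate}, x_2, \ldots, x_p\} = \delta_0 + \delta_1\,\text{arrest\_rate} + \delta_2 x_2 + \cdots + \delta_p x_p$$

• Here $\beta_{p+1}$ is probably negative. Increase in conviction rate should reduce crimes, holding other variables fixed.

• Mustard presents evidence that $\delta_1$ is negative. As more people are arrested for a given offense level, the amount of evidence against each arrestee decreases.

• If $\beta_{p+1}$ and $\delta_1$ are both negative, then $\beta_1^* - \beta_1 > 0$, i.e., $\beta_1 < \beta_1^*$. The estimate that a 1% increase in the arrest rate would reduce the burglary rate by 1.17% is an underestimate of the impact of an increase in the arrest rate on reducing the burglary rate (i.e., the coefficient in the log-log regression is < -.0117, so the reduction is greater than 1.17%).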


Specially Constructed Explanatory Variables

• Interaction variables

• Squared and higher polynomial terms for curvature

• Dummy variables for categorical variables.


Interaction

• Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2).

• There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1.

• To incorporate interaction in the multiple regression model, we add the explanatory variable $X_1 * X_2$. There is evidence of an interaction if the coefficient on $X_1 * X_2$ is significant (t-test has p-value < .05).


An experiment to study how noise affects the performance of children tested second grade hyperactive children and a control group of second graders who were not hyperactive. One of the tasks involved solving math problems. The children solved problems under both high-noise and low-noise conditions. Here are the mean scores:

[Figure: Mean Mathematics Score (vertical axis, 0 to 250) by type of child (Control, Hyperactive), shown separately for High Noise and Low Noise.]

Let Y = Mean Mathematics Score, $X_1$ = Type of Child (0 = Control, 1 = Hyperactive), $X_2$ = Type of Noise (0 = Low Noise, 1 = High Noise). There is an interaction between type of child and type of noise: the impact of increasing noise from low to high depends on the type of child.
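With two 0/1 explanatory variables, the coefficient on the product term is the difference between the two noise effects. The sketch below illustrates this with hypothetical cell means. The actual values from the experiment appear only in the figure and are not reproduced here, so every number in the code is an assumption.

```python
# Hypothetical cell means, keyed by (X1, X2); NOT the experiment's values.
cell_means = {
    (0, 0): 200.0,  # control, low noise
    (0, 1): 220.0,  # control, high noise
    (1, 0): 180.0,  # hyperactive, low noise
    (1, 1): 140.0,  # hyperactive, high noise
}

# Coefficients of mean(Y) = b0 + b1*X1 + b2*X2 + b3*X1*X2, which
# reproduces the four cell means exactly.
b0 = cell_means[(0, 0)]
b1 = cell_means[(1, 0)] - b0
b2 = cell_means[(0, 1)] - b0
b3 = cell_means[(1, 1)] - b0 - b1 - b2


def fitted_mean(x1, x2):
    """Mean score under the interaction model."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Effect of moving from low to high noise, for each type of child.
noise_effect_control = fitted_mean(0, 1) - fitted_mean(0, 0)      # equals b2
noise_effect_hyperactive = fitted_mean(1, 1) - fitted_mean(1, 0)  # equals b2 + b3

print(noise_effect_control, noise_effect_hyperactive, b3)
```

When `b3` is zero the two noise effects coincide, which is the no-interaction case.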


Interaction Model for Pollution Data

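$$E\{Y \mid \text{precip}, \text{educ}, \text{nonwhite}, \log HC\} = \beta_0 + \beta_1\,\text{precip} + \beta_2\,\text{educ} + \beta_3\,\text{nonwhite} + \beta_4 \log HC + \beta_5 \log HC * \text{precip}$$

$$E\{Y \mid \text{precip} = x_1, \text{educ} = x_2, \text{nonwhite} = x_3, \log HC = x_4 + 1\} - E\{Y \mid \text{precip} = x_1, \text{educ} = x_2, \text{nonwhite} = x_3, \log HC = x_4\} = \beta_4 + \beta_5 * x_1$$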


Response MORTALITY

Summary of Fit
RSquare                   0.703313
RSquare Adj               0.675842
Root Mean Square Error    35.41441

Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        1118.6838    88.82142    12.59     <.0001
PRECIP           -1.728878    1.202121    -1.44     0.1562
EDUC             -18.72354    6.393312    -2.93     0.0050
NONWHITE         2.3882226    0.633573    3.77      0.0004
Log HC           -21.47471    10.85506    -1.98     0.0530
Log HC*Precip    1.2550598    0.334228    3.76      0.0004

There is strong evidence (p-value =.0004) that there is an interaction between hydrocarbons and precipitation. The impact of an increase in hydrocarbons on mean mortality, holding fixed education, nonwhite and precipitation is greater for higher precipitation levels.
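To see how the estimated effect of hydrocarbons varies with precipitation, the fitted coefficients on Log HC and Log HC*Precip can be plugged into the expression for the change in mean mortality per one-unit increase in log HC. This is a small sketch, and the precipitation levels used are arbitrary illustration values, not taken from the data set.

```python
# Estimates copied from the Parameter Estimates table above.
b_log_hc = -21.47471          # coefficient on Log HC
b_log_hc_precip = 1.2550598   # coefficient on Log HC*Precip


def log_hc_effect(precip):
    """Estimated change in mean mortality for a one-unit increase in
    log HC, holding precip, educ and nonwhite fixed."""
    return b_log_hc + b_log_hc_precip * precip


# Arbitrary precipitation levels chosen only for illustration.
for precip in (10, 20, 40):
    print(precip, round(log_hc_effect(precip), 3))

# Precipitation level at which the estimated effect changes sign.
sign_change_precip = -b_log_hc / b_log_hc_precip
print(round(sign_change_precip, 2))
```

Because the interaction coefficient is positive, the estimated effect increases with precipitation, in line with the interpretation given above.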


Polynomials and Interactions Example

• An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income and the mean age of children in the area. The data are in fastfoodchain.jmp.


[Figure: Scatterplot Matrix of Revenue (900 to 1300), Income (20 to 35) and Age (5.0 to 15.0).]

Relationship between revenue and income and between revenue and age is quadratic. Members of relatively poor or relatively affluent households are less likely to eat at this chain's restaurants, since the restaurants attract mostly middle-income customers. The quadratic relationship cannot be easily captured by a transformation. Curvature between y and x falls into two quadrants of the circle in Tukey's Bulging Rule.


Polynomial Terms for Curvature

• To model a curved relationship between y and x, we can add squared (and cubic or higher order) terms as explanatory variables.

$$E\{Y \mid X\} = \beta_0 + \beta_1 X + \beta_2 X^2$$

• Fit as a multiple regression with two explanatory variables $X$ and $X^2$.

• To draw a plot of the estimated mean of Y|X, after Fit Model, click red triangle next to Response, Save Columns, Predicted Values. Then click Graph, Overlay Plot and put Predicted Revenue and Revenue into Y, Columns and Income into X. Left click on the box next to Predicted Revenue in the legend and select Connect Points.


Response Revenue

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -1454.521    279.9928    -5.19     <.0001
Income       209.81481    24.0841     8.71      <.0001
Income sq    -4.170503    0.504009    -8.27     <.0001

Strong evidence that using income and income squared terms provides better predictions than just using the income term (p-value < .0001).

[Figure: Overlay Plot of Revenue and Predicted Revenue (Y, 700 to 1400) against Income (10 to 40).]

[Figure: Bivariate Fit of Residual Revenue By Income; Residual Revenue (-150 to 150) against Income (15 to 35).]


Interpreting Coefficients and Tests for Polynomial Model

• Coefficients are not directly interpretable. Change in the mean of Y that is associated with a one unit increase in X depends on X:

$$E\{Y \mid X + 1\} - E\{Y \mid X\} = [\beta_0 + \beta_1 (X + 1) + \beta_2 (X + 1)^2] - [\beta_0 + \beta_1 X + \beta_2 X^2] = \beta_1 + \beta_2 [(2 * X) + 1]$$

• To test whether the multiple regression model with $X$ and $X^2$ as predictors provides better predictions than the multiple regression model with just $X$, use the p-value of the t-test on the $X^2$ coefficient (null hypothesis is that $X^2$ has a zero coefficient).

• Plot residuals vs. X to determine whether the quadratic model is appropriate. If there is still a pattern in the mean, can try a cubic model with $X$, $X^2$ and $X^3$.
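As an illustration of why the quadratic coefficients are not directly interpretable, the sketch below evaluates the change in estimated mean revenue per one-unit increase in income at several income levels, using the estimates from the Response Revenue output for the income-only quadratic fit. The income levels are arbitrary illustration values, and the turning point is a simple consequence of the fitted parabola rather than something reported in the lecture.

```python
# Estimates from the quadratic fit of Revenue on Income and Income sq.
b0, b1, b2 = -1454.521, 209.81481, -4.170503


def mean_revenue(income):
    """Estimated mean revenue under the quadratic model."""
    return b0 + b1 * income + b2 * income ** 2


def change_for_unit_increase(income):
    """E{Y | X + 1} - E{Y | X} = b1 + b2 * (2 * X + 1)."""
    return b1 + b2 * (2 * income + 1)


# Arbitrary income levels chosen only for illustration.
for income in (15, 25, 35):
    print(income, round(change_for_unit_increase(income), 2))

# Income at which the fitted parabola peaks (b2 is negative).
turning_point = -b1 / (2 * b2)
print(round(turning_point, 2))
```

The sign of the change flips as income passes the turning point, which is consistent with the restaurants attracting mostly middle-income customers.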


Regression Model for Fast Food Chain Data

• Interactions and polynomial terms can be combined in a multiple regression model

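• For fast food chain data, we consider the model

$$E\{\text{revenue} \mid \text{age}, \text{income}\} = \beta_0 + \beta_1 * \text{income} + \beta_2 * \text{age} + \beta_3 * \text{income}^2 + \beta_4 * \text{age}^2 + \beta_5 * \text{income} * \text{age}$$

• This is called a second-order model because it includes all squares and interactions of original explanatory variables.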


fastfoodchain.jmp results

• Strong evidence of a quadratic relationship between revenue and age, revenue and income. Moderate evidence of an interaction between age and income.

Parameter Estimates
Term               Estimate     Std Error   t Ratio   Prob>|t|
Intercept          -1133.981    320.0193    -3.54     0.0022
Income             173.20317    28.20399    6.14      <.0001
Age                23.549963    32.23447    0.73      0.4739
Income sq          -3.726129    0.542156    -6.87     <.0001
Age sq             -3.868707    1.179054    -3.28     0.0039
(Income)(Age)      1.9672682    0.944082    2.08      0.0509
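The fitted second-order model can be used to score a candidate location. The following sketch plugs the parameter estimates above into the model. The candidate's income and age values are hypothetical, and the function name is introduced here only for illustration.

```python
# Parameter estimates from the fastfoodchain.jmp second-order fit.
coef = {
    "intercept": -1133.981,
    "income": 173.20317,
    "age": 23.549963,
    "income_sq": -3.726129,
    "age_sq": -3.868707,
    "income_age": 1.9672682,
}


def predicted_revenue(income, age):
    """Estimated mean revenue for a location with the given mean
    household income and mean age of children."""
    return (coef["intercept"]
            + coef["income"] * income
            + coef["age"] * age
            + coef["income_sq"] * income ** 2
            + coef["age_sq"] * age ** 2
            + coef["income_age"] * income * age)


# Hypothetical candidate location: income 25, age 10.
candidate = predicted_revenue(income=25, age=10)
print(round(candidate, 1))
```

Because of the interaction term, the income level that maximizes predicted revenue shifts with the age of children in the area, so candidate locations are best compared by evaluating the full expression rather than by reading off single coefficients.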


Categorical variables

• Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County).

• Categorical variables can be incorporated into regression through dummy variables.

• They can also be directly incorporated, as is done in JMP.
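For a categorical variable with more than two levels, such as color (blue/green/red), one common construction uses one dummy variable per level except a reference level. The sketch below is a plain-Python illustration under that assumption. The helper name `dummy_code`, the choice of reference level and the sample values are all hypothetical, and this is not a description of how JMP codes categorical variables internally.

```python
def dummy_code(values, reference):
    """Return one 0/1 column per level other than the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"D_{level}": [1 if value == level else 0 for value in values]
        for level in levels
    }


# Hypothetical observations of a three-level categorical variable.
colors = ["blue", "green", "red", "green", "blue"]
dummies = dummy_code(colors, reference="blue")

for name, column in dummies.items():
    print(name, column)
```

With an intercept in the model, each dummy coefficient is then the difference in mean response between that level and the reference level, holding the other explanatory variables fixed.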


Fedex data set

• Before it was a well-known company, FedEx undertook a campaign to promote use of its Courier packages (now called FedEx Paks). Sales representatives visited customers and worked to increase their use of the packages. Some of the customers were already aware of the Courier packaging before the promotion began, but it was unknown to others.

• Response variable: Number of Courier package shipments per month. Explanatory variables: (1) Number of contact hours the customer had with a sales representative (hours of effort), (2) Categorical variable indicating whether or not the customer was already aware of the product before the promotion began (aware).

• Question: Was this promotion more effective for customers who were already aware of the product, or was it more effective for those who had been unaware?


Two sample analysis

[Figure: Mailings (-10 to 90) by Aware (NO, YES).]

Oneway Anova

Summary of Fit
Rsquare                       0.0703
Root Mean Square Error        22.094
Mean of Response              34.616
Observations (or Sum Wgts)    125

t Test
NO-YES, assuming equal variances
Difference      -12.307    t Ratio      -3.05084
Std Err Dif     4.034      DF           123
Upper CL Dif    -4.322     Prob > |t|   0.0028
Lower CL Dif    -20.291    Prob > t     0.9986

Problem with two sample analysis: Hours of effort may be a confounding variable.


Dummy Variables

Define the dummy variable $D_i = 1$ if customer $i$ is "aware", $D_i = 0$ if customer $i$ is "not aware".

Multiple regression model:

$$E\{y \mid \text{hours}, D\} = \beta_0 + \beta_1\,\text{hours} + \beta_2 D$$

Interpretation: $\beta_2$ is the difference in the mean mailings for two groups of customers who have the same hours of contact but one group is aware and one group is not aware.

Note that $\beta_2 = \beta_2 * 1 - \beta_2 * 0$.


Parallel lines regression model

Response Mailings

Summary of Fit
RSquare                       0.753889
RSquare Adj                   0.749854
Root Mean Square Error        11.41459
Mean of Response              34.616
Observations (or Sum Wgts)    125

Analysis of Variance
Source      DF     Sum of Squares   Mean Square   F Ratio     Prob > F
Model       2      48691.845        24345.9       186.8554    <.0001
Error       122    15895.723        130.3
C. Total    124    64587.568

Parameter Estimates
Term               Estimate     Std Error   t Ratio   Prob>|t|
Intercept          -1.094183    2.129472    -0.51     0.6083
Hours of Effort    15.621837    0.848664    18.41     <.0001
D                  10.488285    2.086349    5.03      <.0001

When hours of effort is controlled for, being aware has a substantial impact, increasing the mean mailings by an estimated 10.5, with the p-value being <.0001 for the null hypothesis that the aware and not aware groups have the same mean for a fixed hours of effort. Compare to the estimated difference of 12.3 (p-value .0028) from the two-sample analysis. Hours of effort was a confounding variable in the two-sample analysis.


• Parallel Regression Lines Model:

$$E\{y \mid \text{hours}, D\} = \beta_0 + \beta_1\,\text{hours} + \beta_2 D$$

$$E\{y \mid \text{hours}, D = 0\} = \beta_0 + \beta_1\,\text{hours}$$

$$E\{y \mid \text{hours}, D = 1\} = \beta_0 + \beta_1\,\text{hours} + \beta_2 = (\beta_0 + \beta_2) + \beta_1\,\text{hours}$$

[Figure: Bivariate Fit of Predicted Mailings By Hours of Effort; Predicted Mailings (0 to 70) against Hours of Effort (0 to 4.5), with separate linear fits for Aware=="NO" and Aware=="YES".]

Linear Fit Aware=="NO": Predicted Mailings = -1.094183 + 15.621837 Hours of Effort

Linear Fit Aware=="YES": Predicted Mailings = 9.3941021 + 15.621837 Hours of Effort

[Figure: Response Mailings, Whole Model Regression Plot; Mailings (-10 to 90) against Hours of Effort (0 to 4.5), with lines for NO and YES.]
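The two fitted lines follow directly from the parameter estimates of the parallel lines model: the "YES" intercept is the "NO" intercept plus the coefficient on D, and the slope is shared. A short sketch, using the estimates from the Response Mailings output (the function name and the hours values are illustrative only):

```python
# Estimates from the parallel lines regression of Mailings on
# Hours of Effort and the dummy variable D (1 = aware, 0 = not aware).
b0, b1, b2 = -1.094183, 15.621837, 10.488285


def predicted_mailings(hours, aware):
    """Estimated mean mailings for given hours of effort and awareness."""
    d = 1 if aware else 0
    return b0 + b1 * hours + b2 * d


intercept_not_aware = b0      # line for Aware == "NO"
intercept_aware = b0 + b2     # line for Aware == "YES"
print(round(intercept_not_aware, 6), round(intercept_aware, 6))

# The gap between the two groups is the same at every value of hours.
for hours in (1, 2.5, 4):
    gap = predicted_mailings(hours, True) - predicted_mailings(hours, False)
    print(hours, round(gap, 6))
```

A constant gap is what makes the lines parallel. Allowing the gap to change with hours of effort would require adding an hours * D interaction term to the model.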