5. Endogenous right hand side variables
5.1 The problem of endogeneity bias
5.2 The basic idea underlying the use of instrumental variables
5.3 When the endogenous right hand side variable is continuous
5.4 When the endogenous right hand side variable is binary
5.1 Endogeneity bias
Consider a simple OLS regression:
Y_{it} = a_0 + a_1 X_{1it} + u_{it}
Recall that our estimate of a_1 will be unbiased only if we can assume that X_{1it} is uncorrelated with the error term (u_{it}).
We have discussed two ways to help ensure that this assumption is true.
First, we should control for any observable variables that affect Y_{it} and which are correlated with X_{1it}. For example, we should control for X_{2it} if X_{2it} affects Y_{it} and X_{2it} is correlated with X_{1it} (see Chapter 2):
Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_{it}
Second, if we have panel data, we can control for any unobservable firm-specific characteristics (u_i) that affect Y_{it} and which are correlated with the X variables.
From Chapter 4:
Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_i + e_{it}
We control for the correlations between u_i and the X variables by estimating fixed effects models.
Our estimates of a_1 and a_2 are unbiased if the X variables are uncorrelated with e_{it}. In this case, we say that the X variables are “exogenous”.
Unfortunately, multiple regression and fixed effects models do not always ensure that the X variables are uncorrelated with the error term:
• If we do not observe all the variables that affect Y and that are correlated with X, multiple regression will not solve the problem.
• If we do not have panel data, the fixed effects models cannot be estimated.
• Even if we have panel data, the Y and X variables may display little variation over time, in which case the fixed effects models can be unreliable (Zhou, 2001).
• Even if we have panel data and the Y and X variables display sufficient variation over time, the unobservable variables that are correlated with X may not be constant over time, in which case the fixed effects models will not solve the problem.
A variable is more likely to be correlated with the error term if it is “endogenous”.
“Endogenous” means that the variable is determined within the economic model that we are trying to estimate.
For example, suppose that Y_{2it} is an endogenous explanatory variable:
Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + u_{it}   (1)
Y_{2it} = b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}   (2)
Equations (1) and (2) have a “triangular” structure since Y_{2it} is assumed to affect Y_{1it}, but Y_{1it} is assumed not to affect Y_{2it}.
Given this triangular structure, the OLS estimate of a_1 in equation (1) is unbiased only if v_{it} is uncorrelated with u_{it}.
If v_{it} is correlated with u_{it}, then Y_{2it} is correlated with u_{it}, which means that the OLS estimate of a_1 would be biased.
To avoid this bias, we must estimate equation (1) using “instrumental variables” (IV) regression rather than OLS.
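The bias described above can be illustrated with a small simulation. This is only a sketch under assumed parameter values (true a_1 = 1, with u and v sharing a common component c); the variable names y1, y2, x, z and c are hypothetical and are not part of the course data.

```stata
clear
set seed 12345
set obs 5000

* Exogenous variables X and Z
generate x = rnormal()
generate z = rnormal()

* Errors u and v are correlated because both contain the component c
generate c = rnormal()
generate u = c + rnormal()
generate v = c + rnormal()

* Structural equations: eq. (2) first, then eq. (1) with true a1 = 1
generate y2 = 1 + 0.5*x + 1*z + v
generate y1 = 2 + 1*y2 + 0.5*x + u

* OLS ignores the correlation between y2 and u
regress y1 y2 x

* IV regression uses z as the excluded instrument for y2
ivregress 2sls y1 x (y2 = z)
```

Under these assumed values the OLS coefficient on y2 has a probability limit of 1 + 1/3, whereas the ivregress estimate should be close to the true value of 1.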
Equations (1) and (2) are called “structural” equations because they describe the economic relationship between Y_{1it} and Y_{2it}.
We can obtain a “reduced-form” equation by substituting eq. (2) into eq. (1):
Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it}
In this “reduced-form” equation, all the explanatory variables (X_{it} and Z_{it}) are exogenous.
The basic idea underlying IV regression is to remove v_{it} from the Y_{1it} model so that our estimate of a_1 is unbiased.
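Collecting terms in the reduced-form equation shows where v_{it} enters. This is only an algebraic rearrangement of eqs. (1) and (2):

```latex
\begin{align*}
Y_{1it} &= a_0 + a_1\,(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \\
        &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\,X_{it} + a_1 b_2\,Z_{it} + (a_1 v_{it} + u_{it})
\end{align*}
```

The composite error term is a_1 v_{it} + u_{it}, and Z_{it} affects Y_{1it} only through Y_{2it}, with coefficient a_1 b_2.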
5.2 The basic idea underlying the use of instrumental variables
Note that v_{it} is removed from the Y_{1it} model if we use the predicted rather than the actual values of Y_{2it} on the right hand side.
We predict Y_{2it} using all the exogenous variables in the system (in our example, we use the two exogenous variables X_{it} and Z_{it}).
We then use the predicted rather than the actual values of Y_{2it} when estimating the Y_{1it} model.
The a_1 estimate is biased in eq. (3) but it is unbiased in eq. (4) because the v_{it} term has been removed.
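One way to write the two equations being compared is sketched below, consistent with eqs. (1) and (2). The labels (3) and (4) follow the text; the hat notation for the first-stage estimates and the symbol \tilde{u}_{it} for the second-stage error are assumptions introduced here.

```latex
\begin{align}
Y_{1it} &= a_0 + a_1\,(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \tag{3} \\
Y_{1it} &= a_0 + a_1\,(\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + \tilde{u}_{it} \tag{4}
\end{align}
```

In (4) the error \tilde{u}_{it} absorbs u_{it} and the first-stage residual. The first-stage residual is orthogonal to the fitted value by construction, and u_{it} is uncorrelated with X_{it} and Z_{it} by assumption.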
In eq. (4) the estimated coefficient for the Z_{it} variable is a_1 b_2.
We already know the value of b_2 from eq. (2).
Therefore a_1 = (a_1 b_2) / b_2.
It is important to note that the a_1 coefficient can be estimated only if there is at least one exogenous variable in the structural model for Y_{2it} that is excluded from the structural model for Y_{1it}.
This is the Z_{it} variable in eq. (2).
In eq. (4) the a_1 coefficient is “just” identified because there is only one exogenous variable (Z_{it}) that is in the Y_{2it} model and that is excluded from the Y_{1it} model.
Suppose we had included Z_{it} in both models.
In this case, the a_1 coefficient cannot be identified because we estimate only the combined coefficient (a_3 + a_1 b_2) and b_2.
In other words, we cannot determine whether the effect of Z_{it} on Y_{1it} is a main effect (a_3) or an indirect effect through Y_{2it} (a_1 b_2).
Here we say that the system of equations is “under-identified”.
Suppose we had included two exogenous variables in the Y_{2it} model and we excluded both these variables from the Y_{1it} model.
Now we have estimates of four coefficients: the coefficient of each excluded variable in the Y_{2it} model, and the coefficient of each excluded variable in the Y_{1it} model once Y_{2it} is replaced by its prediction.
Therefore a_1 can be recovered in two different ways, one from each excluded variable.
Here we say that the system of equations is “over-identified”.
In this example, the system is “triangular” because there are two equations and one endogenous right-hand side variable.
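The over-identified case can be written out as follows. This is a sketch: the symbols Z_{1it}, Z_{2it}, b_2 and b_3 for the two excluded variables and their coefficients are assumptions introduced for illustration.

```latex
\begin{align*}
Y_{2it} &= b_0 + b_1 X_{it} + b_2 Z_{1it} + b_3 Z_{2it} + v_{it} \\
Y_{1it} &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\,X_{it} + a_1 b_2\,Z_{1it} + a_1 b_3\,Z_{2it} + (a_1 v_{it} + u_{it}) \\
a_1 &= \frac{a_1 b_2}{b_2} = \frac{a_1 b_3}{b_3}
\end{align*}
```

In a finite sample the two ratios generally differ, and a large discrepancy between them is what a test of over-identifying restrictions picks up.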
5.3 When the endogenous right hand side variable is continuous
When the models have a triangular structure, the models can be estimated using the ivregress command.
The models can be estimated using 2SLS, LIML or GMM.
2SLS is more commonly used in practice.
5.3.1 Estimating triangular models using 2SLS (ivregress)
Go to MySite.
Open up the housing.dta file, which provides data from 50 U.S. states (1980 Census):
use "J:\phd\housing.dta", clear
• pct_population_urban = the % of the population that lives in urban areas
• family_income = median annual family income
• housing_value = median value of private housing
• rent = median monthly housing rental payments
• region1 - region4 = dummy variables for four regions in the U.S.
Suppose we want to estimate the following:
rent = a0 + a1 pct_population_urban + a2 housing_value + u
housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v
This is a triangular system because there are two equations and one endogenous right hand side variable (housing_value).
If u and v are correlated, the OLS estimate of a2 will be biased in the rent model.
If we ignore the endogeneity problem, we can estimate the rent model using simple OLS:
reg rent housing_value pct_population_urban
To take account of the potential endogeneity problem we use the ivregress command:
ivregress estimator depvar1 [varlist1] (depvar2 = varlistiv)
• estimator is either 2sls or liml or gmm
• depvar1 is the dependent variable for the model which has an endogenous regressor
• varlist1 are the exogenous variables in the model that has the endogenous regressor
• depvar2 is the endogenous regressor
• varlistiv are the exogenous variables that are believed to affect the endogenous regressor
The models that we want to estimate are:
rent = a0 + a1 pct_population_urban + a2 housing_value + u
housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v
The rent model has an endogenous regressor:
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
The housing_value model can be estimated using OLS as there are no endogenous regressors:
reg housing_value family_income region2 region3 region4
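To see what ivregress 2sls does internally, the two stages can be run by hand. This is a sketch that assumes housing.dta is already in memory; hv_hat is a hypothetical variable name. Note that the first stage uses all the exogenous variables in the system, including pct_population_urban, so it is not the same regression as the structural housing_value model.

```stata
* Stage 1: regress the endogenous regressor on ALL exogenous variables,
* including pct_population_urban from the rent model
regress housing_value pct_population_urban family_income region2 region3 region4
predict double hv_hat, xb

* Stage 2: replace housing_value with its fitted value
regress rent hv_hat pct_population_urban

* One-step equivalent for comparison
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
```

The coefficients from the manual second stage should match those from ivregress 2sls, but the manual standard errors are not correct because they ignore the fact that hv_hat is itself estimated. In practice the one-step command is the one to report.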
We should test whether:
• our chosen instruments are exogenous (i.e., they should be uncorrelated with the error term), and
• it is valid to exclude some of them from the model that has the endogenous regressor.
If they are not exogenous or they should not be excluded, they are not valid instruments.
The tests for instrument validity are also known as tests of “over-identifying” restrictions because the tests can only be performed if the model with the endogenous regressor is overidentified.
• The tests assume that at least one of the chosen instruments is valid (unfortunately this assumption cannot be tested).
In our example, the instrumented housing_value variable is overidentified because four of the exogenous variables (family_income region2 region3 region4) are excluded from the rent model.
• If we had excluded only one of these variables, the instrumented housing_value variable would have been “just” identified, in which case it would not be possible to test for instrument validity.
We obtain the tests for instrument validity by typing estat overid after we run ivregress:
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid
These tests are statistically significant, which means we reject the hypothesis that the chosen instruments are valid.
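The logic of the over-identification test can be sketched by hand: if the instruments are valid, the IV residuals should be unrelated to the exogenous variables. The commands below assume housing.dta is in memory; u_hat and the scalar sargan are hypothetical names. With four excluded instruments and one endogenous regressor there are 4 - 1 = 3 degrees of freedom.

```stata
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)

* Save the IV residuals
predict double u_hat, residuals

* Regress the residuals on all the exogenous variables
regress u_hat pct_population_urban family_income region2 region3 region4

* N times R-squared is chi-squared with 3 degrees of freedom under the null
scalar sargan = e(N)*e(r2)
display "N*R-squared = " sargan
display "p-value (chi2, 3 df) = " chi2tail(3, sargan)
```

This statistic should reproduce the Sargan line reported by estat overid after 2SLS; estat overid remains the more convenient route.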
This is not surprising because we did not have good reason to assume that they are exogenous and validly excluded from the rent model.
For example:
• family_income is endogenous if family incomes depend on housing values and rents
  • Why would this be true?
• rents may be different across the four regions, so the region dummies should not be excluded from the rent model
We obtain different statistics for the tests of instrument validity if the models are estimated using LIML or GMM.
However, the conclusions are the same as in our previous example:
ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid
ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid
Note that we cannot test for instrument validity when the endogenous regressor is just identified.
This is because the test statistics are obtained under the assumption that at least one of the instruments is valid.
For example:
ivregress 2sls rent pct_population_urban (housing_value = family_income)
estat overid
ivregress liml rent pct_population_urban (housing_value = family_income)
estat overid
ivregress gmm rent pct_population_urban (housing_value = family_income)
estat overid
We can also test whether the coefficient of the “endogenous” regressor is biased under OLS.
We obtain two Hausman tests for endogeneity bias by typing estat endogenous after we run ivregress:
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat endogenous
(The Durbin statistic uses an estimate of the error term’s variance assuming that the variable being tested is exogenous, whereas the Wu-Hausman statistic assumes that the variable being tested is endogenous.)
Given these results, we may be tempted to reject the hypothesis that housing_value is exogenous.
However, the Hausman tests for endogeneity bias are only reliable if the chosen instruments are valid. In our example they are not, and so we cannot draw conclusions about the potential for endogeneity bias.
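The idea behind the endogeneity test can also be sketched as an augmented regression: add the first-stage residual to the OLS rent model and test whether its coefficient is zero. This assumes housing.dta is in memory; v_hat is a hypothetical variable name, and the correspondence with the Wu-Hausman statistic from estat endogenous is stated here as an expectation, not a guarantee.

```stata
* First stage: all exogenous variables on the right hand side
regress housing_value pct_population_urban family_income region2 region3 region4
predict double v_hat, residuals

* Augmented regression: the actual regressor plus the first-stage residual
regress rent housing_value pct_population_urban v_hat

* A significant coefficient on v_hat points to endogeneity of housing_value
test v_hat
```

As with estat endogenous, this check is only informative if the instruments used in the first stage are valid.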
Class exercise 5a
Using the fees.dta file, estimate the following models for audit fees and company size:
lnaf = a0 + a1 lnta + a2 big6 + u
lnta = b0 + b1 ln_age + b2 listed + v
where lnaf is the log of audit fees, lnta is the log of total assets, ln_age is the log of the company’s age in years, and listed is a dummy variable indicating whether the company’s shares are publicly traded on a market.
• Is the instrumented lnta variable over-identified, just-identified, or under-identified? Explain.
• Estimate the audit fee model using 2SLS.
• Test the validity of the chosen instrumental variables.
• Test whether the lnta variable is affected by endogeneity bias.
• Verify that the test for instrument validity is not available if you change the model so that it is just-identified.
The key to estimating IV models is to find one or more “exogenous” variables that explain the endogenous regressor and that can be safely excluded from the main equation.
Unfortunately, most accounting studies that use IV regression do not attempt to justify why their chosen instruments are exogenous or why they can be excluded from the structural model.
As a result, Larcker and Rusticus (2010) criticize the way in which accounting studies have applied IV regression.
• A key problem is that the IV results can be very sensitive to the researcher’s choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen in a very arbitrary way.
Larcker and Rusticus (2010) recommend that researchers justify their chosen instruments using theory or economic intuition.
• The estat overid test should not be used to select instruments on purely statistical grounds, particularly as the test is invalid if all of the chosen instruments are also invalid.
When testing instrument validity (estat overid) and endogeneity bias (estat endog), it is also important to consider your sample size:
• In large samples, the tests may reject a null hypothesis that is “nearly true”.
• In small samples, the tests may fail to reject a null hypothesis that is “very false”.
5.3.2 Estimating simultaneous equations using 3SLS (reg3)
So far we have been examining a triangular system. For example, Y_{2it} affects Y_{1it} but Y_{1it} does not affect Y_{2it}:
Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}
Y_{2it} = b_0 + b_2 X_{it} + b_3 Z_{1it} + v_{it}
In a simultaneous system, both dependent variables affect each other:
Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}
Y_{2it} = b_0 + b_1 Y_{1it} + b_2 X_{it} + b_3 Z_{1it} + v_{it}
Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}   (1)
Y_{2it} = b_0 + b_1 Y_{1it} + b_2 X_{it} + b_3 Z_{1it} + v_{it}   (2)
In this case, the OLS estimates are biased because:
• Eq. (1) shows that u_{it} affects Y_{1it} while eq. (2) shows that Y_{1it} affects Y_{2it}. As a result, it must be true that u_{it} is correlated with Y_{2it} in eq. (1). Therefore, the OLS estimate of a_1 would be biased in eq. (1).
• Eq. (2) shows that v_{it} affects Y_{2it} while eq. (1) shows that Y_{2it} affects Y_{1it}. As a result, it must be true that v_{it} is correlated with Y_{1it} in eq. (2). Therefore, the OLS estimate of b_1 would be biased in eq. (2).
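The first of these arguments can be made explicit by substituting eq. (1) into eq. (2) and solving for Y_{2it}. This is only algebra on the two equations, under the added assumption that a_1 b_1 ≠ 1:

```latex
\begin{align*}
Y_{2it} &= b_0 + b_1\,(a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}) + b_2 X_{it} + b_3 Z_{1it} + v_{it} \\
(1 - a_1 b_1)\,Y_{2it} &= (b_0 + b_1 a_0) + (b_2 + b_1 a_2)\,X_{it} + b_1 a_3\,Z_{2it} + b_3 Z_{1it} + (b_1 u_{it} + v_{it})
\end{align*}
```

Y_{2it} therefore contains the term b_1 u_{it} / (1 - a_1 b_1), so it is correlated with u_{it} whenever b_1 ≠ 0, even if u_{it} and v_{it} are uncorrelated. The argument for Y_{1it} and v_{it} is symmetric.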
For example, it seems reasonable to argue that housing values depend on rents as well as rents depending on housing values:
rent = a0 + a1 housing_value + a2 pct_population_urban + u
housing_value = b0 + b1 rent + b2 family_income + b3 region2 + b4 region3 + b5 region4 + v
Note that for identification, each equation must contain at least one exogenous variable that is not included in the other equation. These are:
• pct_population_urban in the rent model
• family_income, region2 - region4 in the housing_value model
We estimate this kind of model using the reg3 command:
reg3 (depvar1 varlist1) (depvar2 varlist2)
use "J:\phd\housing.dta", clear
reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4)
Unfortunately, the overid and endog commands are not currently available with reg3.
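One possible workaround, sketched here under the assumption that housing.dta is in memory, is to estimate each equation separately by 2SLS. The reg3 command has a 2sls option for equation-by-equation estimation, and the same equations can be run through ivregress, which does support the postestimation tests.

```stata
* Equation-by-equation 2SLS on the full system
reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4), 2sls

* Rent equation: four excluded instruments, so it is over-identified
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid
estat endogenous

* Housing value equation: one excluded instrument, so it is just identified
* and only the endogeneity test is available
ivregress 2sls housing_value family_income region2 region3 region4 (rent = pct_population_urban)
estat endogenous
```

The point estimates from reg3 with 2sls and from ivregress 2sls should agree, although the standard errors can differ because the two commands use different degrees-of-freedom conventions. These single-equation tests do not use the cross-equation information that 3SLS exploits.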
5.4 When the endogenous right hand side variable is binary
So far we have been dealing with the case where the endogenous regressor is continuous.
We may want to estimate a model in which the endogenous regressor is binary.
This brings us to a special class of models which are known as “self-selection” or “Heckman” models.
• “Selectivity” = “Endogeneity” where the endogenous regressor is binary
The basic idea is similar to the instrumental variable techniques that we have already discussed.
Examples of endogenous binary variables in accounting:
• Companies decide whether to use hedge contracts (Barton, 2001; Pincus and Rajgopal, 2002).
• Companies decide whether to grant stock options (Core and Guay, 1999).
• Companies decide whether to hire Big 5 or non-Big 5 auditors (e.g., Chaney et al., 2004).
• Governments decide whether to fully or partially privatize (Guedhami and Pittman, 2006).
• Companies decide whether to follow an international financial reporting strategy (Leuz and Verrecchia, 2000).
• Companies decide whether to recognize financial instruments at fair value or to disclose them (Ahmed et al., 2006).
• Companies decide whether or not to go private (Engel et al., 2002).
Selection model
Concerns about selectivity arise when the RHS dummy variable (D) is endogenous.
Endogeneity results in bias if E(u | D) ≠ 0.
If u and v are correlated, then E(u | D) ≠ 0, in which case the OLS estimate of the effect of D on Y would be biased.
The intuition underlying Heckman is to estimate and then control for E(u | D). First model the choice of D.
Z is a vector of exogenous variables that affect D but have no direct effect on Y.
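A standard way to write the two models is sketched below. Only Y, D, Z, u and v appear in the text; the coefficient symbols, the control variables X_i and the latent variable D_i^{*} are assumptions introduced for illustration.

```latex
\begin{align*}
Y_i &= a_0 + a_1 D_i + a_2 X_i + u_i \\
D_i^{*} &= \gamma' Z_i + v_i, \qquad D_i = 1 \text{ if } D_i^{*} > 0, \quad D_i = 0 \text{ otherwise}
\end{align*}
```

Because D_i is a function of v_i, any correlation between u_i and v_i makes D_i correlated with the error term of the Y model.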
[Figure: selection model diagram relating Z, D and Y]
Estimate E(u | D) and include it as a control variable on the RHS of the Y model:
E(u | D) = ρσ IMR, where ρ captures the correlation between u and v, σ is the standard deviation of u, and IMR is the inverse Mills ratio obtained from the model of D.
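For reference, the usual expression for the inverse Mills ratio with an endogenous dummy is given below. It assumes that u and v are jointly normal with v having unit variance; \gamma' Z_i denotes the index from the probit model of D, and \phi and \Phi are the standard normal density and distribution functions. This notation is an assumption introduced here.

```latex
\[
\mathrm{IMR}_i =
\begin{cases}
\dfrac{\phi(\gamma' Z_i)}{\Phi(\gamma' Z_i)} & \text{if } D_i = 1 \\[2ex]
\dfrac{-\phi(\gamma' Z_i)}{1 - \Phi(\gamma' Z_i)} & \text{if } D_i = 0
\end{cases}
\]
```

The two branches are the conditional means E(v | D = 1) and E(v | D = 0), which is why multiplying by ρσ gives E(u | D).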
The MILLS (IMR) variable is added as a “control for selectivity” in the Y model.
The OLS estimate of the effect of D on Y is now unbiased because E(ε | D) = 0, where ε is the error term that remains after controlling for MILLS.
The D and Y models can be estimated in two steps or estimated jointly using maximum likelihood (ML).
• ML yields separate estimates of ρ and σ.
• The two-step yields an estimate of ρσ.
• Under the null of no selectivity bias, ρ = 0 and ρσ = 0.
Class exercise 5b
We are going to look at a fictional dataset on 2,000 women.
use "J:\phd\heckman.dta", clear
sum age education married children wage
Suppose we believe that older and more highly educated women earn higher wages. Why would it be wrong to estimate the following model?
reg wage age education
Estimate a probit model to test whether women are more likely to be employed if they are married, have children, are older and more highly educated.

5.4 When the endogenous right hand side variable is binary (heckman)
It is easy to estimate the two-step Heckman model in STATA:
    heckman depvar1 [varlist1], select(depvar2 = varlist2) twostep
where depvar1 is the dependent variable in the main equation and depvar2 is the dependent variable in the selection model.
Going back to our dataset on female wages:
    heckman wage education age, select(emp = married children education age) twostep
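To see what the twostep option does, the two steps can be reproduced by hand. This is a minimal sketch, assuming the heckman.dta file from class exercise 5b and that emp is coded 0/1; the variable names zg and imr are made up for illustration.

```stata
* Sketch only: assumes heckman.dta from class exercise 5b, with emp coded 0/1
use "J:\phd\heckman.dta", clear

* Step 1: probit model of employment (the D model)
probit emp married children education age

* Linear index from the probit (zg is a hypothetical variable name)
predict zg, xb

* Inverse Mills ratio for the selected (employed) women
generate imr = normalden(zg) / normal(zg)

* Step 2: OLS wage model on the employed sample, with the IMR as a control
regress wage education age imr if emp == 1
```

The coefficient on imr should match the lambda reported by heckman with the twostep option. The OLS standard errors from this manual second step are not corrected for the fact that imr is itself estimated, so the heckman command remains preferable for inference.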

Women’s wages are higher if they are older and more highly educated.
The probit model of employment is exactly the same as what we had before.
Women are more likely to be in employment if they are married, have children, are more highly educated or older.
The 657 censored observations are the women who are not in employment.
The Wald chi2 tests the overall significance of the model.

The lambda variable is simply the IMR that was estimated from the emp model.
The IMR coefficient is 4.00 and statistically significant, so there is statistically significant evidence of a selection effect.
The IMR coefficient is the product of rho and sigma (ρσ).
Thus, 4.00 ≈ 0.67 * 5.95
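The relation between lambda, rho and sigma can be checked from the stored results. This is a sketch that assumes the same heckman.dta file; e(lambda), e(rho) and e(sigma) are the scalars that heckman stores after two-step estimation.

```stata
* Sketch only: assumes heckman.dta from class exercise 5b
use "J:\phd\heckman.dta", clear
quietly heckman wage education age, select(emp = married children education age) twostep

* Coefficient on the IMR (lambda) and the product rho * sigma
display "lambda      = " e(lambda)
display "rho * sigma = " e(rho) * e(sigma)

* The rounded figures quoted on the slide
display 0.67 * 5.95
```

The last line gives roughly 3.99 rather than exactly 4.00 because rho and sigma are rounded to two decimal places on the slide.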

Class exercise 5c
Estimate the following audit fee models separately for Big 6 and Non-Big 6 audit clients:
    lnaf = a0 + a1 lnta + u      (1)
    lnaf = a0 + a1 lnsales + u   (2)
where lnaf = log of audit fees, lnta = log of total assets, lnsales = log of sales.
Use the heckman command to “control” for endogeneity with respect to the company’s selected auditor. Your auditor choice models are as follows:
    big6 = b0 + b1 lnsales + b2 lnta + v
    nbig6 = c0 + c1 lnsales + c2 lnta + w
where big6 = 1 (big6 = 0) if the company chooses a Big 6 (Non-Big 6) auditor; and nbig6 = 1 (nbig6 = 0) if the company chooses a Non-Big 6 (Big 6) auditor.

Class exercise 5c
What exclusion restrictions are you imposing in equations (1) and (2)?
Is there statistically significant evidence of selectivity?
For the two different specifications of the audit fee model:
  what are the signs of the MILLS coefficients?
  what are the signs of rho?

Treatment effects model
In exercise 5c, we estimated the audit fee models separately for the Big 6 and non-Big 6 audit clients.
To do this, we used the heckman command.
Suppose that we want to estimate one audit fee model with Big 6 on the right hand side of the equation (i.e., we assume that the X variables have the same slope coefficients in the two equations).
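One way to write this single-equation setup is sketched below, using lnta as the only X variable. The coefficient labels are assumptions for illustration; a_2 would be the Big 6 fee premium, and big6* is the latent index behind the auditor choice.

```latex
\[
\mathit{lnaf} = a_0 + a_1\,\mathit{lnta} + a_2\,\mathit{big6} + u
\]
\[
\mathit{big6}^{*} = b_0 + b_1\,\mathit{lnsales} + b_2\,\mathit{lnta} + v,
\qquad
\mathit{big6} = 1 \ \text{if}\ \mathit{big6}^{*} > 0,\ \text{else}\ 0
\]
```

Endogeneity arises when u and v are correlated, which is what the selectivity correction is meant to address.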

Treatment effects model
We can estimate this model using the treatreg command:
    treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
    treatreg lnaf lnsales, treat(big6 = lnta lnsales) twostep
If we don’t specify the twostep option we will get the ML estimates.
Sometimes the ML model will not converge due to a nonconcave likelihood function:
    treatreg lnaf lnta, treat(big6 = lnta lnsales)
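In Stata 13 and later, treatreg was renamed etregress, with the same treat() and twostep options. A sketch of the equivalent commands, assuming the audit fee data from class exercise 5c are already in memory (the slides do not name the file):

```stata
* Sketch only: assumes Stata 13+ and the audit fee data already loaded

* Two-step estimates
etregress lnaf lnta, treat(big6 = lnta lnsales) twostep

* ML estimates; the difficult option is a standard maximize option that
* may help when the log likelihood has nonconcave regions (no guarantee)
etregress lnaf lnta, treat(big6 = lnta lnsales) difficult
```

If the ML model still fails to converge, the two-step estimates remain available as a fallback.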

Treatment effects model
The results for both the treatment effects and Heckman models can be very sensitive to the model specification.
For example, the Big 6 fee premium can easily flip signs from positive to negative:
    treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
    treatreg lnaf lnta lnsales, treat(big6 = lnta lnsales) twostep
Note that there are no exclusion restrictions (Z variables) in the second specification since lnta and lnsales appear in both the first stage and second stage models.

Exclusion restrictions
Lennox, Francis & Wang (2012) argue that many accounting studies have estimated the Heckman and treatment effects models incorrectly.
It is well recognized (in economics) that exogenous Z variables from the first stage choice model need to be validly excluded from the second stage outcome regression (Little, 1985; Little and Rubin, 1987; Manning et al., 1987).
Accounting studies have generally failed to: (a) impose exclusion restrictions, or (b) provide compelling grounds for the validity of the exclusion restrictions.

Exclusion restrictions
Economists recognize that it is important to justify why the Z’s can be validly excluded from the Y model.
For example, Angrist (1990) examines how military service affects the earnings of veteran soldiers after they are discharged from the army.
This involves a selection issue because individuals are more likely to join the military if they have poor wage offers in other types of job.
Angrist (1990) tackles the selectivity issue using data from the Vietnam era, when military service was partly determined by a draft lottery.

Exclusion restrictions
D = military service
Z = Random lottery
Y = civilian earnings

Exclusion restrictions
Angrist and Evans (1998) test whether child bearing reduces female participation in the labor market.
Selectivity is an issue because women are more likely to have children rather than enter the labor market if their wage offers would be low (i.e., lower opportunity cost).
They use the sex composition of the first two children as an instrument for the decision to have a third child.

Angrist and Evans (1998): Exclusion restriction
D = decision to have a third child
Z = Sex composition of first two children
Y = female participation in labor market

Exclusion restrictions
In accounting, many studies fail to justify why Z has no direct impact on Y.
Many studies do not report results for the D model, so the reader cannot evaluate the power of the Z variables for identifying selectivity.
Some studies estimate models in which there are no nominated Z variables.
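Reporting the D model and the joint significance of the Z variables takes only two commands. The sketch below uses the female wage example, where married and children are the Z variables because they appear in the employment model but not in the wage model; it assumes the heckman.dta file from class exercise 5b.

```stata
* Sketch only: assumes heckman.dta from class exercise 5b, with emp coded 0/1
use "J:\phd\heckman.dta", clear

* First-stage (D) model, reported in full so readers can inspect it
probit emp married children education age

* Joint Wald test of the excluded Z variables
test married children
```

A weak or insignificant joint test would suggest that the Z variables have little power to identify selectivity, whatever the argument for their exclusion from the Y model.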

Exclusion restrictions
When there are no exclusion restrictions, identification of the MILLS coefficients relies on the assumed non-linearity (MILLS is a non-linear function of the same X variables that appear in the Y model).
MILLS will capture any misspecification of the functional relation between X and Y (e.g., non-linearity) in addition to any selectivity bias.

Exclusion restrictions
Little (1985): Relying on nonlinearities to identify selectivity bias is “unappealing” because it is very difficult to distinguish empirically between selectivity and misspecification of the model’s functional form.
STATA manual: “Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult to take such results seriously since the functional-form assumptions have no firm basis in theory.”
A failure to nominate any Z variables can worsen the problems of multicollinearity (Manning et al., 1987; Puhani, 2000; Leung and Yu, 2000).

Example: Chaney, Jeter and Shivakumar (2004)
D = BIG5 (company hires a Big 5 or non-Big 5 auditor)
Z = null set
Y = Audit fees

Example: Leuz and Verrecchia (2000)
D = IR97 (international reporting)
Z = ROA, Capital intensity, UK/US listing.
Y = Cost of capital

Leuz and Verrecchia (2000)
Is it valid to assume that ROA, Capital intensity, and UK/US listing have no direct effect on the cost of capital?
Are these Z variables really exogenous?

Leuz and Verrecchia (2000)
Are the tests for selectivity bias powerful?
Are the results sensitive to functional form? (see the free float variable).
LV do not report results using OLS.
LV do not report whether their results are sensitive to alternative model specifications.

Going forward
Researchers need to be aware that Heckman and treatment effects models can provide results that are extremely fragile. Sensitivity primarily affects the RHS variable that is assumed to be endogenous (D) and the IMRs.
Studies need to discuss:
  why the Z’s are exogenous
  why the Z’s have no direct effect on Y
  whether the Z’s are powerful predictors of D
The signs and significance of the IMRs alone do not provide compelling evidence as to the direction or existence of selectivity bias.
Selection studies should routinely report tests for multicollinearity problems.
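One simple multicollinearity check is to build the MILLS variable by hand and look at variance inflation factors in the second-stage regression. This is a sketch, assuming the audit fee data from class exercise 5c are already in memory; zg and mills are hypothetical variable names, and the specification deliberately has no exclusion restrictions.

```stata
* Sketch only: assumes the audit fee data are loaded and big6 is coded 0/1

* First-stage auditor choice model and its linear index
probit big6 lnta lnsales
predict zg, xb

* Selectivity term for a binary treatment: one expression for big6 = 1,
* another for big6 = 0
generate mills = cond(big6 == 1, normalden(zg) / normal(zg), ///
    -normalden(zg) / (1 - normal(zg)))

* Second stage with no excluded Z variables, then variance inflation factors
regress lnaf lnta lnsales big6 mills
estat vif
```

Very large VIFs on big6 and mills would indicate that the two are hard to separate from the X variables, which is consistent with the fragility described above.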

Summary
When the endogenous regressor is continuous, you can “control” for endogeneity using the ivregress or reg3 commands.
When the endogenous regressor is binary, you can “control” for endogeneity using the heckman or treatreg commands.
If you want to control for endogeneity, it is vitally important that you have a good justification for your chosen exclusion restrictions.
Choosing arbitrary exclusion restrictions will probably give you garbage results.