
Metrics 2 Notes

Conor Walsh

1 Some Basic Linear Projection Material

Consider your sample data $Y$ and the columns of $X$ as points in $\mathbb{R}^n$. Call the space spanned by the columns of $X$ $S(X)$, which has dimension $k$. By definition, any point in $S(X)$ can be expressed as $X\beta$ for $\beta \in \mathbb{R}^k$. The method of least squares is essentially trying to find the point in the subspace $S(X)$ which is closest to $Y$.

The solution to the problem is $\hat{\beta} = (X'X)^{-1}X'Y$, such that $\hat{Y} = X\hat{\beta}$, and the residuals are $e = Y - \hat{Y}$. The residuals are orthogonal to any point in $S(X)$; to see this, pick $X$ itself, then

$$e'X = Y'X - Y'X(X'X)^{-1}X'X = 0$$

A projection is a mapping that takes any point in $E^n$ to a point in a subspace of $E^n$. An orthogonal projection maps that point onto the point in the subspace which is closest to it.

Then

$$\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = P_XY$$

is the orthogonal projection of $Y$ onto $S(X)$.

$$e = Y - \hat{Y} = Y - X(X'X)^{-1}X'Y = (I - X(X'X)^{-1}X')Y = M_XY$$

projects $Y$ onto the orthogonal complement of $S(X)$.

Clearly

$$Y = M_XY + P_XY$$

Note that $P_X$ and $M_X$ are idempotent: once a vector is in $S(X)$, further projecting it to the closest point has no effect. Moreover, you can verify that $P_XM_X = 0$; they annihilate one another. If you map a point into the space $S(X)$, then the closest point to it that is also in $S^{\perp}(X)$ is $0$.

1.1 Frisch-Waugh-Lovell Theorem

Consider the linear model

Y = Xβ + u

and partition it into

$$Y = X_1\beta_1 + X_2\beta_2 + u$$

Suppose we are interested in estimating $\beta_2$. We could get it from OLS on the full model, and this would be $\hat{\beta}_2$. Or we could define the orthogonal projection matrix $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$, which projects any vector in $\mathbb{R}^n$ onto the orthogonal complement of $S(X_1)$. Then we could define $Y^* = M_1Y$ and $X_2^* = M_1X_2$. These are the residuals from running OLS of $Y$ and $X_2$ on $X_1$. Then if we run OLS of $Y^*$ on $X_2^*$, we will get $\tilde{\beta}_2$. The FWL theorem is that $\hat{\beta}_2 = \tilde{\beta}_2$.


Proof.

$$\tilde{\beta}_2 = (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}Y^* = (X_2'M_1X_2)^{-1}X_2'M_1Y$$

Now

$$Y = P_XY + M_XY = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + M_XY$$

Then multiply by $X_2'M_1$:

$$X_2'M_1Y = X_2'M_1X_1\hat{\beta}_1 + X_2'M_1X_2\hat{\beta}_2 + X_2'M_1M_XY$$

The first term on the right is $0$ by orthogonality ($M_1X_1 = 0$). The last term is $X_2'M_1M_X = X_2'M_X - X_2'P_1M_X$, which is also zero: $X_2'M_X = (M_XX_2)' = 0$ because projecting $X_2$ onto the orthogonal complement of $S(X)$ gives zero, and $X_2'P_1M_X = (P_1X_2)'M_X = 0$ by the same annihilation property, since $P_1X_2 \in S(X)$. Hence $X_2'M_1Y = X_2'M_1X_2\hat{\beta}_2$, so $\hat{\beta}_2 = (X_2'M_1X_2)^{-1}X_2'M_1Y = \tilde{\beta}_2$.
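As a quick numerical check of the theorem, here is a small sketch using only numpy on simulated data (all variable names are illustrative): the $X_2$ block of the full-regression coefficients coincides with the coefficients from regressing the $M_1$-residualised $Y$ on the $M_1$-residualised $X_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # includes a constant
X2 = rng.normal(size=(n, 2))
beta1, beta2 = np.array([1.0, -2.0]), np.array([0.5, 3.0])
Y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=n)

# Full OLS: beta_hat = (X'X)^{-1} X'Y, keep the block belonging to X2
X = np.hstack([X1, X2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
beta2_full = beta_hat[X1.shape[1]:]

# FWL: residualise Y and X2 on X1 (apply M1), then regress the residuals
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
Ystar, X2star = M1 @ Y, M1 @ X2
beta2_fwl = np.linalg.solve(X2star.T @ X2star, X2star.T @ Ystar)

print(np.allclose(beta2_full, beta2_fwl))   # True: the two estimates coincide
```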

2 Binary Response Models

We study binary response models of the form

$$P(y = 1|x) = G(x'\beta) = p(x)$$

This is sometimes called an "index model". In most applications, $G$ is a cumulative distribution function whose form can sometimes be derived from an underlying economic model. We can derive $G$ more generally from an underlying latent variable model

$$y^* = x'\beta + e, \qquad y = 1[y^* > 0]$$

where $e$ is a continuously distributed variable independent of $x$ and symmetric around $0$. If $G$ is the cdf of $e$, the symmetry implies that $1 - G(-z) = G(z)$. As such,

$$P(y = 1|x) = P(y^* > 0|x) = P(e > -x'\beta|x) = 1 - G(-x'\beta) = G(x'\beta)$$

The standard choice for $G$ is the CDF of the $N(0,1)$ distribution or the CDF of the logistic distribution. In the latter case the CDF is

$$G(u) = \frac{\exp(u)}{1 + \exp(u)}$$

These correspond to Probit and Logit.

If we have a variance greater than 1 under the normal assumption, then we have to divide through by $\sigma$ to get

$$P\left(\frac{u_i}{\sigma} < \frac{x_i'\beta}{\sigma}\,\Big|\,x_i\right) = \Phi\left(\frac{x_i'\beta}{\sigma}\right)$$


We then have an identification problem: we can only identify the ratio $\beta/\sigma$. There are some ways of resolving this; one is to put a restriction on $\beta$ a priori.

In the nonlinear model, the partial effect of $x_{ik}$ will be

$$\frac{\partial G(x_i'\beta)}{\partial x_{ik}} = g(x_i'\beta)\beta_k$$

The marginal impact depends on all the independent variables. The partial effect must have the same sign as $\beta_k$, since the density is always positive. If we have a symmetric, single-peaked density around $0$, the strongest partial effect will occur at $x'\beta = 0$. One can get the average partial effect by taking the expectation of $g(x_i'\beta)\beta_k$ (for probit, $\phi(x_i'\beta)\beta_k$).

We estimate $\beta$ by MLE. If we assume the probit form,

$$P(y_i = 1|x_i) = \Phi(x_i'\beta), \qquad P(y_i = 0|x_i) = 1 - \Phi(x_i'\beta)$$

The probability of observing $y_i$ is

$$\Phi(x_i'\beta)^{y_i}(1 - \Phi(x_i'\beta))^{1-y_i}$$

which has the obvious log form. We then maximise the log-likelihood

$$L(\beta) = \sum_{i=1}^{n}\left[y_i\log\Phi(x_i'\beta) + (1-y_i)\log(1-\Phi(x_i'\beta))\right]$$

by choosing $\beta$ over the $k$-dimensional parameter space. If you want to introduce heteroskedasticity, assume

ui|xi ∼ N(0, exp(2x′iγ))

where we use the exponential because it always gives a positive number.

Then

$$E(y_i|x_i) = P(y_i = 1|x_i) = P(x_i'\beta + u_i > 0|x_i) = P(u_i > -x_i'\beta|x_i)$$

$$= P\left(\frac{u_i}{\exp(x_i'\gamma)} > -\frac{x_i'\beta}{\exp(x_i'\gamma)}\,\Big|\,x_i\right) = \Phi\left(\frac{x_i'\beta}{\exp(x_i'\gamma)}\right)$$

One comment of interest: the linear probability model (least squares) is equivalent to assuming that the errors are uniformly distributed.

When one does a probit, we should always report the $\beta$'s. We can also report a pseudo-$R^2$ given by $1 - L_{ur}/L_0$, where $L_0$ is the log-likelihood from a model with just an intercept. Since the log-likelihood in a binary response model is always negative, we have $|L_{ur}| \le |L_0|$, so this is between 0 and 1 and increasing in the appropriate manner.
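A minimal sketch of the probit MLE and this pseudo-$R^2$, assuming simulated data and using only numpy/scipy (function and variable names are illustrative, not from any particular package):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(beta, y, X):
    """Negative of L(beta) = sum y*log Phi(x'b) + (1-y)*log(1-Phi(x'b))."""
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.2, 1.0, -0.5])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

res = minimize(probit_negloglik, np.zeros(3), args=(y, X), method="BFGS")
beta_hat = res.x
L_ur = -res.fun                                    # unrestricted log-likelihood

# Intercept-only model for the pseudo R^2 of 1 - L_ur / L_0
res0 = minimize(probit_negloglik, np.zeros(1), args=(y, np.ones((n, 1))), method="BFGS")
L_0 = -res0.fun
print(beta_hat, 1 - L_ur / L_0)
```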


2.1 Asymptotic Normality of Binary Response Model

The result we are heading to is that

$$\sqrt{n}(\hat{\beta} - \beta_0) \xrightarrow{d} N(0, V)$$

where

$$V = \left(E\left[\frac{g(x_i'\beta_0)^2\, x_i x_i'}{G(x_i'\beta_0)[1-G(x_i'\beta_0)]}\right]\right)^{-1}$$

where $g(\cdot) = G'(\cdot)$. The estimator of the asymptotic variance is

$$\hat{V} = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{g(x_i'\hat{\beta})^2\, x_i x_i'}{G(x_i'\hat{\beta})[1-G(x_i'\hat{\beta})]}\right)^{-1}$$

We need to show consistency of $\hat{\beta}$ first, but we omit this. We need to use the results on consistency of extremum estimators we learned in Metrics 1 (which will be treated in detail in Metrics 3).

Our criterion function here comes from the log-likelihood.

$$Q_n(\beta) = \frac{1}{n}\sum_{i=1}^{n} m(w_i, \beta) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i\log G(x_i'\beta) + (1-y_i)\log(1-G(x_i'\beta))\right]$$

To show the convergence in distribution result, we look at $Q_n(\beta) - Q_n(\beta_0)$. We expand the first term in a neighbourhood of the true value $\beta_0$:

$$Q_n(\beta) - Q_n(\beta_0) \approx \nabla_\beta Q_n(\beta_0)'(\beta - \beta_0) + \frac{1}{2}(\beta - \beta_0)'\nabla_{\beta\beta'}Q_n(\beta_0)(\beta - \beta_0)$$

So maximising the LHS by choosing $\beta$ will also maximise the RHS, approximately. The FOC for the RHS is

$$\nabla_\beta Q_n(\beta_0) = -\nabla_{\beta\beta'}Q_n(\beta_0)(\hat{\beta}_n - \beta_0)$$

We need that

$$Q_n(\beta) \xrightarrow{p} Q(\beta) = E\,m(w_i, \beta)$$

for all $\beta$, but this is just an implication of the WLLN.

This implies

$$\nabla_\beta Q_n(\beta) \xrightarrow{p} \nabla_\beta E\,m(w_i,\beta), \qquad \nabla_{\beta\beta'}Q_n(\beta) \xrightarrow{p} \nabla_{\beta\beta'}E\,m(w_i,\beta)$$

by Slutsky. Now $\nabla_\beta E\,m(w_i, \beta_0) = 0$ in the MLE model (proof below), since by definition we're at the maximum.

Hence by the CLT


$$\sqrt{n}\,\nabla_\beta Q_n(\beta_0) \xrightarrow{d} N(0, B)$$

So we can write our FOC as

$$\sqrt{n}(\hat{\beta} - \beta_0) = [-\nabla_{\beta\beta'}Q_n(\beta_0)]^{-1}\sqrt{n}\,\nabla_\beta Q_n(\beta_0) \xrightarrow{d} N(0, A^{-1}BA^{-1})$$

where the Hessian on the RHS will be invertible with probability approaching 1: since we have a unique maximum of the criterion function $Q(\beta)$ in the limit, the Hessian must be negative definite.

Now

$$Q(\beta) = E\,m(w_i, \beta) = E\left[y_i\log G(x_i'\beta) + (1-y_i)\log(1-G(x_i'\beta))\right]$$

We assume we can interchange differentiation and expectation because the function satisfies the conditions of the dominated convergence theorem.

So

$$\nabla_\beta E\,m(w_i, \beta_0) = E\,\nabla_\beta m(w_i, \beta_0) = E\,\nabla_\beta\left[y_i\log G(x_i'\beta_0) + (1-y_i)\log(1-G(x_i'\beta_0))\right]$$

$$= E\left[\frac{g(x_i'\beta_0)x_i y_i}{G(x_i'\beta_0)} - \frac{g(x_i'\beta_0)x_i(1-y_i)}{1-G(x_i'\beta_0)}\right]$$

where we have written this as a column vector.

Then we use the LIE to get

$$= E\left[E\left[\frac{g(x_i'\beta_0)x_i y_i}{G(x_i'\beta_0)} - \frac{g(x_i'\beta_0)x_i(1-y_i)}{1-G(x_i'\beta_0)}\,\Big|\,x_i\right]\right]$$

Now note that $P(y_i = 1|x_i) = E(y_i|x_i) = G(x_i'\beta_0)$ since $y_i$ is a Bernoulli RV, so taking the conditional expectation through we get

$$E\left[g(x_i'\beta_0)x_i - g(x_i'\beta_0)x_i\right] = 0$$

So to get B, just write

$$\nabla_\beta Q_n(\beta_0) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i\,g(x_i'\beta_0)x_i}{G(x_i'\beta_0)} + (1-y_i)\frac{-g(x_i'\beta_0)x_i}{1-G(x_i'\beta_0)}\right] = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i g x_i - y_i g x_i G(x_i'\beta_0) - (1-y_i)g x_i G(x_i'\beta_0)}{(1-G(x_i'\beta_0))G(x_i'\beta_0)}$$

$$= \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - G(x_i'\beta_0))\,g x_i}{(1-G(x_i'\beta_0))G(x_i'\beta_0)}$$

writing $g = g(x_i'\beta_0)$ for short. $B$ is the variance of the term in the summation, which is just the expectation of the term times its transpose, since the expectation is zero as shown above.

So, skipping a step to get to the LIE version


$$B = E\left[E\left[\frac{(y_i - G)^2 g^2 x_i x_i'}{(1-G(x_i'\beta_0))^2 G(x_i'\beta_0)^2}\,\Big|\,x_i\right]\right]$$

If the model is correctly specified, such that $E(y_i|x_i) = G(x_i'\beta_0)$, then by the variance of a Bernoulli $E[(y_i - G)^2|x_i] = (1-G)G$, so $B$ reduces to

$$B = E\left[\frac{g^2 x_i x_i'}{(1-G)G}\right]$$

To get $A$, we need to do the same thing with $\nabla_{\beta\beta'}Q_n(\beta_0)$. Trust me: as long as the model is correctly specified and $E(y_i|x_i) = G(x_i'\beta_0)$, we get $A = B$.

So the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta_0)$ becomes $B^{-1}$, which we can estimate.

If we want to be robust to misspecification of the functional form, we need to estimate the sandwich form $A^{-1}BA^{-1}$.
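A sketch of these variance estimators for the probit case ($G = \Phi$, $g = \phi$), assuming $\hat{\beta}$ has already been obtained (e.g. by the MLE sketch above); the estimator names $\hat{A}$ and $\hat{B}$ follow the notes, and everything else is illustrative.

```python
import numpy as np
from scipy.stats import norm

def probit_variances(beta_hat, y, X):
    """Return (B^{-1}/n, sandwich A^{-1} B A^{-1}/n) for a fitted probit."""
    xb = X @ beta_hat
    G = np.clip(norm.cdf(xb), 1e-10, 1 - 1e-10)
    g = norm.pdf(xb)
    n = len(y)
    # A_hat: expected-Hessian form (1/n) sum g^2 x x' / (G(1-G))
    w_A = g**2 / (G * (1 - G))
    A = (X * w_A[:, None]).T @ X / n
    # B_hat: outer product of scores, score_i = (y - G) g x / (G(1-G))
    s = ((y - G) * g / (G * (1 - G)))[:, None] * X
    B = s.T @ s / n
    V_mle = np.linalg.inv(B) / n                # valid if correctly specified (A = B)
    Ainv = np.linalg.inv(A)
    V_robust = Ainv @ B @ Ainv / n              # sandwich form
    return V_mle, V_robust
```

Square roots of the diagonal of either returned matrix give standard errors for $\hat{\beta}$.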

2.2 Endogeneity in Binary Response Models

The typical source of endogeneity is omitted variable bias

$$y_i^* = x_i'\beta + u_i + c_i$$

Assume first that $c_i$ is independent of $x_i$ and $u_i$, and $c_i \sim N(0, \sigma_c^2)$. Write $c_i + u_i = v_i$. Then $v_i \sim N(0, 1 + \sigma_c^2)$ as the sum of two independent normals. The variance is not one any more, so we have a problem. Then we have

$$E(y_i|x_i) = P(y_i^* > 0|x_i) = P(v_i > -x_i'\beta|x_i) = P\left(\frac{v_i}{\sqrt{1+\sigma_c^2}} < \frac{x_i'\beta}{\sqrt{1+\sigma_c^2}}\,\Big|\,x_i\right) = \Phi\left(\frac{x_i'\beta}{\sqrt{1+\sigma_c^2}}\right)$$

As above, if we try to estimate this by maximum likelihood we can only identify $\beta/\sqrt{1+\sigma_c^2}$, since we have no way of estimating $\sigma_c$ because $c_i$ is unobservable. So $\hat{\beta} \xrightarrow{p} \beta/\sqrt{1+\sigma_c^2}$. This is called attenuation bias: the estimate is always smaller in magnitude than the true value in the probability limit. Compare this with the linear OLS model, where we still have a consistent estimator when an unobservable variable is uncorrelated with the observables.

We now study approaches to endogeneity. The first is the control function approach.

Suppose we have endogeneity, $\mathrm{cov}(x_i, u_i) \ne 0$, in

$$y_i^* = x_i'\gamma + u_i = z_{1i}'\beta + \alpha y_{2i} + u_i$$

where the endogenous variable is $y_{2i}$. Thus the IVs are

$$z_i = (z_{1i}', z_{2i}')'$$

We assume $u_i \sim N(0, 1)$.


The idea of the control function is to introduce a new term $v_i$ such that $u_i$ and $v_i$ are correlated, and the only bit left is $e_i$, which is uncorrelated with $y_{2i}$. We get $v_i$ from the linear projection

$$y_{2i} = z_i'\delta + v_i$$

We project $u_i$ on $v_i$ to get

$$u_i = \theta v_i + e_i$$

where $\theta = \sigma_{uv}/\sigma_v^2$. Then by linear projection $E(v_i e_i) = 0$.

We assume the conditional distribution is

$$\begin{pmatrix}u \\ v\end{pmatrix}\Big|\,z \sim N(0, \Sigma), \qquad \Sigma = \begin{bmatrix}1 & \sigma_{uv}\\ \sigma_{uv} & \sigma_v^2\end{bmatrix}$$

So returning to the original equation we have

$$y_i^* = z_{1i}'\beta + \alpha y_{2i} + \theta v_i + e_i$$

The distribution of $e_i$ satisfies $D(e_i|z_{1i}, y_{2i}, v_i) = D(e_i|z_i, v_i)$, since $y_{2i}$ is a linear combination of $z_i$ and $v_i$. Then

$$D(e_i|z_i, v_i) = D(e_i)$$

since $v_i$ is independent of $e_i$ (by the linear projection, under joint normality), and $z_i$ is independent of $v_i$ and $u_i$ by assumption. $e_i = u_i - \theta v_i$, so we can get the variance from this, and since both $u_i$ and $v_i$ are normal we can get the distribution as a sum of normals.

Define the correlation between $u$ and $v$ for convenience:

$$\rho = \frac{\sigma_{uv}}{\sigma_v}$$

Then the distribution is

$$D(e_i) = N(0, 1 - \rho^2)$$

Now

$$P(y = 1|z_1, y_2, v) = P(e > -(z_1'\beta + \alpha y_2 + \theta v)|z_1, y_2, v) = P\left(\frac{e}{\sqrt{1-\rho^2}} < \frac{z_1'\beta + \alpha y_2 + \theta v}{\sqrt{1-\rho^2}}\right) = \Phi\left(\frac{z_1'\beta + \alpha y_2 + \theta v}{\sqrt{1-\rho^2}}\right)$$

We also need to estimate ρ. We do this by


$$\rho = \frac{\sigma_{uv}}{\sigma_v} = \theta\,\sigma_v$$

So since we can estimate $\theta$ and $\sigma_v$, we can estimate $\rho$.
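A sketch of the two-step control function estimator just described, on simulated data: OLS of $y_2$ on $z$ gives $\hat{v}$, then a probit of $y$ on $(z_1, y_2, \hat{v})$. Only numpy/scipy are assumed and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
z1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # exogenous regressors
z2 = rng.normal(size=(n, 1))                                  # excluded instrument
z = np.hstack([z1, z2])

# Jointly normal errors with corr(u, v) != 0
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
u, v = rng.multivariate_normal([0, 0], cov, size=n).T
y2 = z @ np.array([0.3, 0.8, 1.2]) + v                        # first stage
ystar = z1 @ np.array([0.1, 1.0]) + 0.7 * y2 + u
y = (ystar > 0).astype(float)

# Step 1: first-stage OLS residuals v_hat
delta_hat = np.linalg.lstsq(z, y2, rcond=None)[0]
v_hat = y2 - z @ delta_hat

# Step 2: probit of y on (z1, y2, v_hat); coefficients are scaled by 1/sqrt(1 - rho^2)
W = np.column_stack([z1, y2, v_hat])

def negll(b):
    p = np.clip(norm.cdf(W @ b), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

b_hat = minimize(negll, np.zeros(W.shape[1]), method="BFGS").x
print(b_hat)   # compare with (0.1, 1.0, 0.7, theta) / sqrt(1 - rho^2)
```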

IV Probit is the other approach.

The key in the IV probit is to obtain the joint density $f(y_1, y_2|z)$, since then we can get a likelihood function. By the properties of conditional densities, we know

$$f(y_1, y_2|z) = f(y_1|y_2, z)\cdot f(y_2|z)$$

Now $y_2|z \sim N(z_i'\delta, \sigma_v^2)$, so

$$f(y_2|z) = \phi\left(\frac{y_2 - z_i'\delta}{\sigma_v}\right)\Big/\sigma_v$$

Now our density of the binary random variable really comes from P (y1 = 1|y2, z). First, we need a result.

1. Suppose

$$\begin{bmatrix}u_i \\ v_i\end{bmatrix} \sim N\left(0, \begin{bmatrix}1 & \sigma_{uv}\\ \sigma_{uv} & \sigma_v^2\end{bmatrix}\right)$$

Then

$$u_i|v_i \sim N\left(\frac{\sigma_{uv}v_i}{\sigma_v^2},\; 1 - \frac{\sigma_{uv}^2}{\sigma_v^2}\right)$$

where we can denote the variance as $1 - \rho^2$.

Then

$$\begin{aligned}
P(y_1 = 1|y_2, z) &= P(z_{1i}'\beta + \alpha y_{2i} + u_i > 0\,|\,y_2, z)\\
&= P(z_{1i}'\beta + \alpha y_{2i} + u_i > 0\,|\,v_i, z)\\
&= P(u_i > -(z_{1i}'\beta + \alpha y_{2i})\,|\,v_i, z)\\
&= P\left(\frac{u_i - \frac{\sigma_{uv}v_i}{\sigma_v^2}}{\sqrt{1-\rho^2}} > \frac{-(z_{1i}'\beta + \alpha y_{2i} + \frac{\sigma_{uv}v_i}{\sigma_v^2})}{\sqrt{1-\rho^2}}\right)\\
&= \Phi\left(\frac{z_{1i}'\beta + \alpha y_{2i} + \frac{\sigma_{uv}v_i}{\sigma_v^2}}{\sqrt{1-\rho^2}}\right)\\
&= \Phi\left(\frac{z_{1i}'\beta + \alpha y_{2i} + \rho\sigma_v^{-1}(y_2 - z'\delta)}{\sqrt{1-\rho^2}}\right)\\
&\equiv \Phi(w_i)
\end{aligned}$$


Then our likelihood function becomes

$$L_i = \Phi(w_i)^{y_1}\,[1 - \Phi(w_i)]^{1-y_1}\,\phi\left(\frac{y_2 - z_i'\delta}{\sigma_v}\right)\Big/\sigma_v$$

and we can estimate this by MLE. We need estimates of β, α, ρ, σv, δ

2.3 Why TSLS doesn’t work here

Why not do 2SLS? This would be a linear projection (first stage)

$$y_2 = z'\delta + v$$

and then a probit of $y$ on $z_1$ and the fitted value $\hat{y}_2 = z'\hat{\delta}$. This doesn't work, for the following reason:

$$y = I(z_1'\beta + \alpha y_2 + u > 0) = I(z_1'\beta + \alpha(z'\delta + v) + u > 0) = I(z_1'\beta + \alpha z'\delta + w > 0)$$

And the variance of $w = \alpha v + u$ is

$$\mathrm{Var}(w) = \mathrm{Var}(\alpha v + u) = \alpha^2\sigma_v^2 + 1 + 2\alpha\sigma_{uv}$$

So if we did this form of probit we would estimate

$$\frac{\beta}{\sigma_w} \quad\text{and}\quad \frac{\alpha}{\sigma_w}$$

The problem is that $\sigma_w$ depends on $\sigma_{uv}$, which we cannot estimate since $u$ is unobserved, so we cannot recover $\beta$ and $\alpha$ from these scaled coefficients.

2.4 Binary Regressors

What if y2 is in fact Binary?

We have to assume now that

$$\begin{pmatrix}u \\ v\end{pmatrix} \sim N\left(0, \begin{bmatrix}1 & \sigma_{uv}\\ \sigma_{uv} & 1\end{bmatrix}\right)$$

We can build a probit for it. The likelihood of observing observation $i$ is now

$$f(y, y_2|z)$$

where there are 4 possibilities:


$$p_{00} = P(y=0, y_2=0|z), \quad p_{01} = P(y=0, y_2=1|z), \quad p_{10} = P(y=1, y_2=0|z), \quad p_{11} = P(y=1, y_2=1|z)$$

So the likelihood of observing this variable is

$$p_{00}^{(1-y)(1-y_2)}\,p_{01}^{(1-y)y_2}\,p_{10}^{y(1-y_2)}\,p_{11}^{y y_2}$$

where everything should carry an $i$ subscript. Now

$$p_{00} = P(u < -z_1'\beta - \alpha y_2,\; v < -z'\delta\,|\,z) = P(u < -z_1'\beta,\; v < -z'\delta\,|\,z) = \Phi_2(-z_1'\beta, -z'\delta, \rho)$$

(using $y_2 = 0$ in the first argument), where $\Phi_2$ is the CDF of a bivariate standard normal with correlation $\rho = \sigma_{uv}/1 = \sigma_{uv}$.

$$p_{01} = P(u < -z_1'\beta - \alpha y_2,\; v > -z'\delta\,|\,z) = P(u < -z_1'\beta - \alpha,\; v > -z'\delta\,|\,z)$$

$$= P(u < -z_1'\beta - \alpha\,|\,z) - P(u < -z_1'\beta - \alpha,\; v < -z'\delta\,|\,z) = \Phi(-(z_1'\beta + \alpha)) - \Phi_2(-(z_1'\beta + \alpha), -z'\delta, \rho)$$

We can also get

$$p_{10} = \Phi(-z'\delta) - p_{00}$$

$$p_{11} = 1 - p_{00} - p_{01} - p_{10}$$

And then run MLE to get estimates of $\beta$, $\alpha$, $\rho$ and $\delta$.
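A sketch of the four cell probabilities and the resulting log-likelihood for this binary-$y_2$ case, using `scipy.stats.multivariate_normal` for $\Phi_2$; the parameter layout and names are illustrative assumptions (and $\rho$ is kept in $(-1,1)$ via a tanh reparameterisation).

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def bivariate_probit_loglik(params, y, y2, z1, z):
    """Log-likelihood built from p00, p01, p10, p11 as in the notes.
    params = (beta (len k1), alpha, delta (len k), rho_raw)."""
    k1, k = z1.shape[1], z.shape[1]
    beta, alpha = params[:k1], params[k1]
    delta, rho = params[k1 + 1:k1 + 1 + k], np.tanh(params[-1])
    cov = np.array([[1.0, rho], [rho, 1.0]])

    def Phi2(a, b):
        # bivariate standard normal CDF with correlation rho, evaluated pointwise
        return np.array([multivariate_normal.cdf([ai, bi], mean=[0, 0], cov=cov)
                         for ai, bi in zip(a, b)])

    xb0 = -(z1 @ beta)               # upper limit for u when y2 = 0
    xb1 = -(z1 @ beta + alpha)       # upper limit for u when y2 = 1
    zd = -(z @ delta)                # upper limit for v (the y2 = 0 event)

    p00 = Phi2(xb0, zd)
    p01 = norm.cdf(xb1) - Phi2(xb1, zd)
    p10 = norm.cdf(zd) - p00
    p11 = 1 - p00 - p01 - p10
    P = np.select(
        [(y == 0) & (y2 == 0), (y == 0) & (y2 == 1),
         (y == 1) & (y2 == 0), (y == 1) & (y2 == 1)],
        [p00, p01, p10, p11])
    return np.sum(np.log(np.clip(P, 1e-12, 1.0)))
```

Passing the negative of this function to a numerical optimiser gives the MLE of $(\beta, \alpha, \delta, \rho)$.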

2.5 Binary Response Quantile Regression

y∗ = x′β + u

where Qu(τ) = 0, τ ∈ (0, 1).

y = I(y∗ > 0)

$y$ is a monotone increasing function of $y^*$. By equivariance of quantiles under monotone transformations, with $g(y^*) = I(y^* > 0)$ increasing in $y^*$,

$$Q_{y^*}(\tau) = x'\beta, \qquad Q_{g(y^*)}(\tau) = g(Q_{y^*}(\tau)) = g(x'\beta)$$


so $Q_y(\tau) = I(x'\beta > 0)$.

To estimate this we minimise

$$\min_\beta \sum_{i=1}^{N} C_\tau\left(y_i - I(x_i'\beta > 0)\right)$$

and Manski’s maximum score estimator is

$$\hat{\beta}_\tau = \arg\min_\beta \sum_{i=1}^{N} C_\tau\left(y_i - I(x_i'\beta > 0)\right)$$

When $\tau = 0.5$, $C_\tau$ is the absolute value function. This estimates the parameter of the conditional median without assumptions of normality and independence. The parameter is only identified up to scale, so we normalise $\beta_p = 1$.

One assumption we need is that at least one regressor with a non-zero coefficient is continuous.

The true parameter in the quantile is not defined; check Wooldridge.

2.6 Binary Response with Panel Data

We now have yit and xit and yit = I(x′itβ + uit > 0).

1. Strict exogeneity here is on the distribution:

$$f(y_{it}|x_i, c_i) = f(y_{it}|x_{it}, c_i)$$

where now $x_i$ stacks the regressors across time but $c_i$ is fixed across time. So the distribution depends only on the period-$t$ regressors $x_{it}$.

2. Conditional independence here is that

$$f(y_{i1}, \ldots, y_{iT}|x_i, c_i) = \prod_{t=1}^{T} f(y_{it}|x_i, c_i)$$

So using these two we get that

$$f(y_{i1}, \ldots, y_{iT}|x_i, c_i) = \prod_{t=1}^{T} f(y_{it}|x_i, c_i) = \prod_{t=1}^{T} f(y_{it}|x_{it}, c_i) = \prod_{t=1}^{T}\Phi(x_{it}'\beta + c_i)^{y_{it}}\left(1 - \Phi(x_{it}'\beta + c_i)\right)^{1-y_{it}}$$

using $P(y_{it} = 1|x_{it}, c_i) = \Phi(x_{it}'\beta + c_i)$.

But this depends on ci which is unobserved. Now

$$f(y_{i1}, \ldots, y_{iT}|x_i) = \int f(y_{i1}, \ldots, y_{iT}, c|x_i)\,dc = \int f(y_{i1}, \ldots, y_{iT}|x_i, c)\,f(c|x_i)\,dc$$


The RE assumption is that

$$c_i|x_i \sim N(0, \sigma_c^2)$$

A correlated RE assumption is that

$$c_i|x_i \sim N(\bar{x}_i'\gamma, \sigma_c^2)$$

where

$$\bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it}$$

This is weaker and more general.

So we can integrate $c$ out against the normal pdf. The MLEs are

$$(\hat{\beta}, \hat{\sigma}_c^2) = \arg\max \prod_{i=1}^{n} f(y_i|x_i)$$

Simulated maximum likelihood approximates the integration by averaging over random draws from the assumed distribution of $c$.

In nonlinear models, random effects means that we specify the distribution of $c_i$. Fixed effects means we do not have to be specific about it, and is thus weaker. The asymptotic distribution comes from the general theory of the MLE, i.e. taking derivatives and carrying out the convergence arguments.

For fixed effects, we do the fixed effects logit. With $T = 2$, this is computationally equivalent to doing a logit of $y_{i2}$ on $x_{i2} - x_{i1}$ using the observations with $N_i = 1$, as derived below.

The Panel logit model is

$$P(y_{it} = 1|x_{it}, c_i) = G(x_{it}'\beta + c_i)$$

where $G$ is the logistic function

$$G(x) = \frac{e^x}{1 + e^x}$$

Define a variable

$$N_i = \sum_{t=1}^{T} y_{it}$$

We want $f(y_i|x_i, c_i, N_i)$ (note $y_i$ here is the vector across $t$) not to depend on $c_i$. Take $T = 2$. Then the density for $N_i = 0$ and $N_i = 2$ is not informative, as

$$P(y_{i1} = 1|x_i, c_i, N_i = 0) = P(y_{i2} = 1|x_i, c_i, N_i = 0) = 0$$

$$P(y_{i1} = 1|x_i, c_i, N_i = 2) = P(y_{i2} = 1|x_i, c_i, N_i = 2) = 1$$

So we select out the observations for which Ni = 1


Now the idea is to get the joint density for these observations, i.e.

$$f(y_{i1}, y_{i2}|x_i, c_i, N_i = 1)$$

Now

$$P(y_{i1}=1, y_{i2}=0|x_i, c_i, N_i=1) = \frac{P(y_{i1}=1, y_{i2}=0, N_i=1|x_i, c_i)}{P(N_i=1|x_i, c_i)}$$

by the definition of conditional probability

$$= \frac{P(y_{i1}=1, N_i=1|x_i, c_i)}{P(N_i=1|x_i, c_i)}$$

The top term is just

$$P(y_{i1}=1, N_i=1|x_i, c_i) = P(y_{i1}=1, y_{i2}=0|x_i, c_i) = P(y_{i1}=1|x_i, c_i)\,P(y_{i2}=0|x_i, c_i)$$

by conditional independence of the $y_{it}$'s,

$$= G(x_{i1}'\beta + c_i)[1 - G(x_{i2}'\beta + c_i)] = \frac{\exp(x_{i1}'\beta + c_i)}{1+\exp(x_{i1}'\beta + c_i)}\left[1 - \frac{\exp(x_{i2}'\beta + c_i)}{1+\exp(x_{i2}'\beta + c_i)}\right] = \frac{\exp(x_{i1}'\beta + c_i)}{(1+\exp(x_{i1}'\beta + c_i))(1+\exp(x_{i2}'\beta + c_i))}$$

Eventually the ci will cancel.

The denominator is the sum of two probabilities and can be written

$$P(N_i = 1|x_i, c_i) = (1 - G(x_{i1}'\beta + c_i))G(x_{i2}'\beta + c_i) + G(x_{i1}'\beta + c_i)(1 - G(x_{i2}'\beta + c_i))$$

which, worked out, becomes

$$\frac{e^{x_{i2}'\beta + c_i} + e^{x_{i1}'\beta + c_i}}{(1 + e^{x_{i1}'\beta + c_i})(1 + e^{x_{i2}'\beta + c_i})}$$

so the whole expression becomes

$$\frac{e^{x_{i1}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i2}'\beta}} = \frac{e^{(x_{i1} - x_{i2})'\beta}}{1 + e^{(x_{i1} - x_{i2})'\beta}}$$

Lo and behold, the $c_i$ has cancelled.

On the other hand, $P(y_{i2} = 1|x_i, c_i, N_i = 1) = 1 - P(y_{i1} = 1|x_i, c_i, N_i = 1)$

$$= \frac{1}{1 + e^{(x_{i1} - x_{i2})'\beta}}$$

So our joint density is


$$f(y_{i1}, y_{i2}|x_i, c_i, N_i=1) = P(y_{i1}=1|x_i, c_i, N_i=1)^{y_{i1}}\,P(y_{i2}=1|x_i, c_i, N_i=1)^{y_{i2}}$$

From here we can get the log likelihood function.
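A sketch of that conditional (fixed effects) logit log-likelihood for $T = 2$: only pairs with $N_i = 1$ contribute, and each contribution is a logit in the differenced regressors $x_{i2} - x_{i1}$. The simulated data and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fe_logit_negloglik(beta, y1, y2, x1, x2):
    """Conditional logit for T=2: uses only pairs with N_i = y_i1 + y_i2 = 1."""
    keep = (y1 + y2) == 1
    dx = (x2 - x1)[keep]                  # x_{i2} - x_{i1}
    d = y2[keep]                          # indicator that the '1' occurs in period 2
    # P(y_i2 = 1 | N_i = 1) = exp(dx'b) / (1 + exp(dx'b))
    xb = dx @ beta
    return -np.sum(d * xb - np.log1p(np.exp(xb)))

# Example use (simulated data with an individual effect c_i that is never estimated):
rng = np.random.default_rng(3)
n, beta_true = 3000, np.array([1.0])
c = rng.normal(size=n)
x1, x2 = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
p1 = 1 / (1 + np.exp(-(x1 @ beta_true + c)))
p2 = 1 / (1 + np.exp(-(x2 @ beta_true + c)))
y1 = (rng.uniform(size=n) < p1).astype(float)
y2 = (rng.uniform(size=n) < p2).astype(float)

beta_hat = minimize(fe_logit_negloglik, np.zeros(1), args=(y1, y2, x1, x2)).x
print(beta_hat)   # close to 1.0 despite c_i cancelling out of the likelihood
```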

3 Multiple Value Responses

We discuss discrete value models, both

1. Ordered Responses

2. Unordered Responses

In an ordered response model

$$y \in \{0, 1, \ldots, J\}$$

Then we define cutoff values of $y^*$ corresponding to $y$: $\alpha_1, \ldots, \alpha_J$. If $y^* < \alpha_1$ then $y = 0$; for general $j$, $y = j$ when $\alpha_j < y^* \le \alpha_{j+1}$.

Then we assume

$$y^* = x'\beta + u$$

where $u|x \sim N(0, 1)$ and $x$ does not include an intercept.

Parameters to be estimated are $\alpha_1, \ldots, \alpha_J, \beta$, which is $J + k$ parameters. We can also do probit. Choice probabilities are

$$P(y = 0|x) = P(u < \alpha_1 - x'\beta|x) = \Phi(\alpha_1 - x'\beta)$$

Now

$$p_1(x) = P(y=1|x) = P(\alpha_1 < y^* \le \alpha_2|x) = P(\alpha_1 - x'\beta < u \le \alpha_2 - x'\beta|x) = \Phi(\alpha_2 - x'\beta) - \Phi(\alpha_1 - x'\beta)$$

and of course the analogous expression holds for $p_j(x)$. Lastly

$$p_J(x) = 1 - \Phi(\alpha_J - x'\beta)$$

When J = 1


p1(x) = Φ(x′β − α1)

and $\alpha_1$ plays the role of the intercept. Once we do this we get a likelihood function, and we can use MLE to estimate $\alpha$ and $\beta$.

The predicted choice for individual $i$ will be the $j$ which maximises the choice probability $P(y = j|x_i)$ (note you can compute all the probabilities once you have your estimated coefficients).
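A small sketch of the ordered-probit choice probabilities $p_j(x)$ and the predicted choice, given cutoffs and $\beta$ (the numerical values used here are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import norm

def ordered_probit_probs(X, beta, alphas):
    """p_j(x) for j = 0,...,J given cutoffs alpha_1 < ... < alpha_J."""
    cuts = np.concatenate([[-np.inf], alphas, [np.inf]])   # alpha_0 = -inf, alpha_{J+1} = +inf
    xb = X @ beta
    # P(y = j | x) = Phi(alpha_{j+1} - x'b) - Phi(alpha_j - x'b)
    cdfs = norm.cdf(cuts[None, :] - xb[:, None])
    return np.diff(cdfs, axis=1)

X = np.array([[0.2, 1.0], [-1.0, 0.5]])
beta = np.array([0.8, -0.3])
alphas = np.array([-0.5, 0.7])             # J = 2, so y takes values 0, 1, 2
probs = ordered_probit_probs(X, beta, alphas)
print(probs, probs.argmax(axis=1))          # rows sum to 1; argmax is the predicted choice
```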

3.1 Endogeneity

y∗ = βy2 + u

$$y_1 \in \{0, 1, \ldots, J\}$$

where $u$ and $y_2$ are correlated. Assume we have an instrument $z$ and a reduced form

y2 = z′γ + v

$$\begin{pmatrix}u \\ v\end{pmatrix} \sim N\left(0, \begin{bmatrix}1 & \sigma_{uv}\\ \sigma_{uv} & \sigma_v^2\end{bmatrix}\right)$$

independent of $z$. The control function approach does a linear projection

$$u = \theta v + e, \qquad \theta = \frac{\sigma_{uv}}{\sigma_v^2}$$

So

$$y_1^* = \beta y_2 + u = \beta y_2 + \theta v + e$$

where $e|z, v \sim N(0, 1 - \rho^2)$.

We do an ordered probit of $y_1$ on $y_2$ and $\hat{v}$ (the first-stage residual). This would consistently estimate $\beta/\sqrt{1-\rho^2}$; likewise the coefficient on $\hat{v}$ estimates $\theta/\sqrt{1-\rho^2}$. Since

$$\rho = \frac{\sigma_{uv}}{\sigma_v} = \theta\,\sigma_v$$

and we can estimate $\sigma_v$, we can solve for $\theta$ and $\rho$.

When y2 is binary, we use the following reduced form

y2 = I(z′γ + v > 0)


$$\begin{pmatrix}u \\ v\end{pmatrix}\Big|\,z \sim N\left(0, \begin{bmatrix}1 & \rho\\ \rho & 1\end{bmatrix}\right)$$

Now we calculate

f(y1, y2|z)

where y2 is binary and y1 can take multiple values.

Now we have

$$P(y_1 = j, y_2 = 0|z), \qquad P(y_1 = j, y_2 = 1|z)$$

for all the different j, and from that we can get the maximum likelihood estimator. For example,

$$P(y_1 = 1, y_2 = 0|z) = P(\alpha_1 < \beta y_2 + u < \alpha_2,\; z'\gamma + v < 0\,|\,z) = P(\alpha_1 < u < \alpha_2,\; z'\gamma + v < 0\,|\,z)$$

$$= P(u < \alpha_2,\; v < -z'\gamma\,|\,z) - P(u < \alpha_1,\; v < -z'\gamma\,|\,z)$$

and since $u$ and $v$ are jointly bivariate standard normal this is

$$= \Phi_2(\alpha_2, -z'\gamma, \rho) - \Phi_2(\alpha_1, -z'\gamma, \rho)$$

4 Multinomial Logit

1. The multinomial logit is used to model variables like occupational choice, transportation mode, etc., which come in discrete categories but where there is no obvious ordering of the categories.

2. We model the response probabilities as follows

$$p_j(x) = P(y = j|x) = \frac{\exp(x'\beta_j)}{1 + \sum_{h=1}^{J}\exp(x'\beta_h)}, \qquad j = 1, \ldots, J$$

Note the different indices on the $\beta$'s: there is a different vector for each category.

3. We need the probabilities to sum to 1, so we set the 0th category probability to

$$p_0(x) = P(y = 0|x) = \frac{1}{1 + \sum_{h=1}^{J}\exp(x'\beta_h)}$$

4. A simpler interpretation of these then comes from using the relative probabilities

$$\frac{p_j(x)}{p_0(x)} = \exp(x'\beta_j)$$


5. So βj can be given a simple interpretation in terms of the relative partial effect versus the baseline-

$$\frac{d}{dx_k}\left[\frac{p_j(x)}{p_0(x)}\right] = \beta_{jk}\exp(x'\beta_j)$$

The raw partial effects are complicated and ugly.

6. We can use this to write the log-likelihood function as

$$L = \sum_{i=1}^{n}\sum_{h=0}^{J} 1[y_i = h]\log(p_h(x_i))$$

(summing the contribution over observations $i$).
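A sketch of the multinomial-logit response probabilities and this log-likelihood; `betas` stacks one coefficient vector per non-baseline category, and all names and values are illustrative.

```python
import numpy as np

def mnl_probs(X, betas):
    """Response probabilities; betas has shape (J, k), category 0 is the baseline."""
    expxb = np.exp(X @ betas.T)                     # n x J matrix of exp(x'beta_j)
    denom = 1.0 + expxb.sum(axis=1, keepdims=True)  # 1 + sum_h exp(x'beta_h)
    return np.hstack([1.0 / denom, expxb / denom])  # columns j = 0, 1, ..., J

def mnl_loglik(betas_flat, y, X, J):
    """L = sum_i sum_h 1[y_i = h] log p_h(x_i), with y coded 0,...,J."""
    betas = betas_flat.reshape(J, X.shape[1])
    p = np.clip(mnl_probs(X, betas), 1e-12, 1.0)
    return np.sum(np.log(p[np.arange(len(y)), y]))

X = np.column_stack([np.ones(4), np.array([0.0, 1.0, -1.0, 2.0])])
betas = np.array([[0.2, 0.5], [-0.1, 1.0]])         # J = 2 non-baseline categories
print(mnl_probs(X, betas).sum(axis=1))               # each row sums to 1
```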

5 Tobit

Note a nice fact: if $z \sim N(0,1)$, then $E(z|z>c) = \frac{\phi(c)}{1-\Phi(c)}$. To see this, note the conditional density for $z > c$ is $\frac{\phi(z)}{1-\Phi(c)}$, and then integrate (using $\phi'(z) = -z\phi(z)$):

$$\int_c^{\infty}\frac{z\phi(z)}{1-\Phi(c)}\,dz = \frac{\phi(c)}{1-\Phi(c)}$$

5.1 Type I Tobit

1. The assumptions are:

(a) The observable variable is y = max(0,x′β + u)

(b) u|x ∼ N(0, σ2)

2. Since $u$ has unbounded support and is independent of $x$, $P(y = 0) > 0$. We often write the model using a latent variable $y^*$:

$$y^* = x'\beta + u, \qquad u|x \sim N(0, \sigma^2), \qquad y = \max(0, y^*)$$

3. Suppose $E(u|x) = 0$. Since $g(z) = \max(0, z)$ is convex, from Jensen's inequality

$$E(y|x) \ge \max(0, E(x'\beta + u|x)) = \max(0, x'\beta)$$

So we have a lower bound for $E(y|x)$.

4. When $u$ is independent of $x$ and has a normal distribution, we can find an explicit expression for $E(y|x)$. First note that

$$E(y|x) = E(y|x, y > 0)\,P(y > 0|x)$$

(a) Since $u|x \sim N(0, \sigma^2)$,

$$P(y > 0|x) = P\left(\frac{u}{\sigma} > -\frac{x'\beta}{\sigma}\right) = \Phi\left(\frac{x'\beta}{\sigma}\right)$$

(b) For the first term, we use the following lemma: if $z \sim N(0, 1)$, then

$$E(z|z > c) = \frac{\phi(c)}{1 - \Phi(c)}$$


Since $u|x \sim N(0, \sigma^2)$, $u/\sigma|x \sim N(0, 1)$. Hence

$$E(u|u > c) = \sigma E\left(\frac{u}{\sigma}\,\Big|\,\frac{u}{\sigma} > \frac{c}{\sigma}\right) = \sigma\frac{\phi(c/\sigma)}{1 - \Phi(c/\sigma)}$$

(c) So (using symmetry in the last step)

$$E(y|x, y > 0) = x'\beta + \sigma\frac{\phi(-x'\beta/\sigma)}{1 - \Phi(-x'\beta/\sigma)} = x'\beta + \sigma\frac{\phi(x'\beta/\sigma)}{\Phi(x'\beta/\sigma)}$$

which gives the desired expression, which is nonlinear:

$$E(y|x) = x'\beta\,\Phi(x'\beta/\sigma) + \sigma\phi(x'\beta/\sigma)$$

5. Taking derivatives of $E(y|x, y > 0)$ and $E(y|x)$ with respect to $x_j$ shows that partial effects do not depend solely on $\beta_j$, but are instead scaled by a factor less than 1. By the product rule,

$$\frac{\partial E(y|x)}{\partial x_j} = \frac{\partial P(y > 0|x)}{\partial x_j}\cdot E(y|x, y > 0) + P(y > 0|x)\cdot\frac{\partial E(y|x, y > 0)}{\partial x_j} = \Phi\left(\frac{x'\beta}{\sigma}\right)\beta_j$$

i.e. the partial effect of $x_j$ is $\beta_j$ times the probability that $y > 0$ given $x$, by a remarkable simplification.

6. This implies that full-sample OLS of $y$ on $x$ is not consistent, because the correct conditional mean is nonlinear in $x$, whereas OLS imposes a linear form.

7. Subsample OLS (keep only the observations above 0) is not consistent either. Write the artificial regression

$$y\,1(y > 0) = x'\beta\,1(y > 0) + \sigma\lambda\left(\frac{x'\beta}{\sigma}\right)1(y > 0) + e$$

If $E(e|x, y > 0) = 0$ then

$$E\left[e\,\Big|\,x1(y > 0),\; \lambda\left(\frac{x'\beta}{\sigma}\right)1(y > 0)\right] = 0$$

as well, since (from 4(c))

$$E(e|x, y > 0) = E(y|x, y > 0) - x'\beta - \sigma\lambda\left(\frac{x'\beta}{\sigma}\right) = 0$$

Again, the conditional mean is nonlinear in $x$, so subsample OLS of $y$ on $x$ alone will not be consistent.

8. So instead of OLS, we estimate the Tobit model using MLE:

$$f(y_i|x_i) = P(y_i = 0|x_i)^{1(y_i = 0)}\,f^*(y_i|x_i)^{1(y_i > 0)}$$

where $f^*$ is the density of the latent variable. Since $P(y_i = 0|x_i) = 1 - \Phi(x_i'\beta/\sigma)$ and $f^*(y_i|x_i) = \phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)\big/\sigma$, we have

$$f(y_i|x_i) = \left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{1(y_i = 0)}\left[\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)\Big/\sigma\right]^{1(y_i > 0)}$$
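A sketch of this Type I Tobit log-likelihood, parameterised in $(\beta, \log\sigma)$ so that $\sigma > 0$; the data are simulated and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X):
    """Negative Tobit log-likelihood: censored-at-0 terms use 1 - Phi(x'b/sigma),
    uncensored terms use the normal density phi((y - x'b)/sigma)/sigma."""
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    ll_zero = norm.logcdf(-xb / sigma)                       # log P(y = 0 | x)
    ll_pos = norm.logpdf((y - xb) / sigma) - np.log(sigma)   # log f*(y | x)
    return -np.sum(np.where(y > 0, ll_pos, ll_zero))

rng = np.random.default_rng(4)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y = np.maximum(0.0, X @ beta_true + sigma_true * rng.normal(size=n))

start = np.zeros(X.shape[1] + 1)
est = minimize(tobit_negloglik, start, args=(y, X), method="BFGS").x
print(est[:-1], np.exp(est[-1]))    # approximately (0.5, 1.0) and 1.5
```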

9. For Tobit, we need the distribution to satisfy

$$u|x \sim N(0, \sigma^2)$$

10. We can instead assume $\mathrm{Med}(u|x) = 0$, which allows correlation between $u$ and $x$. This is a weaker assumption. Then


$$\mathrm{Med}(y|x) = \max\{0, \mathrm{Med}(y^*|x)\} = \max\{0, x'\beta\}$$

Note that $y = \max(0, y^*)$ is an increasing function of $y^*$, and the median passes through increasing functions.

Then $\beta$ can be estimated by

$$\hat{\beta} = \arg\min_\beta \sum_{i=1}^{n}\left|y_i - \max\{0, x_i'\beta\}\right|$$

and this will be $\sqrt{n}$-consistent. This is called Powell's censored LAD estimator.

Alternatively, if we have a normal distribution on the errors we get

$$E(y|x) = \Phi(x'\beta/\sigma)\,x'\beta + \sigma\phi(x'\beta/\sigma)$$

which we can estimate through MLE.

5.2 Endogeneity

y∗i = x′β + u

yi = max(0, y∗i )

where $x$ and $u$ are correlated. We have an instrument $z$ which is uncorrelated with $u$. The reduced form gives

$$x = z'\gamma + v$$

and

$$\begin{pmatrix}u \\ v\end{pmatrix}\Big|\,z \sim N\left(0, \begin{bmatrix}\sigma_u^2 & \sigma_{uv}\\ \sigma_{uv} & \sigma_v^2\end{bmatrix}\right)$$

For the control function approach we have

$$u = \theta v + e, \qquad \theta = \frac{\sigma_{uv}}{\sigma_v^2}$$

and

$$y^* = x'\beta + \theta v + e, \qquad e|x, v \sim N(0, \sigma_e^2)$$


Now $e$ is independent of $x$ and $v$, and we just do the Tobit above with $v$ included as a regressor. Note

$$\sigma_e^2 = \sigma_u^2 - \theta^2\sigma_v^2$$

Tobit will give us consistent estimates of β, θ, σ2e .

Censored regression has

y∗ = x′β + u

y = min(c, y∗)

So we need

E(u|u < c)

We can do pretty much the same thing.


6 Sample Selection

6.1 Truncated / Censored Regressions

1. Let

$$y_i = \begin{cases} x_i'\beta + u_i & \text{if } S_i = 1\\ \text{not observed} & \text{if } S_i = 0\end{cases}$$

where $u_i|x_i \sim N(0, \sigma^2)$.

2. Missing at random assumption: $E(u_i|x_i, S_i) = 0$

3. OLS of $y_iS_i$ on $x_iS_i$ gives a consistent estimate of $\beta$ under missing at random. For example, if $S_i = 1\{a < x_i < b\}$ then $S_i$ depends only on $x_i$ (i.e. we are still conditioning only on $x_i$) and so OLS will be consistent:

4. Regressing $y_is_i$ on $x_is_i$ gives the OLS estimator

$$\hat{\beta} = \left(\sum x_ix_i's_i\right)^{-1}\left(\sum x_iy_is_i\right) = \beta + \left(\frac{1}{n}\sum x_ix_i's_i\right)^{-1}\left(\frac{1}{n}\sum x_iu_is_i\right)$$

which converges to $\beta$, since under the missing at random assumption $E(u_i|x_i, s_i) = 0$ and hence $E(x_iu_is_i) = 0$.

5. A problem arises when $S_i$ depends on $y_i$, because $y_i$ is correlated with $u_i$. For example, a censored/truncated regression is when

$$s_i = 1\{a < y_i < b\}$$

Then we don't have MAR, because

$$E(u|x, s = 1) = E(u|x, a < x'\beta + u < b) \ne 0$$


6. We can still do MLE with a censored regression:

$$f(y_i|x_i, s_i = 1) = \frac{f(y_i, s_i = 1|x_i)}{P(s_i = 1|x_i)}$$

The denominator is

$$P(s_i = 1|x_i) = P(a < y_i < b|x_i) = P\left(\frac{a - x_i'\beta}{\sigma} < \frac{u_i}{\sigma} < \frac{b - x_i'\beta}{\sigma}\,\Big|\,x_i\right) = \Phi\left(\frac{b - x_i'\beta}{\sigma}\right) - \Phi\left(\frac{a - x_i'\beta}{\sigma}\right)$$

For the numerator, note that

$$P(y_i < y, s_i = 1|x) = P(a < y_i < y|x) = P\left(\frac{a - x_i'\beta}{\sigma} < \frac{u}{\sigma} < \frac{y - x_i'\beta}{\sigma}\right) = \Phi\left(\frac{y - x_i'\beta}{\sigma}\right) - \Phi\left(\frac{a - x_i'\beta}{\sigma}\right)$$

Taking the derivative with respect to $y$ gives the density:

$$\frac{1}{\sigma}\phi\left(\frac{y - x_i'\beta}{\sigma}\right)$$

and so

$$f(y_i|x_i, s_i = 1) = \frac{\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)}{\sigma\left[\Phi\left(\frac{b - x_i'\beta}{\sigma}\right) - \Phi\left(\frac{a - x_i'\beta}{\sigma}\right)\right]}$$

7. Note that as $a \to -\infty$ and $b \to \infty$ the likelihood approaches the usual uncensored normal regression density:

$$f(y_i|x_i) = \phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)\frac{1}{\sigma}$$

8. Truncated Regression

$$E(y_i|x_i, s_i = 1) = x_i'\beta + E(u_i|x_i,\, a - x_i'\beta \le u_i < b - x_i'\beta)$$

9. Now, writing $a^* = a - x_i'\beta$ and $b^* = b - x_i'\beta$ and taking $\sigma = 1$ for simplicity,

$$E(u|x) = 0 = E(u|x, u < a^*)P(u < a^*) + E(u|x, a^* \le u < b^*)P(a^* \le u < b^*) + E(u|x, u \ge b^*)P(u \ge b^*)$$

$$= \frac{-\phi(a^*)}{\Phi(a^*)}\Phi(a^*) + E(u|x, a^* \le u < b^*)\left[\Phi(b^*) - \Phi(a^*)\right] + \frac{\phi(b^*)}{1 - \Phi(b^*)}(1 - \Phi(b^*)) \implies$$

$$E(u|x, a^* \le u < b^*) = \frac{\phi(a^*) - \phi(b^*)}{\Phi(b^*) - \Phi(a^*)}$$

6.2 Another selection indicator

1. Now let $y_i = x_i'\beta + u_i$ and the selection indicator be

$$s_i = 1\{z_i'\gamma + v_i > 0\}$$

where $x_i$ is a subset of $z_i$ and we assume that

$$\begin{pmatrix}u_i \\ v_i\end{pmatrix}\Big|\,z_i \sim N\left(0, \begin{bmatrix}\sigma_u^2 & \sigma_{uv}\\ \sigma_{uv} & 1\end{bmatrix}\right)$$


2. We can write $u$ as a linear projection on $v$:

$$u = \theta v + e$$

where $\theta = \sigma_{uv}$ since $\sigma_v^2 = 1$. Then (note that the expected value of $e$ conditional on $z$ and $v$ is 0)

$$E(y_i|x_i, s_i = 1) = x_i'\beta + E(u_i|z_i, v_i > -z_i'\gamma) = x_i'\beta + E(\theta v_i + e_i|z_i, v_i > -z_i'\gamma) = x_i'\beta + \theta E(v_i|v_i > -z_i'\gamma) = x_i'\beta + \theta\lambda(z_i'\gamma)$$

where $\lambda(c) = \phi(c)/\Phi(c)$ is the inverse Mills ratio.

3. Using the result from (2) we can write the artificial regression

$$y_i1\{s_i = 1\} = \underbrace{x_i'1(s_i = 1)}_{x_{1i}}\beta + \theta\underbrace{\lambda(z_i'\gamma)1(s_i = 1)}_{x_{2i}} + w_i$$

where $E(w_i|x_{1i}, x_{2i}) = 0$, because

$$E(w_i|x_{1i}, x_{2i}) = E[E(w_i|z_i, 1\{s_i = 1\})|x_{1i}, x_{2i}] = E[0|x_{1i}, x_{2i}] = 0$$

by the LIE.

4. Heckman's two-stage correction:

(a) Run a probit of $s_i$ on $z_i$ to get $\hat{\gamma}$

(b) Run $y_i1\{s_i = 1\}$ on $x_i1\{s_i = 1\}$ and $\lambda(z_i'\hat{\gamma})1\{s_i = 1\}$
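A sketch of this two-stage correction on simulated data: a probit of $s_i$ on $z_i$ (fitted here with a small MLE routine), then OLS of $y_i$ on $x_i$ and the inverse Mills ratio $\lambda(z_i'\hat{\gamma})$ using the selected sample. All names and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 5000
z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
x = z[:, :2]                                        # x is a subset of z
cov = np.array([[1.0, 0.6], [0.6, 1.0]])            # corr(u, v) = 0.6, sigma_v = 1
u, v = rng.multivariate_normal([0, 0], cov, size=n).T
s = (z @ np.array([0.2, 0.8, 1.0]) + v > 0).astype(float)
y = x @ np.array([1.0, 2.0]) + u                    # only used where s = 1

# Stage 1: probit of s on z to get gamma_hat
def probit_negll(g):
    p = np.clip(norm.cdf(z @ g), 1e-10, 1 - 1e-10)
    return -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
gamma_hat = minimize(probit_negll, np.zeros(z.shape[1]), method="BFGS").x

# Stage 2: OLS on the selected sample, adding the inverse Mills ratio lambda(z'gamma_hat)
lam = norm.pdf(z @ gamma_hat) / norm.cdf(z @ gamma_hat)
sel = s == 1
W = np.column_stack([x[sel], lam[sel]])
coef = np.linalg.lstsq(W, y[sel], rcond=None)[0]
print(coef)    # approximately (1.0, 2.0, theta) with theta = sigma_uv = 0.6
```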

6.3 Type II Tobit

1. The sample selection criterion is $s_i = 1\{y_{2i} > 0\}$,

where y2 is some other variable that depends on x.

2. Start by taking a linear projection of y2 on xi:

y2 = x′δ + v

so that we can rewrite the sample selection function as

si = 1{x′iδ + vi > 0}

3. Then the truncated regression (using the Heckman correction) is

$$y_is_i = \beta'\underbrace{x_is_i}_{z_{1i}} + \theta\underbrace{\lambda(x_i'\delta)s_i}_{z_{2i}} + w_i$$

where $E(w_i|z_{1i}, z_{2i}) = 0$, and we assume

$$\begin{pmatrix}u_i \\ v_i\end{pmatrix}\Big|\,x_i \sim N\left(0, \begin{bmatrix}\sigma_u^2 & \sigma_{uv}\\ \sigma_{uv} & 1\end{bmatrix}\right)$$

4. We can also estimate the parameters using MLE.


5. Partial MLE:

(a) The likelihood we are looking for is

$$f(y_i, s_i|x_i) = P(s_i = 1|y_i, x_i)\,f(y_i|x_i)$$

Take a linear projection of $v_i$ on $u_i$: $v_i = \theta u_i + e_i$, where $\theta = \frac{\sigma_{uv}}{\sigma_u^2}$ and $e_i \perp u_i$. Then

$$P(s_i = 1|y_i, x_i) = P(s_i = 1|u_i, x_i) = P(x_i'\delta + v_i > 0|u_i, x_i) = P(x_i'\delta + \theta u_i + e_i > 0|u_i, x_i)$$

$$= P\left(\frac{e_i}{\sigma_e} > \frac{-(x_i'\delta + \theta u_i)}{\sigma_e}\,\Big|\,u_i, x_i\right) = \Phi\left(\frac{x_i'\delta + \theta(y_i - x_i'\beta)}{\sigma_e}\right)$$

(b) This allows us to estimate $\beta$, $\delta$, $\sigma_u^2$, and $\sigma_{uv}$ using MLE.

6. The average partial effect (APE) is

$$E\left(\frac{\partial E(y_i|x_i)}{\partial x_i}\right)$$

the expectation over the distribution of $x_i$ of the partial-effect function, which we estimate by averaging the estimated partial effects over the sample.

6.4 Sample Selection Summary

1. Under the MAR assumption that E(ui|xi, si) = 0, we can simply do OLS.

2. When $s_i = 1\{a < y_i < b\}$, the conditional mean of $u_i$ is no longer 0 (MAR doesn't hold) and so OLS is not consistent. In this case, we have to use the truncated regression:

$$y_is_i = s_ix_i'\beta + \sigma\lambda(x_i'\beta/\sigma)s_i + w_i$$

where $\lambda(c) = \phi(c)/\Phi(c)$ and $E(w_i|x_i, s_i) = 0$.

3. When $s_i = 1\{y_{2i} > 0\}$, we have a Type II Tobit. We can either use the Heckman correction or take a linear projection of $y_2$ on $x$, rewrite the sample selection criterion based on $x$, and then run MLE.

7 Duration Models

1. We model the duration, or survival time, of an individual, conditional on a set of covariates. If we have the CDF of the duration $T$,

$$F(t) = P(T \le t), \qquad t \ge 0$$

with density $f(t)$, the survivor function is defined as

$$S(t) = 1 - F(t) = P(T > t)$$

2. We are interested in the object $P(t \le T < t+\varepsilon|T \ge t)$, the probability of failing in an interval given you have survived up until $t$. The hazard rate is defined as

$$h(t) = \lim_{\varepsilon\to 0}\frac{P(t \le T < t+\varepsilon|T \ge t)}{\varepsilon} = \lim_{\varepsilon\to 0}\frac{\varepsilon^{-1}P(t \le T < t+\varepsilon)}{P(T \ge t)} = \frac{f(t)}{S(t)}$$


It is the “instantaneous probability” of leaving at time t.

3. Of course, minus the hazard function is the derivative of $\log(S(t))$:

$$-h(t) = \frac{d}{dt}\log(S(t))$$

so integrating and rearranging gives

$$F(t) = 1 - \exp\left[-\int_0^t h(s)\,ds\right]$$

4. So we can write any probabilities in terms of the hazard function. For example, for $a_1 < a_2$,

$$P(T \ge a_2|T \ge a_1) = \frac{1 - F(a_2)}{1 - F(a_1)} = \exp\left[-\int_{a_1}^{a_2} h(s)\,ds\right]$$

5. And we can also derive

$$P(a_1 \le T < a_2|T \ge a_1) = \frac{F(a_2) - F(a_1)}{1 - F(a_1)} = \frac{\exp\left[-\int_0^{a_1}h(s)\,ds\right] - \exp\left[-\int_0^{a_2}h(s)\,ds\right]}{\exp\left[-\int_0^{a_1}h(s)\,ds\right]} = 1 - \exp\left[-\int_{a_1}^{a_2}h(s)\,ds\right]$$

which is useful for constructing log-likelihood functions.

6. If $h(t)$ is flat we say there is "duration independence." An example is the exponential distribution:

$$f(t) = \begin{cases}\lambda\exp(-\lambda t) & t > 0\\ 0 & t \le 0\end{cases}$$

Then $F(t) = 1 - \exp(-\lambda t)$ and the hazard rate is

$$h(t) = \frac{f(t)}{1 - F(t)} = \frac{\lambda\exp(-\lambda t)}{\exp(-\lambda t)} = \lambda$$

7. Positive duration dependence, when the hazard rate is increasing, means that the probability of exiting the initial state increases the longer one has been in the state.

8. If $T$ has a Weibull distribution, it has CDF and density

$$F(t) = 1 - \exp(-\gamma t^\alpha), \qquad f(t) = \alpha\gamma t^{\alpha-1}\exp(-\gamma t^\alpha)$$

So the hazard rate is

$$h(t) = \frac{\alpha\gamma t^{\alpha-1}\exp(-\gamma t^\alpha)}{\exp(-\gamma t^\alpha)} = \alpha\gamma t^{\alpha-1}$$

9. The Weibull reduces to the exponential when $\alpha = 1$. When $\alpha > 1$ there is positive duration dependence; when $\alpha < 1$, negative duration dependence.


10. The log-logistic has CDF and density

$$F(t) = 1 - \frac{1}{1+\gamma t^\alpha}, \qquad f(t) = \frac{\alpha\gamma t^{\alpha-1}}{(1+\gamma t^\alpha)^2}$$

so that the hazard rate is

$$h(t) = \frac{\alpha\gamma t^{\alpha-1}}{(1+\gamma t^\alpha)^2}\Big/\frac{1}{1+\gamma t^\alpha} = \frac{\alpha\gamma t^{\alpha-1}}{1+\gamma t^\alpha}$$
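A small sketch of these three hazard functions (exponential, Weibull, log-logistic), useful for checking the duration-dependence claims numerically; the parameter values are arbitrary illustrations.

```python
import numpy as np

def hazard_exponential(t, lam):
    return np.full_like(t, lam)                    # flat: duration independence

def hazard_weibull(t, gamma, alpha):
    return alpha * gamma * t**(alpha - 1)          # increasing if alpha > 1, decreasing if alpha < 1

def hazard_loglogistic(t, gamma, alpha):
    return alpha * gamma * t**(alpha - 1) / (1 + gamma * t**alpha)

t = np.linspace(0.1, 5, 50)
print(hazard_exponential(t, 0.5)[:3],
      hazard_weibull(t, 0.5, 1.5)[:3],
      hazard_loglogistic(t, 0.5, 1.5)[:3])
```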

7.1 Duration Models with Covariates

1. Two types of models: proportional duration and accelerated failure time (AFT)

2. Proportional duration model:

$$T_i = m(x_i, \beta)u_i$$

where $u_i$ is an error term which is standardized to $E u_i = 1$. Typically we use $m(x_i, \beta) = \exp(x_i'\beta)$.

3. Then

$$F(t) = P(T \le t) = P(m(x, \beta)u \le t) = P\left(u \le \frac{t}{m(x, \beta)}\right) = F_0\left(\frac{t}{m(x, \beta)}\right)$$

and

$$f(t) = f_0\left(\frac{t}{m(x, \beta)}\right)\frac{1}{m(x, \beta)}$$

which allows us to construct the log-likelihood function, where $F_0$ and $f_0$ are the CDF and density of $u$. By definition,

$$h(t) = \frac{f(t)}{1 - F(t)} = \frac{1}{m(x, \beta)}h_0\left(\frac{t}{m(x, \beta)}\right)$$

7.2 Proportional Hazard (PH)

1. For PH we model the hazard rate directly:

h(t) = m(x, β) · h0(t)

where h0(t) is baseline hazard.

$$F(t) = 1 - e^{-\int_0^t h(v)\,dv}$$

from above, which we can write as

$$F(t) = 1 - e^{-m(x,\beta)\int_0^t h_0(v)\,dv}$$

2. So the density is

$$f(t) = m(x, \beta)h_0(t)\,e^{-m(x,\beta)\int_0^t h_0(v)\,dv}$$

from which we can get the log-likelihood and estimate $\beta$.


7.3 Censored Data

1. When we are doing duration modeling, it is likely the data are right censored, since we stop tracking the subjects at some stage. Let there be observations $i = 1, \ldots, n$. The starting time for each observation is

$$a_i \in [0, b_i]$$

where $b_i$ is the time you stop tracking subjects. Then we observe

$$T_i = \min(T_i^*, c_i)$$

where $T_i^*$ is the true duration and $c_i = b_i - a_i$. Note that $c_i$ could be different for each observation if starting times are different but tracking ends for everyone at the same time.

2. Let $T_i^* \sim f^*(t|x)$. Let $d_i$ be a censoring indicator (0 if censored, 1 if uncensored). Then the conditional likelihood is

$$f(T_i|x_i) = \left(f^*(t_i|x_i)\right)^{d_i}\cdot\left(1 - F^*(c_i|x_i)\right)^{1-d_i}$$

where the second term is simply $P(T_i^* > c_i|x_i)$.

3. Once we have the estimate for $\beta$, we can estimate the hazard function. For example, for a proportional hazard model with a Weibull baseline, we have

$$h(t) = \exp(x_i'\beta)\,\alpha t^{\alpha-1}$$

4. We can estimate the shape parameter $\alpha$ through MLE as well.
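A sketch of the censored-duration likelihood in point 2 for a Weibull proportional hazard, $h(t) = \exp(x'\beta)\alpha t^{\alpha-1}$, so that $S(t) = \exp(-\exp(x'\beta)t^\alpha)$; `d` is the uncensored indicator, the data are simulated, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_ph_negloglik(params, t, d, X):
    """Censored-data log-likelihood: sum d*log f(t) + (1-d)*log S(t),
    with f(t) = h(t) S(t), h(t) = exp(x'b) * a * t^(a-1), S(t) = exp(-exp(x'b) t^a)."""
    beta, alpha = params[:-1], np.exp(params[-1])      # alpha > 0 via log parameterisation
    xb = X @ beta
    log_h = xb + np.log(alpha) + (alpha - 1) * np.log(t)
    log_S = -np.exp(xb) * t**alpha
    return -np.sum(d * log_h + log_S)

rng = np.random.default_rng(6)
n = 4000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, alpha_true = np.array([-0.5, 0.7]), 1.3
# Inverse-CDF draw: S(t) = exp(-exp(x'b) t^a)  =>  t = (-log U / exp(x'b))^(1/a)
tstar = (-np.log(rng.uniform(size=n)) / np.exp(X @ beta_true))**(1 / alpha_true)
c = rng.uniform(0.5, 3.0, size=n)                      # censoring times
t, d = np.minimum(tstar, c), (tstar <= c).astype(float)

est = minimize(weibull_ph_negloglik, np.zeros(3), args=(t, d, X), method="BFGS").x
print(est[:2], np.exp(est[2]))                          # approximately (-0.5, 0.7) and 1.3
```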

7.4 Kaplan - Meier Estimator of S

1. This is a nonparametric estimator of the survival function (without covariates) which allows censoring. When there is no censoring,

$$\hat{S}(t) = 1 - \hat{F}(t)$$

where $\hat{F}(t)$ is the empirical CDF $\frac{1}{n}\sum_{i=1}^{n}1(T_i \le t)$.

2. Let the times at which the event happens (i.e. someone finds a job) be indexed by

$$t_1 < t_2 < \ldots < t_K$$

3. Let $n_k$ be the number of observations "at risk" right before $t_k$; that is, the number of people to whom the event has not yet occurred right before $t_k$. So for example $n_1 = n$.

4. Then $n_k - n_{k+1}$ equals $d_k$ (the number of events that happen at $t_k$) plus the number of observations censored at $t_k$.

5. Then

$$\hat{S}(t) = \begin{cases}1 & \text{if } t < t_1\\ \prod_{\{k:\, t_k \le t\}}\dfrac{n_k - d_k}{n_k} & \text{otherwise}\end{cases}$$

6. Suppose there were no censoring. Then $\hat{S}(t)$ (evaluated just after $t_3$, say) is simply

$$\frac{n_1 - d_1}{n_1}\cdot\frac{n_2 - d_2}{n_2}\cdot\frac{n_3 - d_3}{n_3}$$

which, since $n_{k+1} = n_k - d_k$ when there is no censoring, telescopes to

$$\frac{n_3 - d_3}{n_1}$$


7. If there is censoring (for example at $t_1$) then we still have

$$\frac{n_1 - d_1}{n_1}\cdot\frac{n_2 - d_2}{n_2}\cdot\frac{n_3 - d_3}{n_3}\cdots$$

but now $n_2$ is the number of observations still at risk after removing those censored at $t_1$, so the product no longer telescopes.

8. Consider the case where we track 10 people for 10 weeks and there is censoring in the second week. The number of people who find a job each week (or something) is 1, 1, 2, 2, 3, 10, 10, 10.

$$\hat{S}(1) = P(T > 1) = \frac{8}{10}$$

$$\hat{S}(2) = P(T > 2) = \frac{8}{10}\cdot\frac{6}{10}$$

$$\hat{S}(3) = P(T > 3) = \frac{8}{10}\cdot\frac{6}{8}\cdot\frac{3}{4}$$

$$\hat{S}(4) = P(T > 4) = \frac{8}{10}\cdot\frac{6}{8}\cdot\frac{3}{4}\cdot\frac{3}{3}$$
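A sketch of the Kaplan-Meier product formula from point 5, taking duration data `t` and an uncensored indicator `d` (the example data and names are illustrative):

```python
import numpy as np

def kaplan_meier(t, d):
    """Return distinct event times t_k and S_hat(t_k) = prod_{j<=k} (n_j - d_j)/n_j."""
    order = np.argsort(t)
    t, d = t[order], d[order]
    event_times = np.unique(t[d == 1])
    surv, S = [], 1.0
    for tk in event_times:
        n_k = np.sum(t >= tk)                     # number still at risk just before t_k
        d_k = np.sum((t == tk) & (d == 1))        # events at t_k
        S *= (n_k - d_k) / n_k
        surv.append(S)
    return event_times, np.array(surv)

# Example: durations with some right-censored spells (d = 0 means censored)
t = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0])
d = np.array([1,   1,   0,   0,   1,   1,   1,   1,   0,   0  ])
print(kaplan_meier(t, d))
```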

8 Appendix

8.1 Wald Statistic

Suppose we have an estimator that satisfies

$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, V(\theta))$$

for some covariance matrix $V(\theta)$, under the null hypothesis. Suppose we want to test the restriction

$$H_0: h(\theta) = 0 \qquad H_1: h(\theta) \ne 0$$

where $h(\cdot)$ is a function that takes values in $\mathbb{R}^r$ and is continuously differentiable in a neighbourhood of the true value $\theta$, with $r \le k$.

Let $H(\theta) = \frac{\partial h(\theta)}{\partial\theta'} \in \mathbb{R}^{r\times k}$. By the delta method, under $H_0$,

$$\sqrt{n}\left(h(\hat{\theta}_n) - h(\theta)\right) = \sqrt{n}\,h(\hat{\theta}_n) \xrightarrow{d} N(0, H(\theta)V(\theta)H(\theta)') \equiv Z$$

By Slutsky's theorem, $H(\hat{\theta}_n) \xrightarrow{p} H(\theta)$. Suppose we have an estimator that satisfies $\hat{V}(\hat{\theta}_n) \xrightarrow{p} V(\theta)$. Then, again by Slutsky, $H(\hat{\theta}_n)\hat{V}(\hat{\theta}_n)H(\hat{\theta}_n)' \xrightarrow{p} H(\theta)V(\theta)H(\theta)'$.

We assume that $V(\theta)$ is nonsingular and that $H(\theta)$ has full row rank, so that $H(\theta)V(\theta)H(\theta)'$ is invertible. Then by the continuous mapping theorem

$$W_n = \sqrt{n}\,h(\hat{\theta}_n)'\left(H(\hat{\theta}_n)\hat{V}(\hat{\theta}_n)H(\hat{\theta}_n)'\right)^{-1}\sqrt{n}\,h(\hat{\theta}_n) \xrightarrow{d} Z'(H(\theta)V(\theta)H(\theta)')^{-1}Z$$

Note that since $H(\theta)V(\theta)H(\theta)'$ is symmetric (by properties of the transpose) and positive definite (by positive definiteness of $V(\theta)$ and full row rank of $H(\theta)$), it can be diagonalised, so in the usual way we can define powers of the matrix, and in particular $(H(\theta)V(\theta)H(\theta)')^{-1/2}$. Then let $S = (H(\theta)V(\theta)H(\theta)')^{-1/2}Z \sim N(0, I_r)$. It then follows that

$$Z'(H(\theta)V(\theta)H(\theta)')^{-1}Z = S'S \sim \chi^2_r$$

Our test based on the Wald statistic is to reject $H_0$ if $W_n > \chi^2_{r,1-\alpha}$, which by construction will have asymptotic significance level $\alpha$.
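A sketch of computing the Wald statistic for a generic estimator: given $\hat{\theta}$, $\hat{V}$, the restriction function $h$ and its Jacobian $H$, form $W_n$ and the $\chi^2_r$ p-value. The linear restriction and the numbers used below are made-up illustrations.

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, V_hat, n, h, H):
    """W_n = n h(theta)' (H V H')^{-1} h(theta), compared against chi^2_r."""
    hval = h(theta_hat)
    Hmat = H(theta_hat)
    middle = np.linalg.inv(Hmat @ V_hat @ Hmat.T)
    W = n * hval @ middle @ hval
    r = len(hval)
    return W, 1 - chi2.cdf(W, df=r)

# Example: test H0: theta_1 - theta_2 = 0 for some estimates (illustrative numbers)
theta_hat = np.array([0.52, 0.47, 1.10])
V_hat = np.diag([0.8, 0.9, 1.2])            # estimate of the asymptotic V(theta)
h = lambda th: np.array([th[0] - th[1]])
H = lambda th: np.array([[1.0, -1.0, 0.0]])
print(wald_test(theta_hat, V_hat, n=500, h=h, H=H))
```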
