nonlinear regression - rice university

60
Chapter 16 Nonlinear Regression 16.1 Model and Assumptions 16.1.1 Model In the prequel we have studied the linear regression model in some detail. Such models are linear with respect to the parameters given the explanatory variables and have additive disturbances. The function may be nonlinear in terms of the explanatory variable, as in a polynomial function, but with a simple redefinition of the variables is linear in the variables given the parameters. Strictly speaking such models are bilinear in the parameters and explanatory variables. In the space of functions of the explanatory variables that have parametric represen- tations this is a very small (measure zero) subspace. We now consider the more general class of models that are intrinsically non- linear. They are necessarily nonlinear with respect to the parameters since nonlinearity with respect to the variables only can be handled with the lin- ear model by a simple redefinition of variables. For identification reasons, we restrict our attention to functions that are additive with respect to the distur- bances. Notationally, such a model can be written y t = h(x t1 ,x t2 ,...,x tk ; θ 1 2 ,...,θ p )+ u t = h(x t , θ)+ u t = h t (θ)+ u t for t =1, 2,...,n, where h(·) is a scalar-valued function that is nonlinear with respect to the p × 1 vector of parameters θ. The k × 1 vector x t comprise the independent variables of the model, and may or may not enter nonlinearly. The use of subscript-t does not necessarily indicate time-series data but is used for notational economy. 223

Upload: others

Post on 18-Oct-2021

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Nonlinear Regression - Rice University

Chapter 16

Nonlinear Regression

16.1 Model and Assumptions

16.1.1 Model

In the prequel we have studied the linear regression model in some detail. Suchmodels are linear with respect to the parameters given the explanatory variablesand have additive disturbances. The function may be nonlinear in terms of theexplanatory variable, as in a polynomial function, but with a simple redefinitionof the variables is linear in the variables given the parameters. Strictly speakingsuch models are bilinear in the parameters and explanatory variables. In thespace of functions of the explanatory variables that have parametric represen-tations this is a very small (measure zero) subspace.

We now consider the more general class of models that are intrinsically non-linear. They are necessarily nonlinear with respect to the parameters sincenonlinearity with respect to the variables only can be handled with the lin-ear model by a simple redefinition of variables. For identification reasons, werestrict our attention to functions that are additive with respect to the distur-bances. Notationally, such a model can be written

yt = h(xt1, xt2, . . . , xtk; θ1, θ2, . . . , θp) + ut

= h(xt,θ) + ut

= ht(θ) + ut

for t = 1, 2, . . . , n, where h(·) is a scalar-valued function that is nonlinear withrespect to the p × 1 vector of parameters θ. The k × 1 vector xt comprisethe independent variables of the model, and may or may not enter nonlinearly.The use of subscript-t does not necessarily indicate time-series data but is usedfor notational economy.

223

Page 2: Nonlinear Regression - Rice University

224 CHAPTER 16. NONLINEAR REGRESSION

16.1.2 Some Examples

Some simple examples will make the meaning of the types of nonlinearitiesentertained more transparent. Consider the simple Cobb-Douglas productionfunction with multiplicative disturbances

Qt = AKαt L

βt ut

which can be transformed by taking logs on both sides to become

lnQt = lnA+ α lnKt + β lnLt + lnut

andqt = a+ αkt + βlt + εt

is a linear model with qt = lnQt, kt = lnKt, lt = lnLt, εt = lnut, and a = lnA.Thus although the model appears nonlinear it is linear after a transformationand redefinition of variables.

A closely related model that is not so easily handled is Cobb-Douglas withadditive errors:

Qt = AKαt L

βt + ut

Here any transformation to make the model linear in the parameters will alsoinvolve ut in a nonlinear fashion that also involves the parameters. It is intrin-sically nonlinear. Although this may seem like a rather artificial distinction,we learned in the misspecification chapter that distinguishing between additiveand multiplicative error forms can be important.

A generalization of the Cobb-Douglas model is the CES model, with additiveerrors

yt = βo[β1x−β2

t1 + (1− β1)x−β2

t2 ]−β3/β2 + ut

which is also intrinsically nonlinear as in the last paragraph. But the samemodel with multiplicative errors:

yt = βo[β1x−β2

t1 + (1− β1)x−β2

t2 ]−β3/β2 · ut

is now also intrinsically nonlinear, since taking logs will not make the nonlinearfunction linear in the parameters. There are a host of other models, in factanything other than the bilinear form, that will also be intrinsically nonlinear.

16.1.3 Basic Assumptions

The nonlinear regression equation can be considered the specification of themodel. In addition, we will make stochastic assumptions analogous to thoseintroduced for the linear model. For reasons that will become apparent below,we cannot avoid the problem of stochastic regressors so we introduce the firstfour assumptions for the conditional zero mean case used in Chapter 11

(i) E[ut] = 0(ii) E[u2t ] = σ2

Page 3: Nonlinear Regression - Rice University

16.2. NONLINEAR LEAST SQUARES (NLS) 225

(iii) (xt, ut) jointly i.i.d.

(iv) E[ut |xt ] = 0 and E[u2t |xt ] = σ2

Note that the unconditional moment assumptions (i) and (ii) are redundantgiven (iv) but are explicitly retained for completeness. The additional as-sumptions needed to obtain asymptotic behavior will be introduced below whenneeded.

16.1.4 Some Notation

In vector notation, the model can be written

y = h(θ) + u

where y = (y1, y2, ..., yn)′, u = (u1, u2, ..., un)′, and h(θ) = (h1(θ), h2(θ), ..., hn(θ))′

are all n× 1 vectors. We denote Θ ∈ Rp as the relevant parameter space for θand introduce the notation

[H(θ)]′ =

(∂h1(θ)

∂θ:∂h2(θ)

∂θ: . . . :

∂hn(θ)

∂θ

)Q(θ) = E

[∂ht(θ)

∂θ

∂ht(θ)

∂θ′

]M = E

[u2t∂ht(θ)

∂θ

∂ht(θ)

∂θ′

].

Rather obviously, the linear regression model is a very special case where h(θ) =Xβ, H(θ) = X, and θ = β. In addition, we will sometimes use the notationz = (y,x′)′ for all the observable variables.

16.2 Nonlinear Least Squares (NLS)

16.2.1 Least Squares

In the (p + 1)-dimensional space of yt and θ, the function ht(θ) defines a p-dimensional hyper-surface, given xt. Analogous to the linear model, we seekto find estimates that make the hyper-surface close to the values of yt by somemeasure. Specifically, we find θ such that

ψn(θ) = n−1n∑t=1

(yt − ht(θ))2

is minimized so

θ = argminθ∈Θ ψn(θ)

where, as mentioned above, Θ is the set from which we are choosing.

Page 4: Nonlinear Regression - Rice University

226 CHAPTER 16. NONLINEAR REGRESSION

As with the linear model, in practice, we find the estimates as the solutionto the first-order conditions

0 =∂ψn(θ)

∂θ= − 2

n

n∑t=1

(yt − ht(θ))∂ht(θ)

∂θ

= − 2

nH(θ)′(y − h(θ)).

Note that as a consequence of h(θ) being nonlinear in θ, H(θ) will also bea function of θ, and we must find the roots of a p-equation nonlinear system.This is the essential complication in the nonlinear model relative to the linear.

16.2.2 Gauss-Newton Method

In Newton’s method for solving a nonlinear equation, a linear approximation tothe equation is solved and use iteratively to obtain a solution to the nonlinearproblem. For the nonlinear least-squares problem we will take an analogousapproach. We first expand ht(θ) in a linear Taylor series (about some initialestimate θi) to obtain

h(xt,θ) = h(xt,θi) +

h(xt,θi)

∂θ(θ − θi) +Rit

where Rit is the remainder term. Substitution of this expression into the non-linear regression and rearranging yields

yt − h(xt,θi) =

h(xt,θi)

∂θ(θ − θi) +Rit + ut.

for t = 1, 2, ..., n.In vector/matrix notation we have

y − h(θi) = H(θi)(θ − θi) + Ri + u

where Ri is the vector of remainder terms for all observations. Treating θi asgiven, we define yi = y − h(θi), Hi = H(θi), δi = (θ − θi), and ui = Ri + u.We now can write the model in the form

yi = Hiδi + ui

which has every appearance of the linear regression model. Applying leastsquares to this linear model yields (assuming Hi has full column rank)

δi

= (Hi′Hi)−1Hi′yi

or

θ − θi = (H(θi)′H(θi))−1H(θi)′(y − h(θi))

Page 5: Nonlinear Regression - Rice University

16.2. NONLINEAR LEAST SQUARES (NLS) 227

where θ = θi + δi

represent least squares estimates of θ after linearizing thesystem around the initial values θi.

Define θi+1 = θ, then we can rearrange the least square estimation equationfor the linearized model as the iterative equations

θi+1 = θi + (H′H)−1H′(y − h)∣∣θi

where |θi indicates that H and h have been evaluated at θi. Note that at eachstep, we are simply regressing y − h(θi) on H(θi). Suppose we iterate untilθi+1 is unchanged from θi, then at convergence, θi+1 = θi and

0 = (H′H)−1H′(y − h)∣∣θi

or, since |H′H| 6= 0,0 = H′(y − h)|θi

and the first order conditions are satisfied. This procedure is called the Gauss-Newton procedure and is an application of Newton’s method since we are lin-earizing the first-order conditions and finding the solution and iterating.

Several practical issues arise in implementing the Guass-Newton procedureor any nonlinear interative estimator for that matter. First is the possibilitythat multiple roots may exist for the first-order conditions. In such cases boundscan sometimes be placed on the estimates on the basis of what is economicallyreasonable, such as values that lie between 0 and 1, for example. And in somemodels it is possible to state categorically that the objective function is globallyconcave so there is only one root, such as in variations on the binary probabilitymodel.

A second issue is the question of convergence. The iterations of Newton’smethod are known to converge at a quadratic rate in the neighborhood of thesolution if the first derivative of the function to be solved exists. Specifically,for θ scalar and θ the solution, this means (θi+1 − θ) = O((θi − θ)2) so if

(θi − θ) is small then (θi+1 − θ) is much smaller and convergence is assured.It is entirely possible that convergence will not occur in certain regions of theparameter space. An example of such a problematic region would occur if thefunction had a saddle point which was both a maximum in one direction anda minimum in another. The bounds discussed in the previous paragraph cansometimes be useful in ruling out such regions.

Sometimes initial values can be taken from related preliminary estimationapproaches. One frequently encounters models and estimators where the re-searcher starts from initial consistent estimates and then iterates. For example,we could use values from the linear regression of the log-linear form of the Cobb-Douglas model to obtain starting values for use in nonlinear least squares appliedto the nonlinear form. Such an approach is also useful in avoiding the multipleroot problem.

A third issue is the problem of obtaining derivatives. A first-best solutionwould be to have analytic derivatives that are explicitly evaluated in the iter-ations. Unfortunately, obtaining such derivatives and coding them into the

Page 6: Nonlinear Regression - Rice University

228 CHAPTER 16. NONLINEAR REGRESSION

algorithm is sometime infeasible, tedious, or very complicated. In such casesnumerical approximations to the derivatives may be used. It should be noted,however, that such an approach pays a price in terms of precision of the esti-mates and/or their estimated standard errors, which both use these approximatederivatives.

16.2.3 Consistency

We now turn to the properties of the estimator that come from the nonlinearleast squares exercise. It is clear from the above that the regressors in eachiteration, namely H(θi), are stochastic. As a result, the small sample prop-erties of the estimator are problematic and we must rely on the large-sampleasymptotic behavior for purposes of inference.

The following consistency result follows from verification of the high-levelconditions given in Theorem 4.5 from Chapter 4.

Theorem 16.1. Suppose (i) Θ compact; (ii) h(x;θ) continuous for θ ∈ Θwith probability 1; (iii) ∃ d(x) with (h(x;θ) − h(x;θo))2 < d(x) ∀ θ ∈ Θ and

E[d(x)] <∞; (iv) h(x;θ) = h(x;θo) for all x iff θ = θo, then θp−→ θo.

Proof: See Appendix to this chapter.

Strictly speaking, these low-level conditions are not assured, in general, andshould be verified for the nonlinear model at hand. The first two assumptionswill be relatively innocuous and easily verified for most models. The thirdassumption, which is used to prove uniform convergence in probability, can betedious to verify but will not usually present problems.

The fourth assumption assures global identifiability of the parameter vec-tor within Θ. This global identification means that we can obtain consistentestimates regardless of the starting values used. It may be unrealistic sinceit is possible that (h(x;θ) − h(x;θo)) = 0 has solutions other than θ0. Thiscondition may be weakened to local identification by substituting the condition:h(x;θ) = h(x;θo) iff θ = θo for all x and θ in a neighborhood of θo. In thiscase, consistency would follow if the initial value is itself consistent or is assuredto be in the neighborhood specified. This is a standard problem in nonlinearestimation.

16.2.4 Asymptotic Normality

The following asymptotic normality result follows from verifying the high-levelconditions of Theorem 4.6 from Chapter 4. Note that this result does notpresuppose any particular distribution of ut.

Theorem 16.2. Suppose (i)θp→ θo; (ii) θo interior to Θ; (iii) h(x;θ) twice

continuously differentiable in a neighborhood N of θo; (iv) ∃ D(z) s.t.∥∥∥∂2(y−h(x,θ))2

∂θ∂θ′

∥∥∥ <D(z) for all θ ∈ N and E[D(z)] < ∞; (v) E

[∂h(x,θ0)

∂θ∂h(x,θ0)∂θ′

]= Q exists and

is nonsingular, then√n(θ − θo) d−→N(0, σ2Q−1).

Page 7: Nonlinear Regression - Rice University

16.3. MAXIMUM LIKELIHOOD ESTIMATION 229

Proof: See Appendix to this chapter.

We do not need all the assumptions of the consistency proof only the result(i), which can occur under more general conditions. Condition (ii) is neededto exclude boundry points for which the limiting distribution cannot be normalsince it is truncated even if the distribution is collapsing. Condition (iii) isa regularity condition and easily verified. Condition (iv) is analogous to (iii)in the consistency theorem except now we are talking about estimating thematrix Q consistently. Condition (v) is needed for the limiting covariance tobe invertible and usually follows from local identification.

16.3 Maximum Likelihood Estimation

16.3.1 Normality and Likelihood Function

As with the linear model, there is a one-to-one relationship between least squaresestimation of θ and maximum likelihood estimation. Suppose

ut ∼ i.i.d. N(0, σ2)ind.xt

then

yt|xt ∼ N(h(xt,θ), σ2).

Due to independence we can write the joint density or likelihood function as

f(y|x,θ, σ2) = L(θ, σ2|y,x)

=1

(2πσ2)n/2exp

(− 1

2σ2

n∑t=1

(yt − ht(θ))2

)

and the log-likelihood function is

L(θ, σ2|y,x) = −n2

ln 2πσ2 − 1

2σ2

n∑t=1

(yt − ht(θ))2

= −n2

ln 2πσ2 − n

2σ2ψn(θ).

Obviously, maximizing this function with respect to θ is identical to minimizingψn(θ) with respect to θ. Thus, the maximum likelihood estimator of θ is θ, the

NLS estimator. Given the regularity conditions are met, this means that θ isCUAN and efficient within this class.

For estimating σ2, we find

σ2 =1

n

n∑t=1

(yt − ht(θ))2

= ψn(θ)

Page 8: Nonlinear Regression - Rice University

230 CHAPTER 16. NONLINEAR REGRESSION

is the maximum likelihood estimator of σ2. It can be shown to be generallyconsistent under the assumptions of Theorem 16.1 and asymptotically normalunder general conditions. Moreover, it will attain the efficiency bound withinthe class of CUAN estimators of σ2.

16.3.2 Newton’s Method

Consider the direct maximization of the log-likelihood using Newton’s method.Suppose θ is the MLE, then solving the first-order conditions yields

0 =∂ lnL(θ)

∂θ

= − n

2σ2

∂ψn(θ)

∂θ.

Expanding this in a linear Taylor’s series around an initial value, say θi, yields

0 = − n

2σ2

∂ψn(θi)

∂θ− n

2σ2

∂2ψn(θi)

∂θ∂θ′(θ − θi) + R.

Ignoring the remainder term and solving for θ = θi+1 yields

θi+1 = θi −

(∂2ψn(θi)

∂θ∂θ′

)−1∂ψn(θi)

∂θ

where

∂ψn(θi)

∂θ= − 2

n

∑ ∂ht(θi)

∂θ(yt − ht(θi))

∂2ψn(θi)

∂θ∂θ′= − 2

n

[n∑t=1

(yt − ht(θi))

∂2ht(θi)

∂θ∂θ′− ∂ht(θ

i)

∂θ

ht(θi)

∂θ′

.

]

Note that the weight matrix for this approach differs from the Gauss-Newtoncase.

16.3.3 Method of Scoring

A variation on Newton’s method that is frequently advocated is the method ofscoring. First we note that the weight matrix is an average of an i.i.d. randommatrix and will converge in probability to its expectation. At the target, thisexpectation can be written as the average of a simpler expression,

∂2ψn(θ0)

∂θ∂θ′p−→ E[2

∂ht(θ0)

∂θ

ht(θ0)

∂θ′] = 2Q.

Page 9: Nonlinear Regression - Rice University

16.3. MAXIMUM LIKELIHOOD ESTIMATION 231

In the method of scoring we estimate the expectation of this simpler expressiondirectly with an average and use that as the weight matrix. Thus we have

θi+1 = θi −

(2

n

n∑t=1

∂ht(θ)

∂θ

ht(θ)

∂θ′

)1

n

∂ψn(θi)

∂θ

= θi +

[n∑t=1

ht(θi)

∂θ

ht(θi)

∂θ′

]−1 n∑t=1

∂ht(θi)

∂θ(yt − ht(θ))

This is, of course, the Gauss-Newton method. Experience suggests that thismay be better behaved since it imposes restrictions of the model on the weightmatrix that follow from examination of the expectation.

16.3.4 Controlling Step-Size

The approach above essentially approximates the nonlinear minimization by aquadratic and then minimizes that. If the problem is truly quadratic it willconverge in one step. If not, then the step will hopefully move us toward theminimum. Sometimes it does not and instead overshoots the target or getsinto a cycle where no progress occurs. This suggests that other techniques beadopted. However, the above quadratic approximation is known to work quitewell near a unique minimum so it should be incorporated into any proposedsolution.

It is useful to think of the step and its size in terms of gradients. Now∂ψn(θ)∂θ is the gradient of the minimand and in the space of the parameters can be

thought of as a direction vector that tells us how much the minimand changes fora unit increase in each of the parameters. In the neighborhood of the minimumthis direction will point uphill away from the minimum and its negative will bethe direction of steepest descent for a unit step-size. Premultiplication by theweight matrix is introduced to allow for the second-order curvature and modifiesthe direction accordingly. It is the optimal step-size when we approximate theproblem by a quadratic.

The problem is that the derivative implicitly involves a unit step-size andthat can result in overshooting or even cycling if the function is locally verynon-quadratic. A solution is to introduce a step-size parameter

θi+1 = θi + λ(H′H)−1H′(y − h)∣∣θi

where 0 ≤ λ ≤ 1. There is a great deal of artwork in the choice of λ. Formany problems the unit step-size works well. If it does not a simple solution isto use the halving step-size algorithm. In this approach, you start with λ = 1and if the proposed step-size does not result in a reduction in the minimand youuse λ = .5 and try again. You continue with this halving until a reduction isobtained. If a reduction cannot be obtained then the minimand has a flat spotand the minimum is not unique.

Page 10: Nonlinear Regression - Rice University

232 CHAPTER 16. NONLINEAR REGRESSION

16.4 Inference

16.4.1 Standard Normal Tests

First we consider testing a scalar null hypothesis. Now, the convergent iterateθ will, by Theorem 16.2, satisfy

√n(θ − θo) d−→ N(0, σ2Q−1)

where

Q(θ) = E

[∂ht(θ)

∂θ

∂ht(θ)

∂θ′

]Q = Q(θ0).

Thus, √n(θj − θoj )√σ2 [Q−1]jj

d−→ N(0, 1)

where[Q−1

]jj

indicates, as in previous chapters, the jj-th element of the brack-

eted matrix.A null hypothesis might provide θoj which leaves us needing estimates of σ2

and Q. An obvious estimator of σ2 is

σ2 =1

n

n∑t=1

(yt − h(xt, θ))2.

= ψn(θ)

An alternative is

s2 =1

n− p

n∑t=1

(yt − h(xt, θ))2

=n

n− pσ2.

A natural estimator for Q is

Q =1

nH(θ)′H(θ)

=1

n

n∑t=1

∂ht(θ)

∂θ

ht(θ)

∂θ′

which is the weight matrix and is provided as a byproduct of Gauss-Newtonestimation.

Consistency of the variance estimator should follow from the conditions ofTheorem 16.1. We need an additional smoothness assumption to guaranteeconsistency of the covariance matrix estimator. Specifically, we have

Page 11: Nonlinear Regression - Rice University

16.4. INFERENCE 233

Theorem 16.3. If the hypotheses of Theorems 16.1 and 16.2 are satisfied and

∃ δ(z) s.t.∥∥∥∂h(x,θ)∂θ

∥∥∥2 < δ(z) for all θ in a neighborhood N of θ0 and E[δ(z)] <

∞, then σ2 →p σ2, s2 →p σ

2 and Q→p Q.

Proof: See Appendix to this chapter.

Thus, a consistent estimator of the limiting covariance matrix is s2Q−1.Substituting this into the above ratio, we find

√n(θj − θoj )√s2[Q−1

]jj

=(θj − θoj )√

s2[H(θ)′H(θ)

]jj

d−→ N(0, 1),

under the null hypothesis H0 : θj = θoj . Note that the denominator of the middleexpression above is the usual least-squares calculated standard error from thefinal interation.

Under an alternative hypothesis, say H1 : θj = θ1j 6= θoj , we have

√n(θj − θoj )√s2[Q−1

]jj

=

√n(θj − θ1j )√s2[Q−1

]jj

+

√n(θ1j − θoj )√s2[Q−1

]jj

= N(0, 1) +

√n(θ1j − θoj )√σ2 [Q−1]jj

+ op(1)

so we see that the statistic will diverge in the positive/negative direction at therate

√n depending on whether (θ1j − θoj ) is positive/negative. The test will

therefore be consistent.

16.4.2 Likelihood Ratio Statistics

Suppose that we seek to test the null hypothesis H0 : r(θ0) = 0 where r(·) is aq × 1 continuously differentiable vector function. Since we have the maximumlikelihood estimator we can directly apply the results of Chapter 5. There weintroduced the trinity of tests for a vector restriction. They were the likelihoodratio, Wald and Lagrange multiplier tests. It is instructive to present thesetests in some detail for the present case as with the linear regression model.

First, we consider the likelihood ratio statistic. Suppose θ denotes theunrestricted maximum likelihood estimator used in the Wald-type test and θ isthe restricted maximum likelihood estimator, then r(θ) = 0. Let u = y−h(θ)

and u = y − h(θ) denote the unrestricted and restricted residuals, then σ2 =u′u/n and σ2 = u′u/n are the unrestricted and restricted maximum likelihoodestimators of σ2.

Now substitution of σ2 into the unrestricted log-likelihood function yields

Page 12: Nonlinear Regression - Rice University

234 CHAPTER 16. NONLINEAR REGRESSION

the concentrated unrestricted log-likelihood function

Lu = −n2

ln 2πσ2 − 1

2σ2ψn(θ)

= −n2

ln 2πσ2 − 1

2σ2e′ueu

= −n2

ln 2π − n

2ln σ2 − n

2.

Similarly the concentrated restricted log-likelihood is

Lr = −n2

ln 2π − n

2ln σ2 − n

2.

Thus we have the likelihood ratio statistic and its usual limiting distribution

LR = 2(Lu − Lr)

= 2(−n2

ln σ2 +n

2ln σ2)

= n lnσ2

σ2

d−→ χ2q.

There is a close relationship between the LR statistic and the familiar F -statistic comparing restricted and unresticted sums-of-squares. Note that

σ2

σ2=

u′u/n

u′u/n= 1 +

u′u− u′u

u′u

= 1 +q

n− p(u′u− u′u)/q

u′u/(n− p)

= 1 +q

n− pF

The likelihood ratio statistic is obviously a nonlinear monotonic function of theF -statistic. And in the limit the relationship becomes linear. Expand thestatistic as a function of σ2 about σ2 to obtain

LR = n1

σ2/σ2

1

σ2(σ2 − σ2) +Op(n(σ2 − σ2)2)

= n(σ2 − σ2)

σ2+ op(1)

=u′u− u′u

σ2+ op(1)

since (σ2 − σ2) = Op(1/n) under the null. If we divided the numerator ofthe leading term by the number of restrictions, we would have the nonlinearregression analog to the F -statistic in the linear regression model except we areusing the restricted estimator of σ2 in the denominator, which will make nodifference asymptotically, under the null.

Page 13: Nonlinear Regression - Rice University

16.4. INFERENCE 235

16.4.3 Wald-Type Tests

The asymptotic normality result may be utilized to obtain a quadratic formbased test of the Wald type. Using the delta method we find that

√nr(θ)

d−→ N(0, σ2RQ−1R′)

and

nr(θ)′[σ2RQ−1R′]−1r(θ)d−→ χ2

q

where R(θ) = ∂r(θ)/∂θ′, R = R(θ0), and RQ−1R′ nonsingular. Consistent

estimates of σ2 and Q are introduced above, and R = R(θ) will be consistentfor R. Thus, under the null hypothesis, we have the feasible Wald-type statistic

W = nr(θ)′[σ2RQ−1R′]−1r(θ)

= r(θ)′[σ2R(H′H)−1R′]−1r(θ)d−→ χ2

q.

Under the alternative hypothesis H1 : r(θ0) 6= 0, the statistic will diverge inthe positive direction at the rate n.

It is also possible to restate the Wald-type test in terms of sums-of-squaresof the restricted and unrestricted models and thereby see the close connectionbetween LR and W tests. Without loss of generality, we can reparameterizethe model so the null restriction is a zero restriction. Let θ1 denote the firstp − q elements of the parameter vector and θ2 the final q elements. Letθ∗2 = r(θ1,θ2) and suppose the q × 1 restriction function is one-to-one with θ2.Then we can obtain the inverse function θ2 = r−1(θ1,θ

∗2) and the regression

equation becomes

yt = h(xt,θ1,θ2) + ut

= h(xt,θ1, r−1(θ1,θ

∗2)) + ut

= h∗(xt,θ1,θ∗2) + ut

and the null hypothesis becomes H0 : θ∗2 = 0. Note that this renormalizationof the restriction will leave the likelihood ratio statistic unchanged since MLE isinvariant to normalizations. Also note that R = (0, Iq), the derivative matrixfor the renormalized restrictions, will now be a constant and nonstochastic.

Using the approach of Section 8.4.3 for the linear model on the nonlinear

Page 14: Nonlinear Regression - Rice University

236 CHAPTER 16. NONLINEAR REGRESSION

model we find

r(θ)′[σ2R(H′H)−1R′]−1r(θ)=θ

′2[σ2(H′2M1H2)−1]−1θ2

=1

σ2θ′2H′2M1H2θ2

=1

σ2θ′2H′2M1M1H2θ2

=1

σ2(M1u− u)

′(M1u− u) + op(1)

=1

σ2(u′M1u− 2u′M1u + u′u) + op(1)

=1

σ2(u′M1u− u′u) + op(1)

=1

σ2(u′u− u′u) + op(1)

where H = (H1 : H2), M1 = In−H1(H′1H1)−1H′1 and expansion of h(θ) aboutθ0 yields

M1u = M1(y − h(θ))

= M1(y − (h(θ0) + H1(θ1 − θ

0

1)+H2θ2 +Op(∥∥∥θ − θ0∥∥∥2)))

= M1u + M1H2θ2 + op(1).

Thus the leading behavior of the W statistic is given by

1

σ2(u′u− u′u)

d−→ χ2q

which, if divided by q, is the nonlinear analog to the F -statistic and differsfrom the LR only in the denominator. In fact, as asserted more generally inChapter 5, the W and LR tests, beyond having the same limiting distribution,are asyptotically identical in the sense that they will accept and reject together,under the null, in large samples.

16.4.4 Lagrange Multiplier Test

The Lagrange multiplier test examines the unrestricted scores (first derivativesof the log-likelihood function) evaluated at the restricted estimates. If therestriction is correct then the average of the scores will converge in probabilityto zero and appropriately normalized be asymptotically normal. For this modelwe have

∂ lnL(θ)

∂θ== − 1

2σ2

∂ψn(θ)

∂θ.

For simplicity, we consider the case where the restrictions have been rewritten

in zero form as discussed above, so θ = (θ′1,0′)′.

Page 15: Nonlinear Regression - Rice University

16.4. INFERENCE 237

Now, by definition, the restricted estimates yield

0 =∂ lnL(θ)

∂θ1

=1

σ2

n∑t=1

(yt − ht(θ))∂ht(θ)

∂θ1,

so for the LM statistic, we only consider the final r elements of the score

∂ lnL(θ)

∂θ2=

1

σ2

n∑t=1

(yt − ht(θ))∂ht(θ)

∂θ2.

Premultiplying both of the above scores by 1/√n and expanding around θ01

yields repectively

0 =1

σ2[

1√n

n∑t=1

ut∂ht(θ

0)

∂θ1

+1

n

n∑t=1

(ut∂2ht(θ

0)

∂θ1∂θ′1

− ∂ht(θ0)

∂θ1

∂ht(θ0)

∂θ′1

)√n(θ1 − θ01) + op(1)]

=1

σ2

1√n

n∑t=1

ut∂ht(θ

0)

∂θ1− 1

σ2Q11

√n(θ1 − θ01) + op(1)

and

1√n

∂ lnL(θ)

∂θ2=

1

σ2[

1√n

n∑t=1

ut∂ht(θ

0)

∂θ2

+1

n

n∑t=1

(ut∂2ht(θ

0)

∂θ2∂θ′1

− ∂ht(θ0)

∂θ2

∂ht(θ0)

∂θ′1

)√n(θ1 − θ01) + op(1)]

=1

σ2

1√n

n∑t=1

ut∂ht(θ

0)

∂θ2− 1

σ2Q21

√n(θ1 − θ01) + op(1).

Solving the first for√n(θ1−θ01) and substituting into the second yields

1√n

∂ lnL(θ)

∂θ2=

1

σ2[

1√n

n∑t=1

ut∂ht(θ

0)

∂θ2

+1

n

n∑t=1

(ut∂2ht(θ

0)

∂θ2∂θ′1

− ∂ht(θ)

∂θ2

∂ht(θ)

∂θ1

)√n(θ1 − θ01) + op(1)]

−→d N(0,1

σ2Q22)

where Q22 = (Q22 −Q21Q−111 Q′21). And

1

n

∂ lnL(θ)

∂θ′2σ2Q

−122

∂ lnL(θ)

∂θ′2−→d χ

2q.

Page 16: Nonlinear Regression - Rice University

238 CHAPTER 16. NONLINEAR REGRESSION

For Q22 = 1nH′2M1H2 −→p Q22 and tilde indicates the restricted estimates

were used to evaluate the matrix, consider the feasible analog

1

n

∂ lnL(θ)

∂θ′2σ2Q

−122

∂ lnL(θ)

∂θ′2=

1

n

1

σ2u′M1H2[σ2(

1

nH′2M1H2)−1]

1

σ2H′2M1u

=1

σ2u′M1H2(H′2M1H2)−1H′2M1u

=1

σ2[u′M1u− u′(M1 − M1H2(H′2M1H2)−1H′2M1)u]

=1

σ2(u′u− u′u)+op(1)

under the null hypothesis. The leading term is the same as that of the likelihoodratio statistic.

16.5 Example

We consider the estimation of a prodution function. Specifically, we considerthe generic problem

Qi = h(Ki, Li, ui; θ)

where Qi is firm output, Ki is firm input of capital services, Li is firm input oflabor services, ui is a random error, and θ is the parameter vector. The datafor the problem are 25 observations taken from Kmenta.

We first entertain the possibility that the specification should be Cobb-Douglas with multiplicative errors. In this case, we have

Qi = θ1Kθ2i L

θ3i e

ui

or in log-linear formqi = a+ αki + βli + ui

where qi = lnQi, ki = lnKi, and li = lnLi. The regression results for this dataare:

qi = 2.4811(0.1286)

+ 0.6401(0.0347)

ki + 0.2573(0.02696)

li + ei.

Since there is a possible issue with respect to whether the error is multiplicativeor additive, and hence the possibility of heteroskedasticity, we calculate theWhite-test for the regression and obtain W = 3.1106. This yields a p-value of.68294 and there seems to be no evidence that heteroskedasticity impacts theinferences.

It is of interest in any production function whether constant returns to scaleapply. For this specification, this implies α+ β = 1. For the log-linear modelthis can best be tested following a reparameterization:

(qi − li) = a+ α(ki − li) + (α+ β − 1)li + ui

= a+ α(ki − li) + β∗li + ui.

Page 17: Nonlinear Regression - Rice University

16.5. EXAMPLE 239

The null hypothesis of α+β = 1 becomes β∗ = 0 for the reparameterized model.The regression results for this model are

(qi − li) = 2.4811(0.1286)

+ 0.6401(0.0347)

(ki − li)− 0.1026(0.0503)

li + ui.

The t-ratio for β∗ for testing its’ being zero is -2.037, which is significant if weconsult the asymptotically appropriate standard normal tables. If we consultthe tables for the t-distribution it is not quite significant. So there is marginalevidence of non-constant returns to scale.

Now we turn to the Cobb-Douglas model with additive errors:

Qi = AKαi L

βi + ui.

Estimation by NLS yields

Qi = 11.366K .6307i L.2753i + ei

with SSE = 1258.06 and the standard errors of the coefficient estimates re-spectively (1.620, 0.380, 0.0282). Note that coeffient estimates and the t-ratiosfor α and β are very similar to those for the log-linear model. The White-testyields W = 10.087, which has a p-value of .0728. This is not significant butdoes indicate that there is weak evidence of heteroskedasticity and hence thenon-linear model might have a misspecified error term. On this basis one mightprefer the log-linear model.

A test for constant returns to scale can be also be conducted for the non-linear model. We reparameterize to obtain

Qi = A(Ki/Li)αLα+βi + ui

= A(Ki/Li)αLβ

i + ui.

The null that α+ β = 1 therefore becomes α∗ = 1. Performing this regressionyields the same estimates and standard errors for A and α, while the estimateand standard error for β∗ are 0.9060 and 0.05268. A t-ratio for testing the nullthat β∗ = 1 is −1.7845, which is not significant for a two-sided alternative.

Finally, we consider estimation of the more general CES production function.Specifically, we consider

Qi = θ1(θ2Kθ3i + (1− θ2)Lθ3i )θ4/θ3 + ui.

Estimation by NLS yields

Qi = 11.213(0.4053K−0.5963i + (1− 0.4053)L−0.5963i )−0.82718/0.5963 + ei

with SSE = 979.958 and the standard errors of the coefficient estimates re-spectively (1.426, 0.1436, 0.3017, 0.0560). The White test is 16.93, which ismarginally significant and indicates that heteroskedasticity may be compromis-ing the inferences on the coefficients.

Page 18: Nonlinear Regression - Rice University

240 CHAPTER 16. NONLINEAR REGRESSION

Constant returns to scale for this model implies θ3 = 1, which is not re-jected when we calculate the appropriate t-ratio. The CES approaches theCobb-Douglas model in the limit as θ3 → 0. The CES is not well-defined inthis limiting case, however, some indication of which model works best can beobtained by forming the likelihood-ratio statistic. We have

LR = n ln(σ2

σ2)

= 25 ln(57.1844

46.665)

= 5.082

which is significant at the 5% level for a χ21, which is the appropriate limiting

distribution under the null. This indicates that the CES is preferable to theCobb-Douglas.

16.A Appendix

We will first prove consistency by verifying the conditions of Theorem 4.5 fromthe Appendix of Chapter 4. For purposes of the proof below, we define ψn(θ) =n−1

∑ni=1(yt − h(xt,θ))2 and ψ0(θ) = E[(y − h(x,θ))2] for θ ∈ Θ.

Proof of Theorem 16.1. Now (C1), compactness of Θ, is assured by condition(i). Adding and subtracting h(xt,θ

0) inside the square for ψ0(θ), expandingthe square, and taking expectations yields

ψ0(θ) = E[(y − h(x,θ0)) + (h(x,θ0)− h(x,θ))]2

= σ2 + E[(h(x,θ0)− h(x,θ))2].

Identification (C4) follows since, by condition (iv) the expectation will be zeroonly if θ = θ0. And (C3), continuity of ψ0(θ), is assured since continuity ofh(x,θ) and condition (iii) mean Newey and McFadden’s Lemma 2.4, presentedin the Appendix to Chapter 5, is satisfied for a(z,θ) = (h(x,θ0)− h(x,θ))2.

This leaves us with the uniform convergence condition (C2) to verify. Adding

and subtracting h(xt,θ0) inside the square in ψn(θ) and expanding the square

yields

ψn(θ) = n−1n∑t=1

[(yt − h(xt,θ0)) + (h(xt,θ

0)− h(xt,θ))]2

= n−1n∑t=1

u2t − n−12

n∑t=1

ut(h(xt,θ)− h(xt,θ0))

+ n−1n∑t=1

(h(xt,θ0)− h(xt,θ))2.

Page 19: Nonlinear Regression - Rice University

16.5. EXAMPLE 241

Thus, we have

ψn(θ)− ψ0(θ) = n−1n∑t=1

(u2t − σ2) + n−12

n∑t=1

ut(h(xt,θ)− h(xt,θ0))

+ n−1n∑t=1

(h(xt,θ0)− h(xt,θ))2 − E[(h(x,θ0)− h(x,θ))2]

= n−1n∑t=1

gt + n−1n∑t=1

bt + n−1n∑t=1

ct

where the definitions of gt, bt, and ct are obvious. Now, by the triangle inequal-ity,

|ψn(θ)− ψ0(θ)| ≤

∣∣∣∣∣n−1n∑t=1

gt

∣∣∣∣∣+

∣∣∣∣∣n−1n∑t=1

bt

∣∣∣∣∣+

∣∣∣∣∣n−1n∑t=1

ct

∣∣∣∣∣and

supθ∈Θ|ψn(θ)− ψ0(θ)| ≤ sup

θ∈Θ

∣∣∣∣∣n−1n∑t=1

gt

∣∣∣∣∣+ supθ∈Θ

∣∣∣∣∣n−1n∑t=1

bt

∣∣∣∣∣+ supθ∈Θ

∣∣∣∣∣n−1n∑t=1

ct

∣∣∣∣∣ .Clearly, the first term converges in probability to zero since it is not a functionof θ and is the average of an i.i.d. variable with expectation zero. And thethird term converges in probability to zero by continuity and condition (iii) andthe Lemma as above.

This leaves the second term. Let a(zt,θ) = 2ut(h(xt,θ) − h(xt,θ0)) = bt,

which is continuous at each θ ∈ Θ for zt = (x′t, ut)′. Now, by condition (iii),

|a(z,θ)| =∣∣2u(h(x,θ)− h(x,θ0))

∣∣ ≤ ∣∣∣2u(d(x))1/2∣∣∣ = d∗(z)

for all θ ∈ Θ. And, by condition (iii), d∗(z)2 = 4u2d(x)) has finite unconditionalexpectation E[d∗(z)2] = E[4u2d(x)] = 4σ2 E[d(x)], since E[u2 |x ] = σ2. ButE[d∗(z)2] < ∞ implies E[d∗(z)] < ∞ and all the conditions of the Lemma aremet. Therefore, since E[a(z,θ)] = 0,

supθ∈Θ

∣∣∣∣∣n−1n∑t=1

a(zt,θ)− E[a(z,θ)]

∣∣∣∣∣ = supθ∈Θ

∣∣∣∣∣n−1n∑t=1

bt

∣∣∣∣∣ p−→ 0,

and the second term also converges in probability to zero. Thus supθ∈Θ |ψ0(θ)− ψn(θ)| p−→0, which verifies uniform convergence.

We will now prove the asymptotic limiting normality of the nonlinear re-gression estimator by verifying the conditions of Theorem 4.6. In the following,∇θ indicates the gradient operator which yields the first derivative of its objectwith respect to θ and ∇θθ yields the second derivative matrix of a scalar object.

Page 20: Nonlinear Regression - Rice University

242 CHAPTER 16. NONLINEAR REGRESSION

Proof of Theorem 16.2. Consistency and interiority (N1) are assured directlyby conditions (i) and (ii) of Theorem 16.2. And local twice continuous differen-tiability (N2) follows from condition (iii). For the nonlinear regression problemwe have

∇θψn(θ) = − 2

n

n∑t=1

∂ht(θ)

∂θ(yt − ht(θ))

so by i.i.d., E[u |x ] = 0, and E[u |x ] = σ2 we have

√n∇θψn(θ0) = − 2√

n

n∑t=1

∂ht(θ0)

∂θut

d−→ N (0, 4σ2Q)

so (N5) is satified with V = 4σ2Q. Furthermore, we have

∇θθψn(θ) = n−1n∑t=1

∂2(yt − h(xt,θ))2

∂θ∂θ′

= − 2

n

n∑t=1

(yt − h(xt,θ))∂2ht(θ)

∂θ∂θ′+

2

n

n∑t=1

∂ht(θ)

∂θ

∂ht(θ)

∂θ′

and for a(zt,θ) = ∂2(yt−h(xt,θ))2

∂θ∂θ′ the conditions of the Lemma are satisfiedlocally by i.i.d., (iii), and (iv), whereupon

supθ∈N‖∇θθψn(θ)−Ψ(θ)‖ p−→ 0

for continuous Ψ(θ) = E[∂2(yt−h(xt,θ))

2

∂θ∂θ′ ] and (N3) is satisfied. Finally, (N4) is

satisfied since Q =Q(θ0) nonsingular by (v). Thus all the conditions of Theorem4.6 are satisfied and since Ψ = Ψ(θ0) = 2Q and V = 4σ2Q we have the resultgiven in Theorem 16.2.

Proof of Theorem 16.3. In the proof of Theorem 16.1 above, we showed thatψ(θ) converges uniformly to ψ0(θ) so ψn(θ) − ψ0(θ) →p 0, whereupon by the

Slutzky theorem σ2 = ψn(θ) →p ψ0(θ0) = σ2. Let a(zt,θ) = ∂h(x,θ)∂θ

∂h(x,θ)∂θ′ ,

which is continuous (ii) of Theorem 16.1, and uniformly bounded for θ ∈ N bythe assumptions of this theorem. Then by Lemma 2.4,

supθ∈N

∥∥∥∥∥ 1

n

n∑t=1

∂ht(θ)

∂θ

∂ht(θ)

∂θ′−Q(θ)

∥∥∥∥∥ p−→ 0

and Q(θ) = E[∂ht(θ)∂θ

∂ht(θ)∂θ′

]continuous. Thus, Q − Q(θ) →p 0 and by the

Stutzky theorem, Q→p Q(θ0) = Q.

Page 21: Nonlinear Regression - Rice University

Chapter 17

Discrete DependentVariable Models

Sometimes we are interested in modeling economic processes where the outcomeis dichotomous or binary in nature. For example, the decision by a worker tojoin a union or not depending on a number of conditions such as union dues,relative salaries, and education level. Another example would be the decisionto retire from a job or not depending on age and relative incomes. In fact, anysituation where the outcomes are two states that depend on economics factors isin this category. In order to quantify the states, we arbitrarily assign a 1 to onestate and a 0 to the other state. These values would be the dependent variablesof what statisticians would call a Bernouli process. Note that the integernature of the dependent variables is different from what we had in mind in theregression models where our leading example had a continuous disturbance andhence continuous dependent variable.

17.1 Linear Probability Model

Suppose we are interested in modeling the response of the state for any obser-vation to the changes in the attributes that characterize the observation. Letyi denote the dichotomous dependent variable, which takes on a value of 1 or 0,and xi a k×1 vector of values which reflect the attributes of interest. The moststraightforward approach might seem to be a linear regression as was consideredbefore

yi = x′iβ + ui (17.1)

for i = 1, 2, ..., n, where ui is the disturbance term and β is a vector of responseparameters. Consistent with the standard regression model, we suppose thatyi and xi are jointly i.i.d. and E[ui|xi] = 0. Due to the Bernouli nature of the

243

Page 22: Nonlinear Regression - Rice University

244 CHAPTER 17. DISCRETE DEPENDENT VARIABLE MODELS

process, we find

Pr(yi = 1|xi) = Pr(yi = 1|xi) · 1 + Pr(yi = 0|xi) · 0 (17.2)

= E[yi|xi]= x′iβ,

For obvious reasons, this model is known as the linear probability model andhas been widely used due to its seeming simplicity and the easy availability ofstandard least squares software.

The problem with this approach is that the validity of standard least squarestechniques depends on the usual assumptions being met, which is unfortunatelynot the case here. The most obvious problem is that, due to the integer nature ofthe dependent variable, the disturbances cannot be normally distributed. Thismeans the favored finite sample approaches using t- and F -distributions arenot appropriate. As we found in Chapter 11, non-normality does not presentunsurmountable problems provided the sample is sufficiently large and the usualstochastic assumptions are met. Moreover, since yi is generated by a Bernouliprocess,

[u2i |xi] = var[yi|xi] = x′iβ · (1−x′iβ) (17.3)

and the disturbances must, by their very nature, be conditionally heteroskedas-tic. As with non-normality, heteroskedasticity does not present unsurmountableproblems since we learned how to handle such situations, but it does complicatematters.

Most importantly, as a probability measure, Pr(yi = 1|xi), must satisfy theconstraint

1 ≥ x′iβ ≥ 0

for all observations. Although the true coefficients may satisfy this restriction,the corresponding least-squares estimates generally will not. That is, x′iβ≥1

and 0 ≥ x′jβ for some i and j is possible. This occurs because, for the truedisturbances, ui = yi − x′iβ ≥ 0 for yi = 1 and ui = yi − x′iβ ≤ 0 for yi = 0.Least squares will systematically alter the values of β so that the squares ofsome of the more extreme values of ei = yi − x′iβ will be smaller but at theexpense of making some of the smaller ei larger with an opposite sign. Thusthe OLS estimates will be biased and inconsistent.

This is most easily seen in the bivariate case. Suppose α + βxi for β > 0satisfies the condition for all xi in the sample, which lie between xl and xu(xl ≤ xu). The corresponding values of yi will be either 0 or 1, with the the 0’spredominating for smaller values of xi and the 1’s predominating for the larger.However they are distributed, the regression line will pass through the middleof the scatter for both the 0’s and the 1’s and hence have a steeper slope andless positive intercept. This is illustrated in the following diagram.

Page 23: Nonlinear Regression - Rice University

17.2. NONLINEAR PROBABILITY MODELS 245

6 8 10 12 14 16 18 20 22−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Hou

se O

wne

rsh

ip

Income

Observations

OLS Fitted

True

Figure 17.1: Actual vs. Fitted in Linear Probablility Model

17.2 Nonlinear Probability Models

17.2.1 Latent Variable Model

A more reasonably behaved model can be motivated through the use of latent orunobserved variables. Consider a yes-no decision by a consumer as to whetherto purchase a particular item or not, or the decision by a worker to join theunion or not. We might think that the decision depends on the marginal utilitygained by the yes decision versus the no decision. If the marginal gain exceedssome threshold then the yes decision is preferrable and we will observe such anaction. Let y∗i denote this gain, which results from a linear data generatingprocess such as we utilized for the linear regressions model. Specifically, we have

y∗i = x′iβ∗ + u∗i (17.4)

where (x′i, u∗i ) are jointly i.i.d. and β∗ is the k × 1 vector of parameters of

interest. We also assume u∗i independent of xi E[u∗i ] = 0, and E[u∗2i ] = σ2.The problem with this regression model is that y∗i is not directly observableinstead we observe yi which results from a decision process based on y∗i . Thisis an example of latent variable models, which play an extremely important rolein applied microeconometrics.

If β∗ includes an intercept then, without loss of generality, we can subsumethe threshold into the intercept and so assume the threshold is zero, is whichcase we observe

yi =

1 if y∗i ≥ 00 otherwise

. (17.5)

Page 24: Nonlinear Regression - Rice University

246 CHAPTER 17. DISCRETE DEPENDENT VARIABLE MODELS

Let wi = −u∗i /σ then wi is i.i.d. independent of xi with [wi] = 0, and E[w2i ] = 1,

whereupon we can rewrite (17.4) as

yi =

1 if wi ≤ x′i(β

∗/σ)0 otherwise

.

Thus, due to independence,

Pr(yi = 1|xi) = Pr(wi ≤ x′i(β∗/σ))

=

∫ x′i(β∗/σ)

−∞fw(w)dw

= Fw(x′i(β∗/σ))

where Fw(·) is the CDF of w. Of course, Pr(yi = 0|xi) = 1 − Fw(x′i(β∗/σ))

since the alternatives are mutually exclusive.

17.2.2 Nonlinear Regression Model

This finding can be used to write a nonlinear regression modeol. As above, wecan write

E[yi|xi] = Pr(yi = 1|xi) · 1 + Pr(yi = 0|xi) · 0= Fw(x′i(β

∗/σ)).

Unexpecting this equation yields

yi = Fw(x′iβ) + ui (17.6)

for i = 1, 2, ..., n, where β = β∗/σ and ui = yi−Fw(x′iβ), which by constructionhas zero conditional mean. The regression model given in (17.5) is an exampleof a single-index model, which may arise in other contexts. The conditionalexpectation of yi depends on xi only through the scalar linear function x′iβfeeding into the nonlinear CDF function Fw(·).

Note that this regression solves the need to satisfy the constraint 1 ≥ Pr(yi =1|xi) ≥ 0 by construction since Fw(·) is a CDF. Also note that we only see β∗

in ratio with σ, which means that the coefficients are only identifiable up to ascale factor. This is a normalization of the CDF to reflect unit variance. Inthe absence of the normalization, the parameter vector β∗ and standard errorσ are not separately identifiable.

One might be tempted to apply the results from the previous chapter at thispoint and estimate the model and perform inference via nonlinear least squares.Unfortunately we are still left with the problem of heteroskedasticity since

E[w2i ] = 1[u2i |xi] = var[yi|xi] = Fw(x′iβ) · (1−Fw(x

′iβ)). (17.7)

This means that any inferences based on the standard statistic yielded by NLSwill be suspect. Although we could possible obtain robust standard errors or

Page 25: Nonlinear Regression - Rice University

17.2. NONLINEAR PROBABILITY MODELS 247

attempt to correct this problem via a weighted least squares approach, we havea more basic problem in that the proof of consistency for NLS assumed that theerror term is independent of the regressors, which is obviously not true for thisformulation.

Nonetheless, the NLS estimator for this model can be shown to be consistentunder less restrictive restrictions than before. Let B denote the space of thecoefficient vector β, then we have

Theorem 17.1. Suppose (i) B compact, (ii) Q = E[xx′] nonsingular, and (iii)

Fw(·) everywhere continuous and strictly monotonically increasing, then βp−→

β0.

Proof. The proof proceeds by verifying the four conditions (C1)-(C4) of The-orem 4.5. The compactness condition (C1) is assured by Assumption (i) ofthe current theorem. For the present problem, we have ψn(β) = 1

n

∑ni=1(yi −

Fw(x′iβ))2 and

ψ0(β)= E[(y − Fw(x′β))2]

= E[((y − Fw(x′β0))− (Fw(x′β)− Fw(x′β0)))2]

= E[Fw(x′iβ0) · (1−Fw(x

′iβ

0))]+ E[(Fw(x′β)− Fw(x′β0))2].

The first term does not depend on β and the second attains a minimum ofzero at β = β0. By Assumption (ii) of the current theorem, for any c 6= 0,c′Q = E[c′xx′] =

∫c′xx′fx(x)dx 6= 0, which means c′x 6=0 for x with positive

probability measure. Let c =(β − β0), then β 6= β0 implies, for some x withpositive mesure, x′β 6= x′β0 and hence Fw(x′β) 6= Fw(x′β0) due to strict mono-tonicity from Assumption (iv). It follows that (Fw(x′β)− Fw(x′β0))2 > 0 forsome x with positive measure so E[(Fw(x′β) − Fw(x′β0))2] > 0 and ψ0(β) at-tains a unique minimum at β = β0, which means the identification condition(C4) is satisfied. This leaves us with the uniform convergence condition (C3)and the continuity condition (C3) to verify. Let a(z, β) =(y − Fw(x′β))2 thena(z, β) everywhere continuous by assumption (iii) and a(z,β) ≤1 by construc-tion. Thus the conditions of Newey and McFadden’s Lemma 2.4, presented inthe Appendix to Chapter 5, are satisfied whereupon ψ0(β) = E [a(z,β)] = E[(y−Fw(x′β))2] is continuous and supβ∈B

∥∥n−1∑ni=1 a(zi,β)− E[a(z,β)]

∥∥ p−→ 0, soconditions (C3) and (C4) are satisfied.

Assumption (ii) is the same as introduced for identification in the stochasticregressor version of the linear least squares case so has no added cost. As-sumptions (iii) and (iv) on the CDF are met by the most common choices ofdistribution for the latent disturbances: the normal and logistic.

17.2.3 Weighted Nonlinear Least Squares (WNLS)

Given its consistency, it is straightforward to develop the asymptotic normalityof the NLS estimator of the binomial probability model. Since the errors are

Page 26: Nonlinear Regression - Rice University

248 CHAPTER 17. DISCRETE DEPENDENT VARIABLE MODELS

heteroskedastic, we should utilize a robust covariance estimator, which correctsthe standard errors for the heteroskedasticity, in conducting inference. Ratherthan work out the details of such an approach, we will correct for heteroskedas-ticity directly in the estimation step as with weighted least squares. Supposethat

λ2i = E[u2i |xi] = Fw(x′iβ0) · (1−Fw(x

′iβ

0))

were known then we can transform the regression equation (17.5) to obtain thetransformed model

1

λiyi =

1

λiFw(x′iβ)+

1

λiui

or

yi = Fw(x′iβ)+ui

where yi = 1λiyi, Fw(x′iβ) = 1

λiFw(x′iβ), and ui = 1

λiui. The transformed

disturbance ui will have zero mean and unit conditional variance and hence behomoskedastic. We then proceed to perform NLS of this transformed model,which should have the same properties as the least squares estimator aboveunder similar assumptions.

The complication with this approach, of course, is that we don’t know λ2isince β is the target of our estimation exercise. It seems natural to proceedwith a two-step approach at this point. In the first step we perform NLS toobtain β, which is then used to calculate

λ2i = Fw(x′iβ) · (1−Fw(x′iβ))

and the feasible transformed model

yi =Fw(x′iβ)+ui

where yi = 1

λiyi,

Fw(x′iβ) = 1

λiFw(x′iβ), and ui = 1

λiui. We now apply NLS

taking λi as given, to obtain β as the solution to the first-order conditions

0 = −2

n∑i=1

(yi − Fw(x′iβ))∂Fw(x′iβ)

∂β(17.8)

= −2

n∑i=1

1

λ2i(yi − Fw(x′iβ))

∂Fw(x′iβ)

∂β

= −2

n∑i=1

1

Fw(x′iβ) · (1−Fw(x′iβ))

(yi − Fw(x′iβ))fw(x′iβ)xi

since∂Fw(x′iβ)

∂β = fw(x′iβ)xi where fw(·) is the PDF of w. This estimator shouldbe asymptotically efficient relative to the NLS estimator above and equivalentto an estimator based on the true values of λ2i . There might be some appeal

for using the new more efficient estimator to recalculate λ2i , yi, Fw, and ui and

Page 27: Nonlinear Regression - Rice University

17.2. NONLINEAR PROBABILITY MODELS 249

perform NLS again. This approach could be iterated until the values of β donot change.

We will bypass the details of the asymptotic behavior of these approachesfor reasons that become obvious in a few paragraphs.

17.2.4 Maximum Likelihood Estimation (MLE)

Given that we have completely specified the distributional structure of the modelit is natural to develop its’ maximum likelihood estimator. Since Pr(yi =1|xi) = Fw(x′iβ) and Pr(yi = 0|xi) = 1 − Fw(x′iβ) then the joint probabilitystructure of all the observations or the likelihood function is given by

Ln(β) = (∏yi=1Fw(x′iβ))(

∏yi=0[1− Fw(x′iβ)])

=

n∏i=1

Fw(x′iβ)yi [1− Fw(x′iβ)]1−yi

=

n∏i=1

f(yi|xi;β)

and the log-likelihood is

Ln(β) =1

nln(L(z; )) =

1

n

n∑i=1

[yi ln(Fw(x′iβ)) + (1− yi) ln(1− Fw(x′iβ))]

(17.9)

=1

n

n∑i=1

ln f(yi|xi;β)

where f(y|x;β) =Fw(x′β)y[1 − Fw(x′β)]1−y is the density of y given x. Themaximum likelihood estimator is taken as the solution to the first-order condi-tions

0 =∂L(zi; β)

∂β(17.10)

=1

n

n∑i=1

[yi1

Fw(x′iβ)fw(x′iβ)xi − (1− yi)

1

1− Fw(x′iβ)fw(x′iβ)xi]

=1

n

n∑i=1

[yi

Fw(x′iβ)(1− Fw(x′iβ))(1− Fw(x′iβ))fw(x′iβ)xi

− 1− yiFw(x′iβ)(1− Fw(x′iβ))

Fw(x′iβ)fw(x′iβ)xi]

=1

n

n∑i=1

1

Fw(x′iβ)(1− Fw(x′iβ))(yi − Fw(x′iβ))fw(x′iβ)xi.

Comparison with (17.7) reveals that β solving these equations will be the sameas the weighted NLS estimator resulting when we iterate to convergence. So the

Page 28: Nonlinear Regression - Rice University

250 CHAPTER 17. DISCRETE DEPENDENT VARIABLE MODELS

iterated weighted NLS approach is a feasible numerical approach for obtainingthe MLE of this model.

In order to establish general consistency of MLE we need to add a uniformboundedness in probability condition, which will be shown to be superfluous forthe specific Logit and Probit cases conisdered in the next section.

Theorem 17.2. Suppose (i) B compact, (ii) Q = E[xx′] nonsingular, and (iii)Fw(·) everywhere continuous and strictly monotonically increasing, and (iv)there is d(z) with |ln f(y|x;β)| ≤ d(z) for all β ∈ B and E[d(z)] < ∞, then

βp−→ β0.

Proof. The proof proceeds by verifying the five conditions (i-v) of Theorem5.1. Condition (i) is satisfied by construction with f(yi|xi;β) =Fw(x′iβ)yi [1−Fw(x′iβ)]1−yi . From the previous NLS proof, we know by (ii) and (iii) abovethat β 6= β0 implies Fw(x′β) 6= Fw(x′β0) and likewise 1 − Fw(x′β) 6= 1 −Fw(x′β0) so f(y|x;β) =Fw(x′β)y[1−Fw(x′β)]1−y 6= Fw(x′β0)y[1−Fw(x′β0)]1−y =f(y|x;β0) for x with positive measure, so condition (ii) is met. The compact-ness condition (iii) is assured by Assumption (i) of the current theorem. Forthe present problem, continuity of ln f(y|x;β0) follows from (iii), so condition(iv) is met. Finally, condition (v) is implied directly by Assumption (iv) of thecurrent theorem.

Theorem 17.3. Suppose (i) β −→p β0, β0 interior B, (ii) Fw(w) twice dif-ferentiable and strictly positive in a neighborhood N of β0, (iii) fw(w) and|dfw(w)/dw| bounded for all w, (iv) ϑ = E[∂ ln f(y|x;β0)/∂β·∂ ln f(y|x;β0)/∂β′]

exits and nonsingular, (v) E[supβ∈N

∥∥∥∂2 ln f(y|x;β))∂β∂β′

∥∥∥] <∞, then√n(β−β0) −→d

N(0,ϑ−1).

Proof. In this proof, we will verify the conditions of Theorem 5.2. Conditions(i), (ii), and (iii) there are assured by (u∗,x) jointly i.i.d., and conditions (i)and (ii) above. By (iii) above both ∂f(y|x;β)/∂β′ = (2y − 1)fw(x′β)x and∂2f(y|x;β))∂β∂β′ = (2y − 1)dfw(x′β)

dw xx′ can be shown to be bounded in norm by

integrable functions of functions of x so condition (iv) of Theorem 5.2 is satisfied.Finally, (iv) and (v) above are simply conditions (v) and (vi) of Theorem 5.2with a suitable redifinition of terms.

17.3 Probit and Logit

In order to proceed with any of the above approaches we have to specify thedensity of the latent disturbance. An obvious choice is to assume u∗ ∼ N(0, σ2)so w ∼ N(0, 1), whereupon

fw(z) = ϕ(z) =1√2πe−

12 z

2

(17.11)

Fw(·) = Φ(·) = .5[1 + erf(z√2

)]

Page 29: Nonlinear Regression - Rice University

17.3. PROBIT AND LOGIT 251

where ϕ(z) and Φ(z) denote the standard normal PDF and CDF. The nonlinearregression model based on the standard normal CDF is called the Probit model.An attractive alternative is to assume u∗ ∼ Logistic(0, σ2) so w ∼ Logistic(0, 1),whereupon

fw(w) =ew/k

k(1 + ew/k)2(17.12)

Fw(z) =1

1 + e−w/k

for k =√

3/π. We will call the regression model based on this second formula-tion the standard Logit model.

The standard logistic density is bell-shaped like the normal but slightly morepeaked and thicker-tailed but thinner in the shoulders as seen in Figure 17.2.

−4 −3 −2 −1 0 1 2 3 40

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

x

f(x)

Normal

Logistic

Figure 17.2: Standard Normal and Logistic Densities

For our current purposes, interest naturally focuses on the CDF’s of the distri-butions which are even more similar than the densities, as seen in Figure 17.3.In consequence of this similarity of CDF’s, we might expect the estimates to besimilar from the two alternatives, when they are normalized in the same fashion.A frequent justification for using the Logit approach rather than Probit is thatthe CDF is available in closed form although this rationalization may be calledinto question.

Page 30: Nonlinear Regression - Rice University

252 CHAPTER 17. DISCRETE DEPENDENT VARIABLE MODELS

−4 −3 −2 −1 0 1 2 3 40

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

f(x)

Normal

Logistic

Figure 17.3: Standard Normal and Logistic Distributions

Other choices of normalization for the CDF and hence β∗ are possible, ofcourse, which will rescale the parameter vector. An alternative scaling of thelogistic is to define wi = −(π/

√3)u∗i /σ, whereupon the above density and CDF

will still apply but now k = 1 and β =β∗ · (π/√

3)/σ. This is by far themost commonly used scaling for the logistic distribution and we will call it thecommon Logit model. Note that the estimates and their standard errors willbe rescaled as a result of this choice. Specifically, they will be (π/

√3) = 1.8138

times larger. The likelihood will, of course, be the same for either scaling aswill the usual ratios used for inference.

Aside from the natural appeal of using normally distributed disturbances inthe latent variable model with the Probit approach and the analytic tractabil-ity of the Logit approach, both approaches yield much simpler conditions forconsistency and asymptotic normality.

Theorem 17.4. Suppose the model is either Logit or Probit and (i) B compact, (ii) Q = E[xx′] nonsingular; then β̂ →p β0.

Proof. For both the Logit and Probit models, condition (iii) of Theorem 17.2 is assured by the functional form, and we show below that this, along with assumption (ii), implies the boundedness condition (iv) of Theorem 17.2 is met, so we only need to verify assumptions (i) and (ii). For the Logit, we use the mean value expansion of Fw(w) about w = x′β = 0 to obtain

ln(Fw(x′β)) = ln(Fw(0)) + λ(x′β̄) x′β

where 1 ≥ λ(w) = fw(w)/Fw(w) = 1 − Fw(w) ≥ 0 and β̄ = kβ for some 0 ≤ k ≤ 1.


By the triangle inequality, we have

|ln(Fw(x′β))| ≤ |ln(Fw(0))| + |λ(x′β̄) x′β|
             = |ln(Fw(0))| + λ(x′β̄)|x′β|        since λ(w) ≥ 0
             ≤ |ln(Fw(0))| + |x′β|               since 1 ≥ λ(w)
             ≤ |ln(Fw(0))| + ‖x‖‖β‖              by the Cauchy-Schwarz inequality.

Since the Logistic is symmetric, 1 − Fw(w) = Fw(−w) has the same bounding function. Thus, since 0 ≤ y ≤ 1 and likewise 1 − y, then

|ln f(y|x;β)| = |y ln(Fw(x′β)) + (1 − y) ln(1 − Fw(x′β))|
              ≤ 2[|ln(Fw(0))| + ‖x‖‖β‖]
              ≤ 2[|ln(Fw(0))| + b‖x‖] = d(z)

where b = supβ∈B ‖β‖ ≥ 0 exists since B is compact and ‖β‖ is continuous. And existence of the second moments of x assures that this bounding function will have a finite expectation.

For the Probit model, we use the same mean value expansion to obtain

ln(Φ(x′β)) = ln(Φ(0)) + λ(x′β̄) x′β

where λ(w) = ϕ(w)/Φ(w) ≥ 0 and β̄ = kβ for 0 ≤ k ≤ 1. The function λ(w) is known as the inverse Mills ratio and is well studied. In particular, it is convex and asymptotes to 0 as w → ∞ and to |w| as w → −∞. Consequently, it is easy to see that λ(w) ≤ 1 + |w|, since λ(0) = .798. Using the triangle inequality and replicating the above arguments, we have

|ln(Φ(x′β))| ≤ |ln(Φ(0))| + λ(x′β̄)|x′β|
             ≤ |ln(Φ(0))| + (1 + |x′β̄|)|x′β|
             ≤ |ln(Φ(0))| + (1 + ‖x‖‖β̄‖)‖x‖‖β‖        by Cauchy-Schwarz
             ≤ |ln(Φ(0))| + (1 + ‖x‖‖β‖)‖x‖‖β‖        since β̄ = kβ and 0 ≤ k ≤ 1.

And, again, due to symmetry we have

|ln f(y|x;β)| = |y ln(Φ(x′β)) + (1 − y) ln(1 − Φ(x′β))|
              ≤ 2[|ln(Φ(0))| + (1 + ‖x‖‖β‖)‖x‖‖β‖]
              ≤ 2[|ln(Φ(0))| + (1 + b‖x‖)b‖x‖] = d(z)

where b is the same as for the Logit. Existence of the second moments of x ensures that this bounding function has finite expectation.

Theorem 17.5. Suppose the model is either Logit or Probit and (i) β0 interior to B, which is compact, and (ii) Q = E[xx′] nonsingular; then √n(β̂ − β0) →d N(0, ϑ⁻¹).


Proof. By Theorem 17.4, assumptions (i) and (ii) imply the consistency and interiority of condition (i) of Theorem 17.3. Obviously, conditions (ii) and (iii) are satisfied for both models due to the functional forms. For both Logit and Probit, it is straightforward to show that

ϑ(β) = E[∂ ln f(y|x;β)/∂β · ∂ ln f(y|x;β)/∂β′]
     = E[ (y − Fw(x′β))² fw(x′β)² / (Fw(x′β)²(1 − Fw(x′β))²) · xx′ ]
     = E[ fw(x′β)² / (Fw(x′β)(1 − Fw(x′β))) · xx′ ]
     = E[ λ(x′β) λ(−x′β) xx′ ],

where the third line uses E[(y − Fw(x′β))²|x] = Fw(x′β)(1 − Fw(x′β)). For both Logit and Probit, λ(x′β)λ(−x′β) ≥ 0 is bounded from above, so the expectation exists. Moreover, since λ(w)λ(−w) is bounded away from zero for w on any open interval, we can show that positive definiteness of Q = E[xx′] implies positive definiteness of ϑ(β) = E[λ(x′β)λ(−x′β)xx′], and condition (iv) is satisfied. Let λw(w) = dλ(w)/dw; then we can show that

∂² ln f(y|x;β)/∂β∂β′ = [y λw(x′β) + (1 − y) λw(−x′β)] xx′

and note that λw(w) is bounded for both Logit and Probit, and hence y λw(x′β) + (1 − y) λw(−x′β) is also bounded. Thus

‖[y λw(x′β) + (1 − y) λw(−x′β)] xx′‖ = |y λw(x′β) + (1 − y) λw(−x′β)| ‖xx′‖ ≤ c ‖xx′‖

for some c, which has finite expectation, and condition (v) of Theorem 17.3 is satisfied.


Chapter 18

Generalized Method of Moments

18.1 Method of Moments

18.1.1 Estimation by Analogy

Suppose we are interested in some characteristic or property of the distribution of some random variable z, for example the mean, or some quantile such as the median. Typically, given knowledge of the distribution, we can write the property as a function of the parameters of the distribution. For example, the mean of a log-normal distribution is exp(µ + σ²/2), where µ is the location parameter and σ² is the scale. And estimates of the property may be obtained by estimating the parameters of the model and plugging them into the function which defines the property of interest. Given sufficient smoothness of the function and regularity of the distribution, maximum likelihood estimators of the parameters will be asymptotically efficient, and likewise the asymptotically efficient estimator of the property is the function of the maximum likelihood parameter estimators.

A limitation of this approach is the requirement that we know the underlying distribution. If we misspecify the parametric form of the distribution, we may introduce bias and inconsistency into our estimators. An alternative would be to base our estimates on the empirical distribution of the random variable taken from a sample. The empirical distribution takes the observations (z1, z2, ..., zn) as the points of a discrete distribution with probability 1/n assigned to each point. We take the properties of interest of this sample distribution as our estimators. The population mean of this distribution is the sample mean and the population variance is the sample variance. Likewise the population quantiles will be the sample quantiles. This approach is known as estimation by analogy and has a long tradition in statistics.

The essential appeal of this approach is that it will minimize the extent to which we need to parameterize the distribution. For example, in a regression model, we will not need to specify the distribution of the disturbances or the regressors but just assume certain regularity properties such as i.i.d., existence of moments, and/or boundedness. This is a nonparametric or possibly semiparametric approach which allows us to focus our attention on the part of the model that is of economic interest and not have misspecification of the distribution contaminate our analysis.

18.1.2 Method of Moments Estimation

We consider first a model of an i.i.d. scalar random variable. Suppose we seek to estimate p unknown parameters θ′ = (θ1, θ2, ..., θp) of the distribution fz(z;θ) of the random variable Z. The first p population moments of the random variable can be expressed as p known functions of the p parameters:

µ1 = E[Z] = h1(θ1, θ2, ..., θp)
µ2 = E[Z²] = h2(θ1, θ2, ..., θp)
  ⋮
µp = E[Z^p] = hp(θ1, θ2, ..., θp)

or, in vector notation, µ = h(θ).

If this mapping is one-to-one (isomorphic) then the inverse function

θ = h⁻¹(µ)

exists and the parameters are identified by the moments. Given a sample of observations (z1, z2, ..., zn), the inverse function suggests an estimator for the parameters. Let

µ̂j = (1/n) Σ_{t=1}^n z_t^j

be the sample j-th moment and µ̂′ = (µ̂1, µ̂2, ..., µ̂p); then our estimator is

θ̂ = h⁻¹(µ̂).

This estimator is known as a method of moments estimator. It is not known as "the" method of moments estimator because we might have used moments other than the first p raw moments.

If the inverse function exists and is continuous at the true parameter value θ0, then this method of moments estimator will be consistent. Given the i.i.d. assumption and existence of the p moments, the sample moments will be consistent estimates of the corresponding population moments, so µ̂ →p µ0, where µ0 denotes the true values of the population moments. By the continuity theorem, the consistency carries through the continuous inverse function, so θ̂ = h⁻¹(µ̂) →p h⁻¹(µ0) = θ0.

The above approach would seem to be parametric. However, it is often the case that the distribution can be rewritten to have two sets of parameters: those associated with the moments, say θ, and others associated with the shape of the distribution, say η. The moment conditions above may identify the first set of parameters, in which case the second set will be left unspecified. To the extent that the identified parameters θ are the same across all possible distributions, the problem becomes semiparametric, with the unspecified shape of the distribution being nonparametric. We will see the distinction in some of the examples below.

18.1.3 Just-Identified Models

This approach generalizes to a vector of random variables and moments of functions of the random variables. Suppose z is a k × 1 vector random variable. We have p functions of the data and a p × 1 vector of unknown parameters of interest, denoted

gj(z,θ) = gj(z1, z2, ..., zk; θ1, θ2, ..., θp)

for j = 1, 2, ..., p. These functions are assumed to have zero mean at the true value of the parameters, so

E[gj(z,θ0)] = 0

for j = 1, 2, ..., p. In vector notation, we have

E[g(z,θ0)] = 0

where g(z,θ)′ = (g1(z,θ), g2(z,θ), ..., gp(z,θ)). There is no loss of generality in assuming the zero mean since we can always subtract this component and include it in g(z,θ). For example, in the previous method of moments equations we can subtract Z^j from hj(θ) in each equation.

Suppose the mean of this function,

γ(θ) = E[g(z,θ)],

exists and γ(θ) = 0 has a unique solution at θ = θ0. Since the number of parameters and the number of equations are equal, we say the model is just-identified by the moment restrictions. Then a natural estimator of θ0 is provided by the analogous solution θ̂ of the sample moments

0 = (1/n) Σ_{t=1}^n g(zt, θ̂) = gn(θ̂)

where the notation is obvious.


The solution, as we will see in the examples below, is sometimes simple and sometimes not so simple. In particular, if the zero condition is linear in θ, the solution is straightforward. If the condition is nonlinear, then the solution may not be available in closed form and finding it may necessitate iterative procedures. The consistency will be easier to establish in the linear cases as well. The general asymptotic behavior of this just-identified approach is covered as a special case of the more general approach studied in the sequel.

18.1.4 Some Examples

As a simple example of the advantages of considering moments and the complications from nonlinearity, consider the 2-parameter exponential distribution:

f(z) = λ e^(−λ(z−γ))   for z ≥ γ

where γ ≥ 0 is the point at which the distribution begins to have non-zero support and λ is a scale parameter. This distribution is widely used in life-to-failure analysis in manufacturing. Maximum likelihood might seem an attractive option for estimating the parameters, but with one parameter defining the lower boundary of the support the usual assumptions are not satisfied and the MLE is not the solution to first-order conditions. Consequently, the limiting distributions will not be normal.

An attractive alternative, in this case, is to use moments to identify the parameters and proceed with the method of moments. The first two raw moments can be shown to be

µ1 = µ = γ + 1/λ
µ2 = σ² + µ² = 1/λ² + (γ + 1/λ)²

where σ² is the variance. These moment equations have the unique solution

γ = µ1 − √(µ2 − µ1²)
λ = 1/√(µ2 − µ1²).

The method of moments estimators of θ′ = (γ, λ) are obtained by plugging the sample raw moments into these nonlinear equations for the population moments. These equations, being nonlinear, will have more complicated finite-sample properties. The asymptotic limiting behavior, however, is easily obtained using the delta method and the limiting distribution of the raw moments, which is multivariate normal.
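A small numerical sketch of this estimator (illustrative only; the data are simulated with made-up parameter values γ = 2 and λ = 1.5):

    import numpy as np

    rng = np.random.default_rng(42)
    gamma0, lam0 = 2.0, 1.5
    z = gamma0 + rng.exponential(scale=1.0 / lam0, size=10_000)

    # Sample raw moments.
    mu1 = z.mean()
    mu2 = (z ** 2).mean()

    # Plug into the closed-form inverse mapping derived above.
    spread = np.sqrt(mu2 - mu1 ** 2)
    gamma_hat = mu1 - spread
    lam_hat = 1.0 / spread
    print(gamma_hat, lam_hat)   # should be close to (2.0, 1.5)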

As an example of a linear-in-the-parameters, just-identified model, we consider the instrumental variables model from Chapter 11. Our basic stochastic assumption was that w′ = (y, x′, z′) are jointly i.i.d. for x and z (p × 1) vectors and 0 = E[zu] for u = (y − x′β). Let g(w,θ) = z(y − x′β); then

0 = E[g(w,θ)]
  = E[z(y − x′β)]
  = E[zy] − E[zx′]β
  = Σzy − Σzx β
  = γ(β)

for θ = β, Σzx = E[zx′], and Σzy = E[zy]. If Σzx = P is nonsingular, as assumed in Chapter 11, then this is a just-identified problem with solution given by

β = Σzx⁻¹ Σzy.

This example is semiparametric since the shape parameters of the distributions of z, x, and u are unspecified. We only require existence of the second moments Σzx (nonsingular), Σzy, and Σzu = E[zu] = 0.

Using sample analogs, we have the method of moments estimator β̂ as the solution to

0 = (1/n) Σ_{t=1}^n g(wt, θ̂)
  = (1/n) Σ_{t=1}^n zt(yt − x′tβ̂)
  = (1/n) Σ_{t=1}^n ztyt − (1/n) Σ_{t=1}^n ztx′tβ̂
  = (1/n) Z′y − (1/n) Z′Xβ̂

where X, Z, and y are the obvious matrices of observations on x, z, and y. The solution is the IV estimator

β̂ = ((1/n) Z′X)⁻¹ (1/n) Z′y = (Z′X)⁻¹ Z′y

which is seen as a sample analog and will exist with probability one due to the nonsingularity of P. Although gn(θ) is linear in θ = β, the solution is nonlinear in x and z and the finite-sample distribution becomes complicated. The large-sample asymptotic behavior is considered at length in Chapter 11 and falls out as a special case of the over-identified models considered below.
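The sample-analog computation is direct. The following sketch (illustrative only, with simulated data and arbitrary parameter values) computes β̂ = (Z′X)⁻¹Z′y and contrasts it with OLS, which is inconsistent here because x is correlated with u by construction.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 500, 2
    beta0 = np.array([1.0, -0.5])

    # Simulated endogenous regressors: x depends on the instruments z and on the
    # disturbance u, so E[xu] != 0 while E[zu] = 0.
    z = rng.normal(size=(n, p))
    u = rng.normal(size=n)
    x = z @ np.array([[1.0, 0.3], [0.2, 1.0]]) + 0.5 * u[:, None] + rng.normal(size=(n, p))
    y = x @ beta0 + u

    # Just-identified method of moments / IV estimator: solve (1/n) Z'(y - X b) = 0.
    beta_iv = np.linalg.solve(z.T @ x, z.T @ y)
    beta_ols = np.linalg.solve(x.T @ x, x.T @ y)   # biased and inconsistent here
    print("IV :", beta_iv)
    print("OLS:", beta_ols)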


18.2 Generalized Method of Moments (GMM)

18.2.1 Over-Identified Models

Suppose we have more zero-mean conditions than the number of parameters being estimated, for example, if we have more than p instruments in the IV problem above. Let m denote the length of the vector g(z,θ) in the zero-mean condition

0 = E[g(z,θ0)] (18.1)

and m > p; then we say the model is over-identified. When m = p, the model is just-identified and the method of moments discussion above applies.

If the elements of g(z,θ) are linearly independent with respect to θ, then the method of moments approach discussed above presents complications. Specifically, we assume the second moment matrix

Ω(θ) = E[g(z,θ)g(z,θ)′] (18.2)

is nonsingular, since if this covariance matrix were singular we could eliminate some of the elements of g(z,θ) as linear combinations of the others.

We can attempt, as before, to find θ̂ that yields 0 = gn(θ̂). Unfortunately, this system is over-determined with respect to θ. Although the mean is zero in the population at θ0, the sample mean will have positive definite covariance Ω(θ)/n and hence

0 ≠ λ′gn(θ)

with probability one for any λ ≠ 0 and any choice of θ, including even θ0. We can set p of the sample moments to zero, but the probability of any of the remainder being a linear combination of the zeroed moments is zero.

A possible solution is to discard some of the equations so that m = p and proceed to obtain a just-identified method of moments estimator. The obvious question with this approach is which equations to discard and which to retain. Plus, we have learned that additional restrictions on the distribution of estimators typically reduce their variability. More will be said about efficiency gains below.

18.2.2 GMM Estimation

Although we cannot set all the sample moments gn(θ) to zero at the same time, it is possible to make them all small, in some sense. Note that gn(θ0) has mean zero and covariance Ω/n, so by convergence in quadratic mean it will be arbitrarily close to zero with probability one. In choosing θ to make gn(θ) small, there will be trade-offs between the elements, making some elements larger and others smaller as we vary our choice of θ. One approach might be to minimize the sum of squares of gn(θ), namely gn(θ)′gn(θ). A more general approach would be to weight these trade-offs, as in minimizing gn(θ)′Wgn(θ), where W is a positive definite weight matrix.


Of course some choices of W turn out to be better than others. And the optimal ones will have to be estimated. So we consider the more general minimization problem where the weight matrix is a consistent estimator Ŵ of W. Specifically, treating Ŵ as given, we find θ̂ such that

ψn(θ) = ½ gn(θ)′ Ŵ gn(θ) (18.3)

is minimized, so

θ̂ = argmin_{θ∈Θ} ψn(θ) (18.4)

where, as mentioned above, Θ is the set from which we are choosing. This is called the generalized method of moments (GMM) approach and was first proposed by Lars Hansen. For mathematical convenience and without loss of generality, we have halved the criterion function, since the minimizer will be the same whether we halve or not.

As with the least squares models (linear and nonlinear), we find the estimates as the solution to the first-order conditions

0 = ∂ψn(θ̂)/∂θ = (∂gn(θ̂)/∂θ′)′ Ŵ gn(θ̂) = Gn(θ̂)′ Ŵ gn(θ̂) (18.5)

where Gn(θ) = ∂gn(θ)/∂θ′. Note that gn(θ) may be linear in θ, in which case Gn(θ) = Gn is not a function of θ, and the first-order conditions will also be linear in θ and easily solved. We will consider an IV example below that satisfies this simplification.

The method of moments estimator falls out as a special case of GMM when m = p and the model is just-identified by the moment conditions. In this case, we can choose θ̂ such that gn(θ̂) = 0, which obviously minimizes the criterion function. Note that the weight matrix Ŵ has no role in determining the estimator in this case, which will be reflected in the limiting covariance results below.

18.2.3 Newton-Raphson Method

If gn(θ) is nonlinear in θ, then Gn(θ) will be a function of θ, the first-order conditions will be nonlinear in θ, and finding a solution may require iterative techniques. In the nonlinear regression case, we linearized the regression function and proceeded to do least squares. Here we will linearize the gn(θ) function and use the linearized version to do GMM estimation.

First, we expand gn(θ) in a Taylor series (about some initial estimate θ̂i) to obtain

gn(θ) = gn(θ̂i) + Gn(θ̂i)(θ − θ̂i) + Ri

where Ri is the remainder term. We linearize by treating θ̂i as given, defining δ = (θ − θ̂i), and setting the remainder term to zero to obtain

g∗n(δ) = gn(θ̂i) + Gn(θ̂i)δ.

Using this linearization in the criterion function, we seek to minimize ½ g∗n(δ)′ Ŵ g∗n(δ) with respect to δ. Our estimate δ̂ is given as the solution to the first-order conditions

0 = (∂g∗n(δ)/∂δ′)′ Ŵ g∗n(δ̂) = Gn(θ̂i)′ Ŵ [gn(θ̂i) + Gn(θ̂i)δ̂].

Assuming Gn(θ̂i)′ Ŵ Gn(θ̂i) has full rank, then

δ̂ = −(Gn(θ̂i)′ Ŵ Gn(θ̂i))⁻¹ Gn(θ̂i)′ Ŵ gn(θ̂i)

is the unique solution to the linearized GMM system. Define θ̂i+1 = θ̂i + δ̂; then we can rearrange the solution for the linearized model as the iterative equations

θ̂i+1 = θ̂i − (Gn′ Ŵ Gn)⁻¹ Gn′ Ŵ gn |θ̂i (18.6)

where |θ̂i indicates that Gn and gn have been evaluated at θ̂i. Suppose we iterate until θ̂i+1 is unchanged from θ̂i; then at convergence, we define θ̂ = θ̂i+1 = θ̂i and

0 = (Gn′ Ŵ Gn)⁻¹ Gn′ Ŵ gn |θ̂

or, since Gn′ Ŵ Gn |θ̂ is nonsingular,

0 = Gn′ Ŵ gn |θ̂

and the first-order conditions (18.5) are satisfied. In the literature this procedure for finding a solution to the GMM problem is known as the Newton-Raphson method.

Strictly speaking, Newton's method, as was shown in Chapter 16, yields the iterations

θ̂i+1 = θ̂i − (∂²ψn(θ̂i)/∂θ∂θ′)⁻¹ ∂ψn(θ̂i)/∂θ,

assuming the second derivatives exist. Now if g(z,θ) is twice continuously differentiable, then for our current problem

∂²ψn(θ)/∂θi∂θ = (∂²gn(θ)/∂θi∂θ′)′ Ŵ gn(θ) + Gn(θ)′ Ŵ (∂gn(θ)/∂θi).

When evaluated at θ0, the first term becomes asymptotically negligible in probability, since gn(θ0) →p 0, Ŵ →p W, and ∂²gn(θ0)/∂θi∂θj = Op(1) for E[∂²g(z,θ0)/∂θi∂θj] finite. Thus at the target value θ0 we find

∂²ψn(θ0)/∂θ∂θ′ − Gn(θ0)′ Ŵ Gn(θ0) →p 0,

which suggests using the simpler matrix in the iterations. This Newton-Raphson approach is the analog of the Gauss-Newton procedure for nonlinear least squares, where we also eliminated a term in the second derivative matrix since it was asymptotically negligible at the truth.

All of the practical issues identified for the Gauss-Newton procedure for the nonlinear regression model in Chapter 16 apply equally to the current problem when the first-order conditions are nonlinear. Multiple roots or solutions are a potential problem. Convergence may be a problem, though we know that convergence is quadratic near the solution. Good preliminary estimates may go a long way toward avoiding both of these problems. And using numeric derivatives, while attractive, can lead to substantial losses in the precision of the estimates. Refer back to Chapter 16 for more information on these issues.
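To make the iteration (18.6) concrete, here is a minimal sketch of the procedure for a generic moment function; the names gbar and Gbar (the sample moment vector and its Jacobian), the tolerance, and the iteration cap are illustrative choices, not part of the text.

    import numpy as np

    def gmm_newton_raphson(gbar, Gbar, theta0, W, tol=1e-8, max_iter=100):
        """Iterate theta_{i+1} = theta_i - (G'WG)^{-1} G'W g, as in (18.6).

        gbar(theta)  -> (m,) vector of sample moments  g_n(theta)
        Gbar(theta)  -> (m, p) Jacobian                G_n(theta)
        W            -> (m, m) positive definite weight matrix
        """
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g, G = gbar(theta), Gbar(theta)
            step = np.linalg.solve(G.T @ W @ G, G.T @ W @ g)
            theta_new = theta - step
            if np.max(np.abs(theta_new - theta)) < tol:
                return theta_new
            theta = theta_new
        return theta

As the text cautions, good starting values matter: the iteration is only guaranteed to behave well near a solution.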

18.2.4 A Simple Over-Identified Example

Consider again the linear instrumental variables problem from Chapter 11, but suppose that z is an m × 1 vector and m > p. Then we have more instruments than regressors and the model is over-identified. For this model we have

gn(θ) = (1/n) Σ_{t=1}^n zt(yt − x′tβ) = (1/n) Z′(y − Xβ)
Gn(θ) = −(1/n) Z′X

for θ = β. For some choice of nonsingular Ŵ, the first-order conditions for GMM become

0 = (−(1/n) Z′X)′ Ŵ ((1/n) Z′(y − Xβ̂)) = −(1/n) X′Z Ŵ ((1/n) Z′(y − Xβ̂))

which has the solution

β̂ = (X′Z Ŵ Z′X)⁻¹ X′Z Ŵ Z′y.

This estimator obviously differs from the usual IV estimator and equally obviously depends on the choice of weight matrix Ŵ. If we choose Ŵ = ((1/n) Z′Z)⁻¹, then the estimator is the 2SLS estimator.

This GMM estimator can be framed as an IV estimator to gain some insight. Define Z̃ = Z Ŵ Z′X; then

β̂ = (Z̃′X)⁻¹ Z̃′y

and the GMM estimator is seen as an IV estimator with a particular choice of instruments. The instruments Z̃ = ZA are a linear transformation of the original instruments Z using the m × p transformation matrix A = Ŵ Z′X. Once again an obvious issue, which will be taken up later, is the appropriate choice of Ŵ. For the 2SLS weights, the instruments will be the fitted values from a first-stage regression of xt on zt.
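A sketch of this estimator in the linear case (illustrative only; y, X, and Z stand for hypothetical data arrays of conformable dimensions):

    import numpy as np

    def linear_gmm(y, X, Z, W=None):
        """GMM estimator b = (X'Z W Z'X)^{-1} X'Z W Z'y for the linear IV model.

        With W = (Z'Z/n)^{-1}, the default here, this reproduces the 2SLS estimator.
        """
        n = len(y)
        if W is None:
            W = np.linalg.inv(Z.T @ Z / n)        # 2SLS weight matrix
        XZ = X.T @ Z
        return np.linalg.solve(XZ @ W @ Z.T @ X, XZ @ W @ Z.T @ y)

Equivalently, forming Ztilde = Z @ W @ Z.T @ X and solving Ztilde′X b = Ztilde′y gives the same answer, which is the "transformed instruments" interpretation above.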

18.3 Asymptotic Behavior of GMM

18.3.1 Consistency

We now turn to the properties of the estimator that comes from the GMM exercise. The estimator is generally nonlinear, hence the iterative solution to the first-order conditions. As a result, the small-sample properties of the estimator are problematic and we must generally rely on the large-sample asymptotic behavior for purposes of inference.

The following consistency result follows from verification of the high-level conditions given in Theorem 4.5 from Chapter 4.

Theorem 18.1. Suppose Ŵ →p W and (i) θo ∈ Θ compact; (ii) g(z;θ) continuous for θ ∈ Θ with probability 1; (iii) ∃ d(z) with ‖g(z;θ)‖ < d(z) ∀ θ ∈ Θ and E[d(z)] < ∞; (iv) W positive definite and E[g(z;θ)] = 0 iff θ = θo; then θ̂ →p θo.

Proof: See Appendix to this chapter.

Strictly speaking, these low-level conditions are not assured, in general, and should be verified for the model at hand. The first two assumptions will be relatively innocuous and easily verified for most models. The third assumption, which is used to prove uniform convergence in probability, can be tedious to verify but will not usually present problems. In some cases, such as the linear IV problem, it is straightforward.

The fourth assumption assures global identifiability of the parameter vector within Θ. This global identification means that we can obtain consistent estimates regardless of the starting values used. It may be unrealistic when gn(θ) is nonlinear, since it is possible that Wγ(θ) = 0 has multiple solutions. This condition may be weakened to local identification by substituting the condition that Wγ(θ) = 0 iff θ = θo for all θ in a neighborhood of θo. In this case, consistency would follow if the initial value is itself consistent or is assured to be in the specified neighborhood with probability one.

18.3.2 Asymptotic Normality

The structure of the GMM problem allows us to obtain normality with only first derivatives assumed. Accordingly, we will not appeal to Theorem 4.6 but instead prove the result directly.

Theorem 18.2. Suppose Ŵ →p W and (i) θ̂ →p θo, θo interior to Θ; (ii) g(z;θ) continuously differentiable in a neighborhood N of θo; (iii) ∃ D(z) s.t. ‖∂g(z,θ)/∂θ′‖ < D(z) for all θ ∈ N and E[D(z)] < ∞; (iv) Γ′WΓ nonsingular for Γ = E[∂g(z,θ0)/∂θ′]; and (v) √n gn(θ0) →d N(0, Ω); then

√n(θ̂ − θo) →d N(0, (Γ′WΓ)⁻¹ Γ′WΩWΓ (Γ′WΓ)⁻¹).

Proof: See Appendix to this chapter.

We do not need all the assumptions of the consistency proof, only the interiority and consistency of θ̂ in (i), which can occur under more general conditions. Condition (ii) is a regularity condition and is easily verified. Condition (iii) is analogous to (iii) in the consistency theorem, except now it is the derivative that is dominated so that the matrix Γ can be estimated consistently. Condition (iv) is needed for the inverse in the limiting covariance and usually follows from local identification. Condition (v) only requires existence of moments of g(z;θ0) and is needed to invoke the central limit theorem and obtain asymptotic normality.

For the just-identified case, m = p and the matrix Γ is square and nonsingular, so the limiting covariance matrix simplifies to Γ⁻¹ΩΓ′⁻¹ = (Γ′Ω⁻¹Γ)⁻¹, which does not depend on the weight matrix used and attains the lower bound developed in the next subsection.

18.3.3 Optimal Weight Matrix

The consistency and normality results above apply for any weight matrix such that Ŵ →p W. The results in Theorem 18.2 allow us to establish the weight matrix that would minimize, in some sense, the limiting covariance matrix.

When the weight matrix is chosen optimally, the form of the covariance matrix simplifies considerably. For the present case, the choice W = Ω⁻¹, whereupon the covariance becomes (Γ′Ω⁻¹Γ)⁻¹, yields the smallest covariance in the sense given below. Since Ω is positive definite, we can find a square upper-triangular U such that Ω = U′U and Ω⁻¹ = U⁻¹U⁻¹′. Define Γ∗ = U⁻¹′Γ and W∗ = UWU′; then the difference

(Γ′Ω⁻¹Γ) − (Γ′WΓ)(Γ′WΩWΓ)⁻¹(Γ′WΓ)
  = (Γ′U⁻¹U⁻¹′Γ) − (Γ′U⁻¹UWU′U⁻¹′Γ)(Γ′U⁻¹UWU′UWU′U⁻¹′Γ)⁻¹(Γ′U⁻¹UWU′U⁻¹′Γ)   (18.7)
  = (Γ∗′Γ∗) − (Γ∗′W∗Γ∗)(Γ∗′W∗W∗Γ∗)⁻¹(Γ∗′W∗Γ∗)
  = Γ∗′[Im − W∗Γ∗(Γ∗′W∗W∗Γ∗)⁻¹Γ∗′W∗]Γ∗.

The component in brackets is symmetric idempotent, so the entire expression is positive semidefinite. This means the first matrix in the difference on the left-hand side of (18.7) exceeds the second by a positive semidefinite matrix. And the inverse of the second, which is the covariance for any W, exceeds the inverse of the first, which is the covariance for W = Ω⁻¹, by a positive semidefinite matrix. Thus (Γ′Ω⁻¹Γ)⁻¹ is a lower bound for the limiting covariance matrix of the GMM estimator.


18.3.4 Efficient Two-Step Estimator

In order to obtain the efficient GMM estimator we need a consistent estimator of Ω = E[g(z;θ0)g(z;θ0)′]. A natural estimator is

Ω̃ = Ωn(θ̃) = (1/n) Σ_{t=1}^n g(zt; θ̃) g(zt; θ̃)′ (18.8)

where θ̃ is a preliminary consistent estimator of θ0. This estimator is consistent and will also be useful in consistently estimating the efficient limiting covariance matrix. We will also require a consistent estimator of Γ, which is provided by Gn(θ̃). Consistency of both is assured by the following theorem.

Theorem 18.3. Suppose the hypotheses of Theorem 18.2 are satisfied and ∃ δ(z) s.t. ‖g(z,θ)‖² < δ(z) for all θ in a neighborhood N of θ0 and E[δ(z)] < ∞; then Gn(θ̃) →p Γ and Ωn(θ̃) →p Ω.

Proof: See Appendix to this chapter.

The conditions of this theorem only slightly strengthen those of Theorem 18.2. We can take D(z) from Theorem 18.2 and set δ(z) = D(z)² for the bounding function here and add the assumption that E[D(z)²] < ∞.

This leaves us with the task of finding an initial consistent estimator of θ. A common practice is to set Ŵ = W = Im and proceed with GMM using this weight matrix as a first step, so

θ̃ = argmin_{θ∈Θ} ½ gn(θ)′gn(θ).

By Theorem 18.1 this estimator will be consistent, and by Theorem 18.3, Ω̃ = Ωn(θ̃) based on it will also be consistent. We next obtain a second-step GMM estimator using Ω̃⁻¹ as the weight matrix, so

θ̂ = argmin_{θ∈Θ} ½ gn(θ)′ Ω̃⁻¹ gn(θ).

This two-step estimator will attain the lower bound for the limiting covariance. Specifically,

√n(θ̂ − θo) →d N(0, (Γ′Ω⁻¹Γ)⁻¹).

This estimator is called the two-step efficient GMM estimator.

This two-step estimator will depend on the choice of the weight matrix in the first step. It is possible to avoid this shortcoming by recomputing the estimated weight matrix Ω̂ = Ωn(θ̂) using the second-step estimator θ̂ and then re-estimating GMM using this updated weight matrix. We continue to iterate in this fashion until θ̂ converges, whereupon Ω̂ will also have converged. This approach is called the iterated efficient GMM estimator. Note that the limiting distribution will be the same as for the two-step procedure since Ω̂ will, in the end, still be consistent. It has an advantage in that the estimator is invariant with respect to the scale of the data and the initial weighting matrix. And there is some evidence that the iterations result in better finite-sample properties.
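The two-step recipe can be sketched as follows (illustrative only; it reuses the hypothetical gmm_newton_raphson solver from the earlier sketch, takes a function g_obs returning the n × m matrix of moment contributions, and estimates Ω as in (18.8)).

    import numpy as np

    def two_step_gmm(g_obs, gbar, Gbar, theta_start):
        """Two-step efficient GMM sketch.

        g_obs(theta) -> (n, m) matrix whose t-th row is g(z_t, theta)
        gbar, Gbar   -> sample moment vector and Jacobian, as before
        """
        m = g_obs(theta_start).shape[1]

        # Step 1: identity weight matrix.
        theta1 = gmm_newton_raphson(gbar, Gbar, theta_start, W=np.eye(m))

        # Estimate Omega = E[g g'] at the first-step estimate, as in (18.8).
        G1 = g_obs(theta1)
        omega_hat = G1.T @ G1 / G1.shape[0]

        # Step 2: efficient weight matrix W = Omega^{-1}.
        theta2 = gmm_newton_raphson(gbar, Gbar, theta1, W=np.linalg.inv(omega_hat))
        return theta2, omega_hat

The iterated version simply repeats the Omega/estimation cycle until the parameter vector stops changing.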

Another variation is to include the weight matrix in the minimization problem. Define

Ωn(θ) = (1/n) Σ_{t=1}^n g(zt;θ)g(zt;θ)′

and then take the estimator of θ as

θ̂ = argmin_{θ∈Θ} ½ gn(θ)′ Ωn(θ)⁻¹ gn(θ).

This estimator is called the continuously updating efficient GMM estimator. It is somewhat more complicated but has the same limiting distribution as the two-step and iterated GMM estimators. The usual asymptotic limiting distribution can be thought of as the leading term in an expansion with additional terms that are of smaller order in n. When account is taken of these higher-order terms, there is some theoretical evidence that the continuously updating GMM estimator performs better. Such an analysis is beyond the scope of this chapter.

18.4 GMM Inference

18.4.1 Standard Normal Tests

First we consider testing a scalar null hypothesis such as H0 : θj = θoj. Now, the convergent iterate θ̂ will, by Theorem 15.2, yield

√n(θ̂ − θo) →d N(0, C)

for C = (Γ′WΓ)⁻¹ Γ′WΩWΓ (Γ′WΓ)⁻¹. Thus,

√n(θ̂j − θoj) / √Cjj →d N(0, 1)

where the jj subscript indicates the jj-th element of the matrix. Substituting the consistent estimators from Theorem 18.3, we have

√n(θ̂j − θoj) / √Ĉjj →d N(0, 1) (18.9)

for Ĉ = (Gn(θ̂)′ŴGn(θ̂))⁻¹ Gn(θ̂)′Ŵ Ω̂ Ŵ Gn(θ̂) (Gn(θ̂)′ŴGn(θ̂))⁻¹ →p C. Under an alternative hypothesis, say H1 : θj = θ1j ≠ θoj, we have

√n(θ̂j − θoj)/√Ĉjj = √n(θ̂j − θ1j)/√Ĉjj + √n(θ1j − θoj)/√Ĉjj
                   = N(0, 1) + √n(θ1j − θoj)/√Ĉjj + op(1),

so we see that the statistic will diverge in the positive/negative direction at the rate √n depending on whether (θ1j − θoj) is positive/negative. The test will therefore be consistent.

18.4.2 Wald-Type Tests

Suppose that we seek to test the null hypothesis H0 : r(θ0) = 0, where r(·) is a q × 1 continuously differentiable vector function. The asymptotic normality result may be utilized to obtain a quadratic-form test of the Wald type. Using the delta method we find that

√n r(θ̂) = √n r(θ0) + R(θ∗)√n(θ̂ − θ0)
         = R(θ0)√n(θ̂ − θ0) + op(1)
         →d N(0, RCR′)

where R(θ) = ∂r(θ)/∂θ′, R = R(θ0), and θ∗ lies between θ̂ and θ0, so

n r(θ̂)′[RCR′]⁻¹ r(θ̂) →d χ²q (18.10)

for RCR′ nonsingular. A consistent estimator Ĉ of C was introduced above, and R̂ = R(θ̂) will be consistent for R. Thus, under the null hypothesis, we have the feasible Wald-type statistic

WGMM = n r(θ̂)′[R̂ Ĉ R̂′]⁻¹ r(θ̂) →d χ²q.

Under the alternative hypothesis H1 : r(θ0) ≠ 0, the statistic will diverge in the positive direction at the rate n.

18.4.3 LR-Type Test

The likelihood-ratio statistic compares the value of the log-likelihood function at restricted and unrestricted estimates of the parameter. We can construct a similar statistic using the GMM criterion functions when they are based on a consistent estimator of the optimal weight matrix W = Ω⁻¹.

First, we must work out the details of the restricted GMM estimator. Define

θ̄ = argmin_{θ∈Θ} ψn(θ) s.t. r(θ) = 0

and the corresponding Lagrangian objective function

ϕ(θ,λ) = ½ gn(θ)′ Ŵ gn(θ) + λ′r(θ)

where λ is a q × 1 vector of Lagrange multipliers. The first-order conditions for optimizing this function, evaluated at the solution (θ̄, λ̄), are

∂ϕ(θ̄,λ̄)/∂θ = Gn(θ̄)′ Ŵ gn(θ̄) + R(θ̄)′λ̄ = 0 (18.11)
∂ϕ(θ̄,λ̄)/∂λ = r(θ̄) = 0. (18.12)


The solution θ̄ is the restricted GMM estimator. An alternative would be to use the restrictions to substitute out some of the parameters and then proceed with the usual GMM approach on the simplified problem. Notice that we use, at least asymptotically, the same weight matrix for both the restricted and unrestricted criterion functions.

In the following, we will define V = Γ′WΓ. Since θ̄ →p θ0, then Gn(θ̄) →p Γ and R(θ̄) →p R = R(θ0). Now premultiply the first equation (18.11) by RV⁻¹ to obtain

0 = RV⁻¹Gn(θ̄)′Ŵgn(θ̄) + RV⁻¹R(θ̄)′λ̄

which can be solved for λ̄ to yield

λ̄ = −[RV⁻¹R′]⁻¹RV⁻¹Gn(θ̄)′Ŵgn(θ̄)
   = −[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn(θ̄) + op(1/√n).

Now define gn = gn(θ0), Gn = Gn(θ0), and expand gn(θ̄) around θ0 to obtain

λ̄ = −[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn − [RV⁻¹R′]⁻¹RV⁻¹Γ′W Gn(θ̄ − θ0) + op(1/√n)
   = −[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn − [RV⁻¹R′]⁻¹R(θ̄ − θ0) + op(1/√n)
   = −[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn + op(1/√n)

since Γ′W Gn →p V, (θ̄ − θ0) = Op(1/√n), and 0 = r(θ0), whereupon 0 = r(θ̄) = R(θ̄ − θ0) + op(1/√n).

Expanding the first term in the first equation (18.11) of the first-order conditions around θ̂ yields

Gn(θ̄)′Ŵgn(θ̄) = Gn(θ̂)′Ŵgn(θ̂) + Gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂) + op(1/√n)
             = Gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂) + op(1/√n)
             = Γ′WΓ(θ̄ − θ̂) + op(1/√n)

since Gn(θ̂)′Ŵgn(θ̂) = 0. Substituting this and the expression for λ̄ back into the first equation (18.11) of the first-order conditions yields

0 = Γ′WΓ(θ̄ − θ̂) − R′[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn + op(1/√n)
  = V(θ̄ − θ̂) − R′[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn + op(1/√n).


Solving for (θ̄ − θ̂) yields

(θ̄ − θ̂) = V⁻¹R′[RV⁻¹R′]⁻¹RV⁻¹Γ′W gn + op(1/√n)
         = −V⁻¹R′[RV⁻¹R′]⁻¹R(θ̂ − θ0) + op(1/√n)   (18.13)
         = −V⁻¹R′[RV⁻¹R′]⁻¹r(θ̂) + op(1/√n),        (18.14)

since √n(θ̂ − θ0) = −V⁻¹Γ′W √n gn + op(1) and r(θ̂) = R(θ0)(θ̂ − θ0) + op(1/√n). Furthermore, we find

√n(θ̄ − θ0) = [Ip − V⁻¹R′[RV⁻¹R′]⁻¹R]√n(θ̂ − θ0) + op(1)
            →d N(0, V⁻¹ − V⁻¹R′[RV⁻¹R′]⁻¹RV⁻¹).

Note that this covariance matrix has rank p − q, which reflects the fact that θ̄ satisfies the q restrictions.

Expanding gn(θ̄) around θ̂, we obtain

gn(θ̄) = gn(θ̂) + Gn(θ̂)(θ̄ − θ̂) + op(1/√n)

which may be substituted into ψn(θ̄) = ½ gn(θ̄)′Ŵgn(θ̄) to yield

ψn(θ̄) = ½ gn(θ̂)′Ŵgn(θ̂) + gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂)
        + ½ (θ̄ − θ̂)′Gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂) + op(1/n).

Recognizing that the first term on the right-hand side is ψn(θ̂) and the second term is zero, reorganizing, and multiplying by 2n yields

2n(ψn(θ̄) − ψn(θ̂)) = n(θ̄ − θ̂)′Gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂) + op(1)
                   = n(θ̄ − θ̂)′Γ′WΓ(θ̄ − θ̂) + op(1)
                   = n(θ̄ − θ̂)′V(θ̄ − θ̂) + op(1).

Now, substitute for (θ̄ − θ̂) from above and define LRGMM = 2n[ψn(θ̄) − ψn(θ̂)] to obtain

LRGMM = n r(θ̂)′[RV⁻¹R′]⁻¹RV⁻¹ V V⁻¹R′[RV⁻¹R′]⁻¹ r(θ̂) + op(1)
      = n r(θ̂)′[RV⁻¹R′]⁻¹ r(θ̂) + op(1).

The quadratic form above is quite similar to that for the Wald-type statistic, except for the matrix in the middle, which is C for WGMM and V⁻¹ for LRGMM. If we use asymptotically efficient weights, then W = Ω⁻¹ and C = (Γ′Ω⁻¹Γ)⁻¹ = V⁻¹, whereupon

LRGMM = WGMM + op(1) →d χ²q.

And we see that, when based on an asymptotically efficient weight matrix, the LRGMM and WGMM tests are asymptotically equivalent and will reject or not together in large samples, under the null hypothesis. If we do not use the efficient weight matrix, then the two will not be asymptotically equivalent and the LRGMM will not have the posited limiting chi-squared distribution.

One special case of the LR-type test has received enormous attention in the literature. Suppose the unrestricted model is just-identified; then the unrestricted GMM estimator becomes method of moments and is the solution to gn(θ̂) = 0, so ψn(θ̂) = 0 and LRGMM = 2nψn(θ̄) →d χ²q. Now consider the over-identified unrestricted model above with m equations and p parameters, and suppose we add m − p parameters to the equations so that the model is just-identified, using our model with p parameters as the restricted model. Since we do not need to perform the just-identified unrestricted estimation, we can treat θ̂ as the restricted estimator and define

J = 2nψn(θ̂) →d χ²m−p.

This is called the J-test and is a test of all the over-identifying restrictions. There are p parameters and m > p moment conditions. Any p of the moment conditions would identify the parameters, so we do not distinguish between identifying and over-identifying moment conditions – all are treated symmetrically. The null hypothesis is that all the moment conditions are true. If any are not satisfied, we expect to get larger values of the statistic.
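A sketch of the J-test computation (illustrative only; it assumes the efficient weight Ω̂⁻¹ was used in estimation and reuses the hypothetical g_obs moment-matrix function from the earlier sketches):

    import numpy as np
    from scipy import stats

    def j_test(theta_hat, g_obs, omega_hat):
        """Hansen's J statistic 2n*psi_n(theta_hat) = n * gbar' Omega^{-1} gbar."""
        G = g_obs(theta_hat)                   # (n, m) matrix of g(z_t, theta_hat)
        n, m = G.shape
        p = len(np.atleast_1d(theta_hat))
        gbar = G.mean(axis=0)
        J = n * gbar @ np.linalg.solve(omega_hat, gbar)
        return J, 1.0 - stats.chi2.cdf(J, df=m - p)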

18.4.4 LM-Type Test

Having established the limiting distribution of the restricted estimator, we can develop a GMM version of the Lagrange multiplier test in a straightforward fashion. In the maximum likelihood case, Lagrange multiplier tests are also called score tests and are quadratic forms involving the unrestricted first derivatives of the likelihood function evaluated at the restricted estimates. We first consider the GMM case for a general weight matrix and then the case where the weight matrix is asymptotically optimal.

Analogous to maximum likelihood, consider the first derivative of the unrestricted criterion function evaluated at the restricted estimates,

∂ψn(θ̄)/∂θ = Gn(θ̄)′Ŵgn(θ̄),

which will generally be nonzero. Now expand around θ̂ to find

gn(θ̄) = gn(θ̂) + Gn(θ̂)(θ̄ − θ̂) + op(1/√n)
Gn(θ̄) = Gn(θ̂) + Op(1/√n)

which can be substituted into the derivative to yield

∂ψn(θ̄)/∂θ = Gn(θ̄)′Ŵgn(θ̂) + Gn(θ̄)′ŴGn(θ̂)(θ̄ − θ̂) + op(1/√n)
           = Gn(θ̂)′Ŵgn(θ̂) + (Gn(θ̄) − Gn(θ̂))′Ŵgn(θ̂)
             + Gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂)
             + (Gn(θ̄) − Gn(θ̂))′ŴGn(θ̂)(θ̄ − θ̂) + op(1/√n)
           = Gn(θ̂)′ŴGn(θ̂)(θ̄ − θ̂) + op(1/√n)

since the first term is zero and the second and fourth are op(1/√n). Now Gn(θ̂)′ŴGn(θ̂) →p V, so we have

√n ∂ψn(θ̄)/∂θ = V √n(θ̄ − θ̂) + op(1)
              = −R′[RV⁻¹R′]⁻¹R √n(θ̂ − θ0) + op(1)   (18.15)
              →d N(0, R′[RV⁻¹R′]⁻¹RCR′[RV⁻¹R′]⁻¹R)

where we have substituted for (θ̄ − θ̂) from (18.13). Note that R is q × p, so the covariance matrix has rank q < p and this is a singular multivariate normal distribution. Thus we will form the quadratic form using a generalized inverse of the covariance matrix to obtain the Lagrange multiplier test statistic

LMGMM = n (∂ψn(θ̄)/∂θ)′ [R′[RV⁻¹R′]⁻¹RCR′[RV⁻¹R′]⁻¹R]⁺ (∂ψn(θ̄)/∂θ) →d χ²q.

And the LMGMM test has the same limiting distribution as the WGMM test. Using a generalized inverse effectively eliminates p − q linearly dependent elements of ∂ψn(θ̄)/∂θ and forms a quadratic form with the q remaining linearly independent elements.

It can be shown that the LMGMM test is also asymptotically equivalent to WGMM. First, we observe that V⁻¹R′(RCR′)⁻¹RV⁻¹ is a generalized inverse of the covariance of √n ∂ψn(θ̄)/∂θ in (18.15). Substituting this generalized inverse and √n ∂ψn(θ̄)/∂θ into the quadratic form yields

LMGMM = n (∂ψn(θ̄)/∂θ)′ [V⁻¹R′(RCR′)⁻¹RV⁻¹] (∂ψn(θ̄)/∂θ)
      = n(θ̂ − θ0)′R′[RCR′]⁻¹R(θ̂ − θ0) + op(1)
      = n r(θ̂)′[RCR′]⁻¹r(θ̂) + op(1)
      = WGMM + op(1).

Thus we see that the LMGMM and WGMM tests are asymptotically equivalent and will reject or not together in large samples, under the null hypothesis. Note that this equivalence holds regardless of the weight matrix used. They will both be equivalent to the LRGMM test if the weight matrix for both restricted and unrestricted GMM is asymptotically the optimal choice W = Ω⁻¹, so C = (Γ′Ω⁻¹Γ)⁻¹ = V⁻¹.

The above formulation of the LMGMM test avoids the necessity of finding a generalized inverse numerically, which can sometimes present problems in practice since generalized inverse algorithms are not as accurate as classical inversion algorithms. And recall that the quadratic form will be the same for any choice of generalized inverse. The generalized inverse can be further simplified if the weight matrix is asymptotically optimal, so C = V⁻¹. In this case, V⁻¹ is a generalized inverse and the quadratic form becomes

LMGMM = n (∂ψn(θ̄)/∂θ)′ V⁻¹ (∂ψn(θ̄)/∂θ) →d χ²q.

A feasible version of this statistic is obtained by substituting the consistent estimator V̂ = Vn(θ̄) for V to yield

LMGMM = n (∂ψn(θ̄)/∂θ)′ V̂⁻¹ (∂ψn(θ̄)/∂θ) →d χ²q.

Given the asymptotic equivalence of the tests, one might wonder why all three have been given consideration. The reasons are the same as they were for maximum likelihood. The Wald-type test only requires unrestricted estimates and the LM-type test only requires the restricted estimates. If one or the other is simpler to obtain, then the corresponding test is the natural choice. The LR-type test requires both restricted and unrestricted estimates but is invariant with respect to our choice of how to write the restrictions. Plus, the J-test falls out as a special case of the LR-type test. A cynical observer might suggest that all three persist in use because it gives us some choice, which allows us to better match our preconceptions.

18.4.5 Robust Covariance Estimation

Up until this point we have considered the case where the underlying z are jointly i.i.d. This assumption is satisfactory in many applications, including cross-section and panel data cases. The limiting covariance estimators for this case are asymptotically appropriate and include the White-type heteroskedasticity-consistent covariance estimators for the OLS, NLS, and IV estimators as special cases, among others. Time-series models that otherwise satisfy the i.i.d. assumption will be similarly covered.

In the case of potentially serially dependent data, the framework established above and all the various results apply with some modification. This does require additional assumptions and more attention to the estimated optimal weight matrix. The proofs of consistency only explicitly involved the i.i.d. assumption when Lemma 2.4 of Newey and McFadden was invoked. But this Lemma still applies if the data are strictly stationary and ergodic. Accordingly, we replace the assumption that the zt are i.i.d. with the assumption that they are strictly stationary and ergodic. Strict stationarity implies the joint distribution of (zτ, zτ+1, ..., zτ+h) does not depend on τ for any h. Ergodicity means the sample means are consistent: n⁻¹ Σ_{t=1}^n a(zt) →p E[a(z)] for measurable functions a(z) with E[|a(z)|] < ∞. This modification allows the consistency theorem and its proof to be applied.

The asymptotic normality result requires some additional concepts when serial dependence is allowed. Complications arise in satisfying condition (v) of Theorem 18.2. For the i.i.d. case we can appeal to the Lindeberg-Levy central limit theorem to obtain the result. For the dependent case, we need some modifications to the notation.

To begin with, the covariance of a single observation on g(zt,θ0) is no longer the same as the covariance of the limiting distribution of √n gn(θ0). Specifically, assuming all the needed covariances exist,

√n gn(θ0) = (1/√n) Σ_{t=1}^n g(zt,θ0) = (1/√n) Σ_{t=1}^n wt,

for wt = g(zt,θ0), has covariance matrix

Ωn = E[ (1/√n) Σ_{t=1}^n wt · (1/√n) Σ_{s=1}^n w′s ]
   = (1/n) Σ_{t=1}^n Σ_{s=1}^n E[wt w′s]
   = (1/n) [ Σ_{t=1}^n E[wt w′t] + Σ_{t=1}^n Σ_{s≠t} E[wt w′s] ].

Now, define Ωt−s = E[wt w′s] as the auto-covariance matrices, where due to stationarity they depend only on the difference between the indices; then we have

Ωn = (1/n) [ nΩ0 + Σ_{t=1}^{n−1} Σ_{s=t+1}^{n} (Ω|t−s| + Ω′|t−s|) ]
   = (1/n) [ nΩ0 + Σ_{s=1}^{n−1} (n − s)(Ωs + Ω′s) ]
   = Ω0 + Σ_{s=1}^{n−1} ((n − s)/n)(Ωs + Ω′s)
   = Ω0 + Σ_{s=1}^{n−1} (1 − s/n)(Ωs + Ω′s).


This structure reflects the fact that we only have covariances between wτ and wτ−s for τ = s + 1, ..., n.

Intuitively, a stochastic process {wt}∞t=−∞ is ergodic if any two collections of the random variables partitioned far enough apart are essentially independent. This means that the covariances in the sums above will grow smaller as the subscript grows larger. This is, of course, required in order for the covariance of the normalized partial sum √n gn(θ0) to be finite. Along these lines, we suppose the infinite sequence is convergent, so

Ω = lim_{n→∞} Ωn.

In the i.i.d. case Ω = Ω0, the own covariance, and the Lindeberg-Levy central limit theorem applies. In the present context, we will assume that a central limit theorem for dependent processes applies and

√n gn(θ0) = (1/√n) Σ_{t=1}^n wt →d N(0, Ω).

Obtaining this result requires more than strict stationarity and ergodicity. A stronger form of asymptotic independence is required of the sequence, such as α-mixing, or that it be a martingale difference. We will omit the technical details that justify the above assumption. With this assumption the results of Theorem 18.2 will still apply.

In order to conduct inference based on the results of the theorem, we will need consistent estimates of the components of the covariance matrix. The previously introduced estimator Γ̂ = Gn(θ̂) will continue to be consistent for Γ under the assumptions of Theorem 18.2. Whether conducting inference or performing efficient GMM estimation, we will need a consistent estimator of Ω. The problem is that Ωn involves the n covariance matrices (Ω0, Ω1, ..., Ωn−1), so consistent estimation of all the covariances is infeasible. Fortunately, these covariances grow smaller with the size of the index, so we can restrict our attention to the first m + 1. An estimate of the s-th covariance is provided by

Ω̂s = (1/(n − s)) Σ_{t=s+1}^n g(zt, θ̂) g(zt−s, θ̂)′ →p Ωs,

which will be consistent under the assumptions of Theorem 18.3. Newey and West suggest an estimator of Ω based on m + 1 of these estimated covariances,

Ω̂ = Ω̂0 + Σ_{s=1}^m (1 − s/(m + 1))(Ω̂s + Ω̂′s)

where m → ∞ but m/n → 0. They show that this estimator will be consistent under general conditions which will not be presented here.

An important issue with this covariance estimator is the choice of m, which is a tuning parameter. There is a tradeoff between increased bias as m is made smaller and increased variance as m is made larger. In their original work, Newey and West suggested that m = o(n^(1/4)) should be satisfied. Recent work on the problem suggests m = c·n^(1/3) for the optimal choice, and Stock and Watson suggest c = .75 as a likely choice. There has been much work on data-driven choices of m, which are beyond the scope of this chapter.
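A minimal sketch of the Newey-West estimator (illustrative only; it takes the n × m matrix whose rows are g(zt, θ̂) and uses the m = 0.75·n^(1/3) rule of thumb just mentioned as its default bandwidth):

    import numpy as np

    def newey_west_omega(G, bandwidth=None):
        """Newey-West estimate of Omega from the (n, m) matrix of g(z_t, theta_hat).

        Each autocovariance is scaled by 1/n here; the 1/(n - s) scaling in the
        text is an equally common choice.
        """
        n = G.shape[0]
        if bandwidth is None:
            # Rule of thumb m = 0.75 * n^(1/3) discussed in the text.
            bandwidth = int(np.floor(0.75 * n ** (1.0 / 3.0)))
        omega = G.T @ G / n                       # Omega_0
        for s in range(1, bandwidth + 1):
            gamma_s = G[s:].T @ G[:-s] / n        # Omega_s = E[w_t w_{t-s}']
            omega += (1.0 - s / (bandwidth + 1.0)) * (gamma_s + gamma_s.T)
        return omega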

18.5 Example

Keynes conjectured that the marginal propensity to consume (MPC) across individuals would be constant between 0 and 1. He also conjectured that the average propensity to consume (APC) would fall at higher income levels. Consistent with these conjectures, economists proposed the following linear model for consumption:

C = α + βY

where C is consumption and Y is disposable income. The marginal propensity to consume is dC/dY = β and is constant as income rises, while the average propensity to consume C/Y = α/Y + β will fall as income rises, provided α > 0. Graphically, for a particular income level Y, the MPC is the slope of the line C = α + βY and the APC is the slope of a ray from the origin passing through the line at that level.

Dusenberry proposed an empirical specification of this model at the aggregate level

Ct = α + βYt + ut

where ut is an unknown disturbance term. Early empirical studies gave strong support to this model, which fit the data very well. Based on cross-sectional and aggregate data, it seemed that income was the primary determinant of consumption.

Based on this model and expectations that income would rise following World War II, predictions were that aggregate consumption would grow more slowly than income as the APC began to fall. Consequently, there were predictions of secular stagnation and a renewed and prolonged depression following the war as government expenditures were scaled back. These dire predictions were wrong. Consumption rose at much the same rate as income. In fact, Kuznets showed that the APC was very stable at the aggregate level from decade to decade. This contradiction of the Keynesian conjecture in aggregate time series is known as the "consumption puzzle".

Two early proposals, based on forward-looking consumers, that attempted to explain this puzzle were the life-cycle hypothesis of Ando, Brumberg, and Modigliano and the permanent income hypothesis of Friedman. We will analyze the latter since it will highlight many desirable features of GMM. Friedman hypothesized that there were two components to current income Y: a permanent component Y^P, which is expected income, and a transitory component Y^T, which captures temporary and unpredictable deviations from that expectation, so

Y = Y^P + Y^T.

Friedman argued that consumers make their consumption decisions on the basis of permanent income only, via the relationship

C = βY^P

which yields APC = C/Y = β(Y^P/Y). If high-income households have higher transitory income than low-income households, then the APC is lower in high-income households. Over the long run, income variation is due mainly (if not solely) to variation in permanent income, which implies a stable APC. Note that there is no intercept in this model.

Friedman considered the following econometric specification

Ct = α + βY^P_t + ut

where α is expected to be small. He argued that a proper measure of permanent income is the present discounted value of all current and expected future streams of income. He further argued that a reasonable predictor of permanent income is a weighted average of current and past streams of income, with higher weights on recent income. A parsimonious specification that is consistent with this is the geometric lag specification

Y^P_t = (1 − λ)[Yt + λYt−1 + λ²Yt−2 + λ³Yt−3 + ...]

for 1 > λ > 0. The problem with this specification is that it introduces a large (infinite?) number of regressors with nonlinear restrictions on their coefficients.

An apparent simplification in the model is achieved by considering the Koyck lag transformation

Ct − λCt−1 = (1 − λ)α + β(Y^P_t − λY^P_{t−1}) + ut − λut−1
           = (1 − λ)α + (1 − λ)βYt + ut − λut−1

or, following some rearrangement,

Ct = (1 − λ)α + (1 − λ)βYt + λCt−1 + ut − λut−1
   = α∗ + β∗Yt + λCt−1 + u∗t

where the notation is obvious. This reparameterization suggests linear regression analysis by ordinary least squares, which was the path taken by Friedman. We are interested in whether or not α∗ = 0 and/or λ = 0. If α∗ > 0 then the permanent income model will not explain a stable APC over time. And if λ = 0, then the permanent income model reduces to the Dusenberry specification.

For our econometric analysis, we utilize quarterly aggregate data from 1952.1 to 1961.2, which is relevant for the time of Friedman's analysis. We begin by performing OLS on the Dusenberry specification and obtain (usual homoskedastic estimated standard errors in parentheses)

Ct = 1.53492 + 0.87199 Yt
    (2.51204)   (0.00789)

with R² = 0.99705 and DW = 0.64705. The intercept is nonzero but not at all significant, with a prob-value of 0.54528. The slope coefficient is in the hypothesized range and is significantly less than 1. The Durbin-Watson statistic is in the left-hand 1% tail, which indicates positive first-order serial correlation. The presence of serial correlation and possible heteroskedasticity means the standard errors are likely misstated. We could obtain standard errors that are robust against such problems but will proceed on to more complete specifications first.

Next we consider the Koyck transformation and estimate by OLS to obtain (usual homoskedastic estimated standard errors in parentheses)

Ct = 2.38178 + 0.43141 Yt + 0.50523 Ct−1
    (2.14943)   (0.09939)    (0.11410)

with R² = 0.99799 and DW = 1.15820. Assuming the classical least squares assumptions are satisfied, then under the null hypothesis that λ = 0, λ̂/s.e.(λ̂) = .50523/.11410 = 4.42787, which asymptotically has a two-sided prob-value of approximately zero. Since λ ≥ 0, the alternative hypothesis is one-sided and the one-sided prob-value is still approximately zero. We strongly reject the null. Moreover, the size of λ̂ indicates that there is a substantial contribution of lagged income to permanent income; the Dusenberry specification looks to be inadequate. The intercept is an estimate of α∗ and the coefficient on Yt is an estimate of β∗. Using the definitions of β∗ and α∗, we find β̂ = β̂∗/(1 − λ̂) = 0.87194 and α̂ = α̂∗/(1 − λ̂) = 4.81393, which, interestingly enough for β̂, is almost the same as the bivariate result. We will not calculate standard errors since the regression has obvious complications, as set out below.

The problem is that, in the presence of serially correlated errors, a lagged dependent variable can be expected to be correlated with the disturbances, and the OLS estimates will be biased and inconsistent. The Durbin-Watson test is biased against detecting serial correlation in the presence of lagged dependent variables, so we construct the Durbin h-test, which has a value of 3.6501 and, under the null of no serial correlation, is a realization of a standard normal and so has a prob-value of approximately zero. Moreover, even if the original errors ut were serially uncorrelated, u∗t = ut − λut−1 will be serially correlated for λ > 0.

The obvious solution to the problems revealed in the previous paragraph is to utilize instrumental variables techniques. The question is what to use as instruments. With some trepidation, we suppose that the errors in the determination of consumption are orthogonal to the determinants of income. Accordingly, the regressor Yt should not present problems, and likewise Yt−1. Moreover, Yt−1 is a primary determinant of Ct−1, so it is a good choice as an instrument. Thus the regression vector is x′t = (1, Yt, Ct−1) and the instrument vector is z′t = (1, Yt, Yt−1).

Instrumental variables applied to this problem yields (Newey-West standard errors in parentheses)

Ct = 1.88517 + 0.79065 Yt + 0.09187 Ct−1
    (2.94087)   (0.20584)    (0.24232)

with R² = 0.99721 and DW = .69252. First notice the much smaller value of λ̂ and the large value of its estimated standard error, which under the null of λ = 0 yields a ratio of 0.37911, which asymptotically has a one-sided prob-value of 0.35348. We cannot reject the null hypothesis that λ = 0. Note that the implied estimates β̂ = β̂∗/(1 − λ̂) = 0.87063 and α̂ = α̂∗/(1 − λ̂) = 2.07587 are again almost the same as the bivariate results. The corresponding Newey-West estimated standard errors for these transformations are 0.01102 and 3.00932. Testing the null hypothesis that β = 1 yields (β̂ − 1)/s.e.(β̂) = −11.739, which asymptotically has a one-sided prob-value of zero; β is significantly less than one and greater than zero. Testing the null that α = 0 yields α̂/s.e.(α̂) = 0.68981, which has a two-sided prob-value of 0.49499, so we cannot reject the null. These are very different results for λ than least squares gave, and they cast serious doubt on the permanent income hypothesis.

We now consider adding another instrument. For λ > 0, if the permanent income model is correct, then Ct will be correlated not only with Yt but also with Yt−1. This suggests that we can use Yt−1 and Yt−2 as instruments for Ct−1. Thus we now have z′t = (1, Yt, Yt−1, Yt−2) and the problem is over-identified. We have four equations

g(wt,θ) = [ Ct − [(1 − λ)α + (1 − λ)βYt + λCt−1]
            Yt (Ct − [(1 − λ)α + (1 − λ)βYt + λCt−1])
            Yt−1 (Ct − [(1 − λ)α + (1 − λ)βYt + λCt−1])
            Yt−2 (Ct − [(1 − λ)α + (1 − λ)βYt + λCt−1]) ]

where w′t = (Ct, Ct−1, Yt, Yt−1, Yt−2) and θ′ = (α, β, λ). Since GMM is designed to deal with nonlinear problems, we estimate α and β directly. Moreover, there will be no need to use the nonlinear transformation to obtain estimated standard errors. In view of the serial correlation resulting from the Koyck transformation and the evident serial correlation in the original bivariate regression, we will report Newey-West standard errors.
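In code, these four moment conditions might be set up as follows (a sketch only; C and Y are hypothetical arrays holding the quarterly consumption and income series, aligned so that two lags are available).

    import numpy as np

    def g_obs(theta, C, Y):
        """Rows are g(w_t, theta) for the four moment conditions above."""
        alpha, beta, lam = theta
        Ct, Clag = C[2:], C[1:-1]                     # C_t and C_{t-1}
        Yt, Ylag1, Ylag2 = Y[2:], Y[1:-1], Y[:-2]     # Y_t, Y_{t-1}, Y_{t-2}
        resid = Ct - ((1 - lam) * alpha + (1 - lam) * beta * Yt + lam * Clag)
        Z = np.column_stack([np.ones_like(Yt), Yt, Ylag1, Ylag2])
        return Z * resid[:, None]                     # (n, 4) moment matrix

    def gbar(theta, C, Y):
        return g_obs(theta, C, Y).mean(axis=0)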

We report three sets of GMM estimates for this over-identified case plus, for comparison purposes, the above results for the just-identified case. The one-step estimator uses ((1/n)Z′Z)⁻¹ as the weight matrix. This means that the one-step estimates will be nonlinear 2SLS estimates with the corresponding Newey-West estimated standard errors. We use these as the starting point for second-step efficient GMM and iterated efficient GMM estimation. Results are reported in Table 18.1 below.

Table 18.1: GMM Results (standard errors in parentheses)

GMM Procedure             α                    β                    λ
Just-identified (NLS)     4.81393 (5.64892)    0.87194 (0.01545)    0.50523 (0.09459)
Just-identified (NLIV)    2.07587 (3.00932)    0.87063 (0.01102)    0.09187 (0.24232)
First-step (NL2SLS)       1.54904 (3.33215)    0.87142 (0.01179)    0.02693 (0.25067)
Second-step (GMM)         3.63698 (3.09981)    0.86538 (0.01111)    0.07297 (0.2433)
Iterated (GMM)            4.86906 (3.14683)    0.86101 (0.01126)    0.06734 (0.24677)

In looking at the above table, one is immediately struck by how similar the results are from estimator to estimator, once we correct for the problem of a right-hand-side variable correlated with the errors. The first-step NL2SLS estimates are not that different from the just-identified IV estimates; adding the instrument did not make much difference for IV estimation. The second-step and iterated estimates are also not that different from each other, and are little changed from the first-step NL2SLS estimates. The estimates of λ are larger but still not significantly different from zero, in either case, at any conventional significance level. The estimates of α are larger than before but are still insignificantly different from zero. The estimates of β are in every case significantly greater than zero and less than one.

This analysis should make it clear that the permanent income hypothesis, while possibly explaining the consumption puzzle, has a difficult time explaining the behavior of the data. If we eliminate Ct−1 as a regressor, as the above results suggest, then we are back in the Dusenberry specification, which was also found wanting relative to the behavior of the data. A different model will have to be used to explain the consumption puzzle. On the question of the adequacy of the Koyck lag specification, we have J-test results for the second-step and iterated GMM estimators. The values are, respectively, 2.79864 and 2.98906, which for a chi-squared with one degree of freedom yield prob-values of 0.09434 and 0.08388. So the omnibus test of model adequacy does not reject the specification at the 5% level. This is probably because we have not been creative enough in setting up implied moment conditions for the Dusenberry model.
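For reference, the J statistic here is presumably the usual minimized GMM criterion scaled by the sample size,
$$
J = n\, g_n(\hat\theta)'\,\widehat{W}\, g_n(\hat\theta) \;\xrightarrow{d}\; \chi^2_{q-p},
$$
where Ŵ is the efficient weight (the inverse of the Newey-West covariance estimate); with q = 4 moment conditions and p = 3 parameters, the degrees of freedom are q − p = 1, as used above.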

18.A Appendix

We will first prove consistency by verifying the conditions of Theorem 4.5 from the Appendix of Chapter 4. For purposes of the proof below, the criterion function is the GMM objective ψn(θ) = (1/2) gn(θ)′Ŵ gn(θ) for θ ∈ Θ, with population counterpart ψ0(θ) defined in the proof.

Proof of Theorem 18.1. Define γ0(θ) = E[g(z,θ)], which is guaranteed to exist for θ ∈ Θ by (iii), and
$$
\psi_0(\theta) = \tfrac{1}{2}\,\gamma_0(\theta)'W\gamma_0(\theta).
$$
Now (C1), compactness of Θ, is assured by condition (i). Conditions (ii) and (iii) mean that Newey and McFadden's Lemma 2.4, presented in the Appendix to Chapter 5, is satisfied for a(z,θ) = g(z,θ). Thus γ0(θ) is continuous, whereupon (C3) is met, and moreover sup over θ ∈ Θ of ‖gn(θ) − γ0(θ)‖ →p 0. Identification condition (C4) follows since, by condition (iv), the quadratic form γ0(θ)′Wγ0(θ) ≥ 0 is zero only if θ = θ0.

This leaves us with the uniform convergence condition (C2) to verify. Adding and subtracting γ0(θ) and W inside the quadratic form in ψn(θ) and expanding yields
$$
\begin{aligned}
\psi_n(\theta) &= \tfrac{1}{2}\, g_n(\theta)'\widehat{W}\, g_n(\theta)\\
&= \tfrac{1}{2}\,[(g_n(\theta)-\gamma_0(\theta)) + \gamma_0(\theta)]'W[(g_n(\theta)-\gamma_0(\theta)) + \gamma_0(\theta)]\\
&\quad + \tfrac{1}{2}\,[(g_n(\theta)-\gamma_0(\theta)) + \gamma_0(\theta)]'(\widehat{W}-W)[(g_n(\theta)-\gamma_0(\theta)) + \gamma_0(\theta)]\\
&= \tfrac{1}{2}\,\gamma_0(\theta)'W\gamma_0(\theta) + (g_n(\theta)-\gamma_0(\theta))'W\gamma_0(\theta)\\
&\quad + \tfrac{1}{2}\,(g_n(\theta)-\gamma_0(\theta))'W(g_n(\theta)-\gamma_0(\theta))\\
&\quad + \tfrac{1}{2}\,\gamma_0(\theta)'(\widehat{W}-W)\gamma_0(\theta) + (g_n(\theta)-\gamma_0(\theta))'(\widehat{W}-W)\gamma_0(\theta)\\
&\quad + \tfrac{1}{2}\,(g_n(\theta)-\gamma_0(\theta))'(\widehat{W}-W)(g_n(\theta)-\gamma_0(\theta)).
\end{aligned}
$$

Since Θ is compact and γ0(θ) is continuous, γ0(θ) is bounded on Θ. By the triangle and Cauchy-Schwarz inequalities we have
$$
\begin{aligned}
|\psi_n(\theta)-\psi_0(\theta)| &\le |(g_n(\theta)-\gamma_0(\theta))'W\gamma_0(\theta)|
 + \tfrac{1}{2}\,|(g_n(\theta)-\gamma_0(\theta))'W(g_n(\theta)-\gamma_0(\theta))|\\
&\quad + \tfrac{1}{2}\,|\gamma_0(\theta)'(\widehat{W}-W)\gamma_0(\theta)|
 + |(g_n(\theta)-\gamma_0(\theta))'(\widehat{W}-W)\gamma_0(\theta)|\\
&\quad + \tfrac{1}{2}\,|(g_n(\theta)-\gamma_0(\theta))'(\widehat{W}-W)(g_n(\theta)-\gamma_0(\theta))|\\
&\le \|g_n(\theta)-\gamma_0(\theta)\|\,\|W\|\,\|\gamma_0(\theta)\|
 + \tfrac{1}{2}\,\|g_n(\theta)-\gamma_0(\theta)\|^2\,\|W\|\\
&\quad + \tfrac{1}{2}\,\|\gamma_0(\theta)\|^2\,\|\widehat{W}-W\|
 + \|g_n(\theta)-\gamma_0(\theta)\|\,\|\widehat{W}-W\|\,\|\gamma_0(\theta)\|\\
&\quad + \tfrac{1}{2}\,\|g_n(\theta)-\gamma_0(\theta)\|^2\,\|\widehat{W}-W\|.
\end{aligned}
$$
Since γ0(θ) is bounded on Θ, gn(θ) − γ0(θ) converges uniformly in probability to zero, and Ŵ →p W, it follows that sup over θ ∈ Θ of |ψn(θ) − ψ0(θ)| →p 0, so (C2) is satisfied.

We now prove asymptotic normality of the GMM estimator, but do not use Theorem 4.6.

Proof of Theorem 18.2. By (i), θ̂ ∈ N with probability approaching one, so we can expand gn(θ̂) in the first-order conditions in a Taylor's series around θ0 to obtain
$$
\begin{aligned}
0 &= G_n(\hat\theta)'\widehat{W}\, g_n(\hat\theta)\\
&= G_n(\hat\theta)'\widehat{W}\left[g_n(\theta_0) + G_n(\bar\theta)(\hat\theta - \theta_0)\right]\\
&= G_n(\hat\theta)'\widehat{W}\, g_n(\theta_0) + G_n(\hat\theta)'\widehat{W}\, G_n(\bar\theta)(\hat\theta - \theta_0)
\end{aligned}
$$
for θ̄ between θ̂ and θ0. Now, by (ii), (iii) and Lemma 2.4, sup over θ ∈ N of ‖Gn(θ) − Γ(θ)‖ →p 0 for Γ(θ) = E[∂g(z,θ)/∂θ′]. Thus Gn(θ̄) →p Γ(θ0) = Γ and Gn(θ̂)′Ŵ Gn(θ̄) →p Γ′WΓ, which is nonsingular by (iv). So, with probability approaching one, we can solve for (θ̂ − θ0) and scale by √n to obtain
$$
\begin{aligned}
\sqrt{n}\,(\hat\theta - \theta_0) &= -\left[G_n(\hat\theta)'\widehat{W}\, G_n(\bar\theta)\right]^{-1} G_n(\hat\theta)'\widehat{W}\,\sqrt{n}\, g_n(\theta_0)\\
&= -(\Gamma'W\Gamma)^{-1}\Gamma'W\,\sqrt{n}\, g_n(\theta_0)\\
&\quad + \left[(\Gamma'W\Gamma)^{-1}\Gamma'W - \left(G_n(\hat\theta)'\widehat{W}\, G_n(\bar\theta)\right)^{-1} G_n(\hat\theta)'\widehat{W}\right]\sqrt{n}\, g_n(\theta_0)\\
&= -(\Gamma'W\Gamma)^{-1}\Gamma'W\,\sqrt{n}\, g_n(\theta_0) + o_p\!\left(\left\|\sqrt{n}\, g_n(\theta_0)\right\|\right)
\end{aligned}
$$
since all the estimates in the brackets are consistent for their targets. By (v), √n gn(θ0) = Op(1), so the remainder term is op(1) and we have the result of the theorem.
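For completeness, combining the final display with condition (v), assuming as usual that it delivers √n gn(θ0) →d N(0, Ω), gives the familiar sandwich form of the limiting distribution:
$$
\sqrt{n}\,(\hat\theta - \theta_0) \;\xrightarrow{d}\; N\!\left(0,\ (\Gamma'W\Gamma)^{-1}\Gamma'W\Omega W\Gamma(\Gamma'W\Gamma)^{-1}\right),
$$
which collapses to (Γ′Ω⁻¹Γ)⁻¹ when the efficient choice W = Ω⁻¹ is used.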

We now prove consistency of a couple of covariance component estimators.

Proof of Theorem 18.3. As seen in the previous proof, condition (iii) of Theorem 18.2 assures Gn(θ̂) →p Γ(θ0) = Γ. Define Ωn(θ) = n⁻¹ Σ over t = 1 to n of g(zt,θ)g(zt,θ)′ and Ω(θ) = E[g(z,θ)g(z,θ)′]; then the local uniform boundedness condition in this theorem implies sup over θ ∈ N of ‖Ωn(θ) − Ω(θ)‖ →p 0 for θ in a neighborhood of θ0. With probability approaching one, θ̂ lies in this neighborhood, so ‖Ωn(θ̂) − Ω(θ̂)‖ →p 0 and Ω̂ = Ωn(θ̂) →p Ω(θ0) = Ω.