
  • 7/31/2019 2008_14_Chernozhukov


Identification and Estimation of Marginal Effects in Nonlinear Panel Models 1

    Victor Chernozhukov

    MIT

Iván Fernández-Val

    BU

    Jinyong Hahn

    UCLA

    Whitney Newey

    MIT

    February 4, 2009

1 First version of May 2007. We thank J. Angrist, B. Graham, and seminar participants of Brown University, CEMFI, CEMMAP Microeconometrics: Measurement Matters Conference, CEMMAP Inference in Partially Identified Models with Applications Conference, CIREQ Inference with Incomplete Models Conference, Georgetown, Harvard/MIT, MIT, UC Berkeley, USC, 2007 WISE Panel Data Conference, and 2009 Winter Econometric Society Meetings for helpful comments. Chernozhukov, Fernández-Val, and Newey gratefully acknowledge research support from the NSF.


    Abstract

This paper gives identification and estimation results for marginal effects in nonlinear panel models. We find that linear fixed effects estimators are not consistent, due in part to marginal effects not being identified. We derive bounds for marginal effects and show that they can tighten rapidly as the number of time series observations grows. We also show in numerical calculations that the bounds may be very tight for small numbers of observations, suggesting they may be useful in practice. We propose two novel inference methods for parameters defined as solutions to linear and nonlinear programs such as marginal effects in multinomial choice models. We show that these methods produce uniformly valid confidence regions in large samples. We give an empirical illustration.


    1 Introduction

Marginal effects are commonly used in practice to quantify the effect of variables on an outcome of interest. They are known as average treatment effects, average partial effects, and average structural functions in different contexts (e.g., see Wooldridge, 2002, Blundell and Powell, 2003). In panel data marginal effects average over unobserved individual heterogeneity. Chamberlain (1984) gave important results on identification of marginal effects in nonlinear panel data using control variables. Our paper gives identification and estimation results for marginal effects in panel data under time stationarity and discrete regressors.

It is sometimes thought that marginal effects can be estimated using linear fixed effects, as shown by Hahn (2001) in an example and Wooldridge (2005) under strong independence conditions. It turns out that the situation is more complicated. The marginal effect may not be identified. Furthermore, with a binary regressor, the linear fixed effects estimator uses the wrong weighting in estimation when the number of time periods T exceeds three. We show that correct weighting can be obtained by averaging individual regression coefficients, extending a result of Chamberlain (1982). We also derive nonparametric bounds for the marginal effect when it is not identified and when regressors are either exogenous or predetermined conditional on individual effects.

The nonparametric bounds are quite simple to compute and to use for inference but can be quite wide when T is small. We also consider bounds in semiparametric multinomial choice models where the form of the conditional probability given regressors and individual effects is specified. We find that the semiparametric bounds can be quite tight in binary choice models with additive heterogeneity.

We also give theorems showing that the bounds can tighten quickly as T grows. We find that the nonparametric bounds tighten exponentially fast when conditional probabilities of certain regressor values are bounded away from zero. We also find that in a semiparametric logit model the bounds tighten nearly that fast without any restriction on the distribution of regressors. These results suggest how the bounds can be used in practice. For large T the nonparametric bounds may provide useful information. For small T, bounds in semiparametric models may be quite tight. Also, the tightness of semiparametric bounds for small T makes it feasible to compute them for different small time intervals and combine results to improve efficiency. To illustrate their usefulness we provide an empirical illustration based on Chamberlain's (1984)

labor force participation example.

We also develop estimation and inference methods for semiparametric multinomial choice models. The inferential problem is rather challenging. Indeed, the programs that characterize the population bounds on model parameters and marginal effects are very difficult to use for



inference, since the data-dependent constraints are often infeasible in finite samples or under misspecification, which produces empty set estimates and confidence regions. We overcome these difficulties by projecting these data-dependent constraints onto the model space, thus producing an always feasible data-dependent constraint set. We then propose linear and nonlinear programming methods that use these new modified constraints. Our inference procedures have the appealing justification of targeting the true model under correct specification and targeting the best approximating model under incorrect specification. We develop two novel inferential procedures, one called modified projection and another perturbed bootstrap, that produce uniformly valid inference in large samples. These methods may be of substantial independent interest.

This paper builds on Honoré and Tamer (2006) and Chernozhukov, Hahn, and Newey (2004). These papers derived bounds for slope coefficients in autoregressive and static models, respectively. Here we instead focus on marginal effects and give results on the rate of convergence of bounds as T grows. Moreover, the identification results in Honoré and Tamer (2006) and Chernozhukov, Hahn, and Newey (2004) characterize the bounds via linear and nonlinear programs, and thus, for the reasons we stated above, they cannot be immediately used for practical estimation and inference. We propose new methods for estimation and inference, which are practical and which can be of interest in other problems, and we illustrate them with an empirical application.

Browning and Carro (2007) give results on marginal effects in autoregressive panel models. They find that more than additive heterogeneity is needed to describe some interesting applications. They also find that marginal effects are not generally identified in dynamic models. Chamberlain (1982) gives conditions for consistent estimation of marginal effects in linear correlated random coefficient models. Graham and Powell (2008) extend the analysis of Chamberlain (1982) by relaxing some of the regularity conditions in models with continuous regressors.

In semiparametric binary choice models Hahn and Newey (2004) gave theoretical and simulation results showing that fixed effects estimators of marginal effects in nonlinear models may have little bias, as suggested by Wooldridge (2002). Fernández-Val (2008) found that averaging fixed effects estimates of individual marginal effects has bias that shrinks faster as T grows than does the bias of slope coefficients. We show that, with small T, nonlinear fixed effects consistently estimates an identified component of the marginal effects. We also give numerical results showing that the bias of fixed effects estimators of the marginal effect is very small in a range of examples.

The bounds approach we take is different from the bias correction methods of Hahn and Kuersteiner (2002), Alvarez and Arellano (2003), Woutersen (2002), Hahn and Newey (2004), Hahn and Kuersteiner (2007), and Fernández-Val (2008). The bias corrections are based on large T approximations. The bounds approach takes explicit account of possible nonidentification for



fixed T. Inference accuracy of bias corrections will depend on T being the right size relative to the number of cross-section observations n, while inference for bounds does not.

In Section 2 we give a general nonparametric conditional mean model with correlated unobserved individual effects and analyze the properties of linear estimators. Section 3 gives bounds for marginal effects in these models and results on the rate of convergence of these bounds as T grows. Section 4 gives similar results, with tighter bounds, in a binary choice model with a location shift individual effect. Section 5 gives results and numerical examples on calculation of population bounds. Section 6 discusses estimation and Section 7 inference. Section 8 gives an empirical example.

    2 A Conditional Mean Model and Linear Estimators

The data consist of n observations of time series $Y_i = (Y_{i1},\ldots,Y_{iT})'$ and $X_i = [X_{i1},\ldots,X_{iT}]'$, for a dependent variable $Y_{it}$ and a vector of regressors $X_{it}$. We will assume throughout that $(Y_i, X_i)$, $(i = 1,\ldots,n)$, are independent and identically distributed observations. A case we consider in some depth is binary choice panel data where $Y_{it} \in \{0,1\}$. For simplicity we also give some results for binary $X_{it}$, where $X_{it} \in \{0,1\}$.

A general model we consider is a nonseparable conditional mean model as in Wooldridge (2005). Here there is an unobserved individual effect $\alpha_i$ and a function $m(x,\alpha)$ such that

$$E[Y_{it} \mid X_i, \alpha_i] = m(X_{it}, \alpha_i), \quad (t = 1,\ldots,T). \quad (1)$$

The individual effect $\alpha_i$ may be a vector of any dimension. For example, $\alpha_i$ could include individual slope coefficients in a binary choice model, where $Y_{it} \in \{0,1\}$, $F(\cdot)$ is a CDF, and

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i) = E[Y_{it} \mid X_i, \alpha_i] = F(X_{it}'\alpha_{i2} + \alpha_{i1}).$$

Such models have been considered by Browning and Carro (2007) in a dynamic setting. More familiar models with scalar $\alpha_i$ are also included. For example, the binary choice model with an individual location effect has

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i) = E[Y_{it} \mid X_i, \alpha_i] = F(X_{it}'\beta + \alpha_i).$$

This model has been studied by Chamberlain (1980, 1984, 1992), Hahn and Newey (2004), and others. The familiar linear model $E[Y_{it} \mid X_i, \alpha_i] = X_{it}'\beta + \alpha_i$ is also included as a special case of equation (1).

For binary $X_{it} \in \{0,1\}$ the model of equation (1) reduces to the correlated random coefficients model of Chamberlain (1982). For other $X_{it}$ with finite support that does not vary with $t$ it is a multiple regression version of that model.



The two critical assumptions made in equation (1) are that $X_i$ is strictly exogenous conditional on $\alpha_i$ and that $m(x,\alpha)$ does not vary with time. We consider identification without the strict exogeneity assumption below. Without time stationarity, identification becomes more difficult.

Our primary object of interest is the marginal effect given by

$$\mu_0 = \frac{\int [m(\bar{x},\alpha) - m(x,\alpha)]\,Q(d\alpha)}{D},$$

where $\bar{x}$ and $x$ are two possible values for the $X_{it}$ vector, $Q$ denotes the marginal distribution of $\alpha_i$, and $D$ is the distance, or number of units, corresponding to $\bar{x} - x$. This object gives the average, over the marginal distribution, of the per unit effect of changing $x$ to $\bar{x}$. It is the average treatment effect in the treatment effects literature. For example, suppose $x = (x_1, x_2')'$ where $x_1$ is a scalar, and $\bar{x} = (\bar{x}_1, x_2')'$. Then $D = \bar{x}_1 - x_1$ would be an appropriate distance measure and

$$\mu_0 = \frac{\int [m(\bar{x}_1, x_2, \alpha) - m(x_1, x_2, \alpha)]\,Q(d\alpha)}{\bar{x}_1 - x_1}$$

would be the per unit effect of changing the first component of $X_{it}$. Here one could also consider averages of the marginal effects over different values of $x_2$.

For example, consider an individual location effect for binary $Y_{it}$ where $m(x,\alpha) = F(x'\beta + \alpha)$. Here the marginal effect will be

$$\mu_0 = D^{-1}\int [F(\bar{x}'\beta + \alpha) - F(x'\beta + \alpha)]\,Q(d\alpha).$$

The restrictions this binary choice model places on the conditional distribution of $Y_{it}$ given $X_i$ and $\alpha_i$ will be useful for bounding marginal effects, as further discussed below.

In this paper we focus on the discrete case where the support of $X_i$ is a finite set. Thus, the events $X_{it} = \bar{x}$ and $X_{it} = x$ have positive probability and no smoothing is required. It would also be interesting to consider continuous $X_{it}$.

Linear fixed effect estimators are used in applied research to estimate marginal effects. For example, the linear probability model with fixed effects has been applied when $Y_{it}$ is binary. Unfortunately, this estimator is not generally consistent for the marginal effect. There are two reasons for this. The first is that the marginal effect is generally not identified, as shown by Chamberlain (1982) for binary $X_{it}$. Second, the fixed effects estimator uses incorrect weighting.

To explain, we compare the limit of the usual linear fixed effects estimator with the marginal effect $\mu_0$. Suppose that $X_i$ has finite support $\{X^1,\ldots,X^K\}$ and let $Q_k(\alpha)$ denote the CDF of the distribution of $\alpha_i$ conditional on $X_i = X^k$. Define

$$\mu_k = \int [m(\bar{x},\alpha) - m(x,\alpha)]\,Q_k(d\alpha)/D, \qquad P_k = \Pr(X_i = X^k).$$


This $\mu_k$ is the marginal effect conditional on the entire time series $X_i = [X_{i1},\ldots,X_{iT}]'$ being equal to $X^k$. By iterated expectations,

$$\mu_0 = \sum_{k=1}^{K} P_k \mu_k. \quad (2)$$

    We will compare this formula with the limit of linear xed effects estimators.

An implication of the conditional mean model that is crucial for identification is

$$E[Y_{it} \mid X_i = X^k] = \int m(X^k_t, \alpha)\,Q_k(d\alpha). \quad (3)$$

This equation allows us to identify some of the $\mu_k$ from differences across time periods of identified conditional expectations.

To simplify the analysis of the linear fixed effect estimator we focus on binary $X_{it} \in \{0,1\}$.

Consider $\hat{\beta}_w$ from least squares on

$$Y_{it} = \beta X_{it} + \alpha_i + v_{it}, \quad (t = 1,\ldots,T;\ i = 1,\ldots,n),$$

where each $\alpha_i$ is estimated. This is the usual within estimator, where for $\bar{X}_i = \sum_{t=1}^T X_{it}/T$,

$$\hat{\beta}_w = \frac{\sum_{i,t}(X_{it} - \bar{X}_i)Y_{it}}{\sum_{i,t}(X_{it} - \bar{X}_i)^2}.$$

Here the estimator of the marginal effect is just $\hat{\beta}_w$. To describe its limit, let $r_k = \#\{t : X^k_t = 1\}/T$ and $\sigma^2_k = r_k(1 - r_k)$ be the variance of a binomial with probability $r_k$.

Theorem 1: If equation (1) is satisfied, $(X_i, Y_i)$ has finite second moments, and $\sum_{k=1}^K P_k \sigma^2_k > 0$, then

$$\hat{\beta}_w \stackrel{p}{\longrightarrow} \frac{\sum_{k=1}^K P_k \sigma^2_k \mu_k}{\sum_{k=1}^K P_k \sigma^2_k}. \quad (4)$$

This result is similar to Angrist (1998), who found that, in a treatment effects model in cross section data, the partially linear slope estimator is a variance weighted average effect. Comparing equations (2) and (4) we see that the linear fixed effects estimator converges to a weighted average of $\mu_k$, weighted by $\sigma^2_k$, rather than the simple average in equation (2). The weights are never completely equal, so that the linear fixed effects estimator is not consistent for the marginal effect unless how $\mu_k$ varies with $k$ is restricted. Imposing restrictions on how $\mu_k$ varies with $k$ amounts to restricting the conditional distribution of $\alpha_i$ given $X_i$, which we are not doing in this paper.

One reason for inconsistency of $\hat{\beta}_w$ is that certain $\mu_k$ receive zero weight. For notational purposes let $X^1 = (0,\ldots,0)'$ and $X^K = (1,\ldots,1)'$ (where we implicitly assume that these are included in the support of $X_i$). Note that $\sigma^2_1 = \sigma^2_K = 0$, so that $\mu_1$ and $\mu_K$ are not included in



the weighted average. The explanation for their absence is that $\mu_1$ and $\mu_K$ are not identified. These are marginal effects conditional on $X_i$ equal to a vector of constants, where there are no changes over time to help identify the effect from equation (3). Nonidentification of these effects was pointed out by Chamberlain (1982).

Another reason for inconsistency of $\hat{\beta}_w$ is that for $T \geq 4$ the weights on $\mu_k$ will be different from the corresponding weights for $\mu_0$. This is because $r_k$ varies for $k \notin \{1, K\}$ except when $T = 2$ or $T = 3$.

This result is different from Hahn (2001), who found that $\hat{\beta}_w$ consistently estimates the marginal effect. Hahn (2001) restricted the support of $X_i$ to exclude both $(0,\ldots,0)'$ and $(1,\ldots,1)'$ and only considered a case with $T = 2$. Thus, neither feature that causes inconsistency of $\hat{\beta}_w$ was present in that example. As noted by Hahn (2001), the conditions that lead to consistency of the linear fixed effects estimator in his example are quite special.

Theorem 1 is also different from Wooldridge (2005). There it is shown that if $b_i = m(1,\alpha_i) - m(0,\alpha_i)$ is mean independent of $X_{it} - \bar{X}_i$ for each $t$ then linear fixed effects is consistent. The problem is that this independence assumption is very strong when $X_{it}$ is discrete. Note that for $T = 2$, $X_{i2} - \bar{X}_i$ takes on the value $0$ when $X_i = (1,1)'$ or $(0,0)'$, $-1/2$ when $X_i = (1,0)'$, and $1/2$ when $X_i = (0,1)'$. Thus mean independence of $b_i$ and $X_{i2} - \bar{X}_i$ actually implies that $\mu_2 = \mu_3$ and that these are equal to the marginal effect conditional on $X_i \in \{X^1, X^4\}$. This is quite close to independence of $b_i$ and $X_i$, which is not very interesting if we want to allow correlation between the regressors and the individual effect.

The lack of identification of $\mu_1$ and $\mu_K$ means the marginal effect is actually not identified. Therefore, no consistent estimator of it exists. Nevertheless, when $m(x,\alpha)$ is bounded there are informative bounds for $\mu_0$, as we show below.

The second reason for inconsistency of $\hat{\beta}_w$ can be corrected by modifying the estimator. In the binary $X_{it}$ case Chamberlain (1982) gave a consistent estimator for the identified effect $\mu_I = \sum_{k=2}^{K-1} P_k \mu_k / \sum_{k=2}^{K-1} P_k$. The estimator is obtained by averaging across individuals the least squares estimates of $\beta_i$ in

$$Y_{it} = \beta_i X_{it} + \alpha_i + v_{it}, \quad (t = 1,\ldots,T;\ i = 1,\ldots,n).$$

For $s^2_{xi} = \sum_{t=1}^T (X_{it} - \bar{X}_i)^2$ and $n^* = \sum_{i=1}^n 1(s^2_{xi} > 0)$, this estimator takes the form

$$\hat{\mu} = \frac{1}{n^*}\sum_{i=1}^n 1(s^2_{xi} > 0)\,\frac{\sum_{t=1}^T (X_{it} - \bar{X}_i)Y_{it}}{s^2_{xi}}.$$

This is equivalent to running least squares in the model

$$Y_{it} = \beta_k X_{it} + \alpha_k + v_{it}, \quad (5)$$



for individuals with $X_i = X^k$, and averaging $\hat{\beta}_k$ over $k$ weighted by the sample frequencies of $X^k$.

The estimator of the identified marginal effect $\mu_I$ can easily be extended to any discrete $X_{it}$. To describe the extension, let $\bar{d}_{it} = 1(X_{it} = \bar{x})$, $d_{it} = 1(X_{it} = x)$, $\bar{r}_i = \sum_{t=1}^T \bar{d}_{it}/T$, $r_i = \sum_{t=1}^T d_{it}/T$, and $n^* = \sum_{i=1}^n 1(\bar{r}_i > 0)1(r_i > 0)$. The estimator is given by

$$\hat{\mu} = \frac{1}{n^*}\sum_{i=1}^n 1(\bar{r}_i > 0)1(r_i > 0)\left[\frac{\sum_{t=1}^T \bar{d}_{it}Y_{it}}{T\bar{r}_i} - \frac{\sum_{t=1}^T d_{it}Y_{it}}{Tr_i}\right].$$
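As a concrete illustration, the trimmed estimator above can be sketched in a few lines of Python. This is our own hypothetical implementation (the names `trimmed_estimator`, `x_bar`, and `x_low` are not from the paper): individuals whose time series does not contain both regressor values are dropped, mirroring the indicator $1(\bar{r}_i > 0)1(r_i > 0)$.

```python
def trimmed_estimator(X, Y, x_bar, x_low, D=1.0):
    """Average of per-individual differences in mean outcomes between
    periods with X_it = x_bar and periods with X_it = x_low, dropping
    individuals whose series does not contain both values."""
    diffs = []
    for xi, yi in zip(X, Y):
        hi = [y for x, y in zip(xi, yi) if x == x_bar]
        lo = [y for x, y in zip(xi, yi) if x == x_low]
        if hi and lo:  # the indicator 1(r_bar_i > 0) * 1(r_i > 0)
            diffs.append((sum(hi) / len(hi) - sum(lo) / len(lo)) / D)
    if not diffs:
        raise ValueError("no individual has both regressor values")
    return sum(diffs) / len(diffs)
```

With binary $X_{it}$ this reduces to averaging the individual least squares slopes over individuals with $s^2_{xi} > 0$.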

This estimator extends Chamberlain's (1982) estimator to the case where $X_{it}$ is not binary. To describe the limit of the estimator in general, let $K^* = \{k : \text{there is } \bar{t} \text{ and } t \text{ such that } X^k_{\bar{t}} = \bar{x} \text{ and } X^k_t = x\}$. This is the set of possible values for $X_i$ where both $\bar{x}$ and $x$ occur for at least one time period, allowing identification of the marginal effect from differences. For all other values of $k$, either $\bar{x}$ or $x$ will be missing from the observations and the marginal effect will not be identified. In the next Section we will consider bounds for those effects.

Theorem 2: If equation (1) is satisfied, $(X_i, Y_i)$ have finite second moments and $\sum_{k\in K^*} P_k > 0$, then

$$\hat{\mu} \stackrel{p}{\longrightarrow} \mu_I = \sum_{k\in K^*} P^*_k \mu_k, \qquad \text{where } P^*_k = P_k \Big/ \sum_{k\in K^*} P_k.$$

Here $\hat{\mu}$ is not an efficient estimator of $\mu_I$ for $T \geq 3$, because $\hat{\mu}$ is least squares over time, which does not account properly for time series heteroskedasticity or autocorrelation. An efficient estimator could be obtained by a minimum distance procedure, though that is complicated. Also, one would have only few observations to estimate the needed weighting matrices, so its properties may not be great in small to medium sized samples. For these reasons we leave construction of

an efficient estimator to future work.

To see how big the inconsistency of the linear estimators can be we consider a numerical example, where $X_{it} \in \{0,1\}$ is i.i.d. across $i$ and $t$, $\Pr(X_{it} = 1) = p_X$, $\varepsilon_{it}$ is i.i.d. $N(0,1)$,

$$Y_{it} = 1(X_{it} + \alpha_i + \varepsilon_{it} > 0), \qquad \alpha_i = \sqrt{T}(\bar{X}_i - p_X)\big/\sqrt{p_X(1 - p_X)}.$$

Here we consider the marginal effect for $\bar{x} = 1$, $x = 0$, $D = 1$, given by

$$\mu_0 = \int [\Phi(1 + \alpha) - \Phi(\alpha)]\,Q(d\alpha).$$

Table 1 and Figure 1 give numerical values for $(\beta_w - \mu_0)/\mu_0$ and $(\mu - \mu_0)/\mu_0$ for several values of $T$ and $p_X$, where $\beta_w = \mathrm{plim}\,\hat{\beta}_w$ and $\mu = \mathrm{plim}\,\hat{\mu}$.

We find that the biases (inconsistencies) can be large in percentage terms. We also find that biases are largest when $p_X$ is small. In this example, the inconsistency of fixed effects estimators



of marginal effects seems to be largest when the regressor values are sparse. Also we find that differences between the limits of $\hat{\mu}$ and $\hat{\beta}_w$ are larger for larger $T$, which is to be expected due to the weights differing more for larger $T$.
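The probability limits in this example can be computed exactly, since $\alpha_i$ is a deterministic function of $\bar{X}_i$. The following Python sketch (our own illustration of the design above, with slope coefficient set to $1$) enumerates the $T+1$ possible values of $\sum_t X_{it}$ and evaluates $\mu_0$ from equation (2), $\mathrm{plim}\,\hat{\beta}_w$ from Theorem 1, and $\mathrm{plim}\,\hat{\mu}$ from Theorem 2.

```python
from math import comb, erf, sqrt

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def limits(T, pX, beta=1.0):
    """mu_0, plim beta_w, plim mu-hat in the probit design above,
    where alpha_i is a deterministic function of X-bar_i."""
    mu0 = 0.0
    bw_num = bw_den = 0.0   # Theorem 1: weights P_k * sigma_k^2
    id_num = id_den = 0.0   # Theorem 2: weights P_k on the identified set
    for s in range(T + 1):  # s = number of periods with X_it = 1
        P = comb(T, s) * pX**s * (1.0 - pX)**(T - s)
        r = s / T
        alpha = sqrt(T) * (r - pX) / sqrt(pX * (1.0 - pX))
        mu = Phi(beta + alpha) - Phi(alpha)   # mu_k for this group
        mu0 += P * mu
        w = P * r * (1.0 - r)
        bw_num += w * mu
        bw_den += w
        if 0 < s < T:       # sequences containing both x-bar = 1 and x = 0
            id_num += P * mu
            id_den += P
    return mu0, bw_num / bw_den, id_num / id_den
```

For $T = 4$ and $p_X = 0.5$ the three limits differ, consistent with the inconsistencies discussed above, while for $T = 2$ the two estimator limits coincide because the weights agree.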

    3 Bounds in the Conditional Mean Model

Although the marginal effect $\mu_0$ is not identified, it is straightforward to bound it. Also, as we will show below, these bounds can be quite informative, motivating the analysis that follows.

Some additional notation is useful for describing the results. Let

$$m_{kt} = E[Y_{it} \mid X_i = X^k]/D$$

be the identified conditional expectation of each time period observation on $Y_{it}$ conditional on the $k$th support point. Also, let $\Delta(\alpha) = [m(\bar{x},\alpha) - m(x,\alpha)]/D$. The next result gives identification and bound results for $\mu_k$, which can then be used to obtain bounds for $\mu_0$.

Lemma 3: Suppose that equation (1) is satisfied. If there is $\bar{t}$ and $t$ such that $X^k_{\bar{t}} = \bar{x}$ and $X^k_t = x$ then $\mu_k = m_{k\bar{t}} - m_{kt}$.

Suppose that $B_\ell \leq m(x,\alpha)/D \leq B_u$. If there is $\bar{t}$ such that $X^k_{\bar{t}} = \bar{x}$ then

$$m_{k\bar{t}} - B_u \leq \mu_k \leq m_{k\bar{t}} - B_\ell.$$

Also, if there is $t$ such that $X^k_t = x$ then

$$B_\ell - m_{kt} \leq \mu_k \leq B_u - m_{kt}.$$

Suppose that $\Delta(\alpha)$ has the same sign for all $\alpha$. Then if for some $k$ there is $\bar{t}$ and $t$ such that $X^k_{\bar{t}} = \bar{x}$ and $X^k_t = x$, the sign of $\Delta(\alpha)$ is identified. Furthermore, if $\Delta(\alpha)$ is positive then the lower bounds may be replaced by zero, and if $\Delta(\alpha)$ is negative then the upper bounds may be replaced by zero.

The bounds on each $\mu_k$ can be combined to obtain bounds for the marginal effect $\mu_0$. Let

$$\bar{K} = \{k : \text{there is } \bar{t} \text{ such that } X^k_{\bar{t}} = \bar{x} \text{ but no } t \text{ such that } X^k_t = x\},$$
$$\underline{K} = \{k : \text{there is } t \text{ such that } X^k_t = x \text{ but no } \bar{t} \text{ such that } X^k_{\bar{t}} = \bar{x}\}.$$

Also, let $P_0 = \Pr(X_{it} \notin \{x, \bar{x}\} \text{ for all } t)$. The following result is obtained by multiplying the $k$th bound in Lemma 3 by $P_k$ and summing.



Theorem 4: If equation (1) is satisfied and $B_\ell \leq m(x,\alpha)/D \leq B_u$ then $\mu_\ell \leq \mu_0 \leq \mu_u$ for

$$\mu_\ell = P_0(B_\ell - B_u) + \sum_{k\in\bar{K}} P_k(m_{k\bar{t}} - B_u) + \sum_{k\in\underline{K}} P_k(B_\ell - m_{kt}) + \sum_{k\in K^*} P_k\mu_k,$$

$$\mu_u = P_0(B_u - B_\ell) + \sum_{k\in\bar{K}} P_k(m_{k\bar{t}} - B_\ell) + \sum_{k\in\underline{K}} P_k(B_u - m_{kt}) + \sum_{k\in K^*} P_k\mu_k.$$

If $\Delta(\alpha)$ has the same sign for all $\alpha$ and there is some $k$ such that $X^k_{\bar{t}} = \bar{x}$ and $X^k_t = x$, the sign of $\mu_0$ is identified, and if $\mu_0 > 0$ ($< 0$) then $\mu_\ell$ ($\mu_u$) can be replaced by $\sum_{k\in K^*} P_k\mu_k$.

An estimator can be constructed by replacing the probabilities by sample proportions $\hat{P}_k = \sum_i 1(X_i = X^k)/n$ and $\hat{P}_0 = 1 - \sum_{k\in\bar{K}}\hat{P}_k - \sum_{k\in\underline{K}}\hat{P}_k - \sum_{k\in K^*}\hat{P}_k$, and each $m_{kt}$ by

$$\hat{m}_{kt} = 1(n_k > 0)\sum_{i=1}^n 1(X_i = X^k)Y_{it}/n_k, \qquad n_k = \sum_{i=1}^n 1(X_i = X^k).$$

Estimators of the lower and upper bound respectively are given by

$$\hat{\mu}_\ell = \hat{P}_0(B_\ell - B_u) + \sum_{k\in\bar{K}} \hat{P}_k(\hat{m}_{k\bar{t}} - B_u) + \sum_{k\in\underline{K}} \hat{P}_k(B_\ell - \hat{m}_{kt}) + \hat{\mu}\,(n^*/n),$$

$$\hat{\mu}_u = \hat{P}_0(B_u - B_\ell) + \sum_{k\in\bar{K}} \hat{P}_k(\hat{m}_{k\bar{t}} - B_\ell) + \sum_{k\in\underline{K}} \hat{P}_k(B_u - \hat{m}_{kt}) + \hat{\mu}\,(n^*/n).$$
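For the binary-regressor case, these estimators reduce to simple sample averages. The Python sketch below is our own hypothetical illustration (with $\bar{x} = 1$, $x = 0$, and `bound_estimates` our own name); it uses each individual's time average of $Y_{it}$ in place of a single period $\bar{t}$, which is a natural choice under the time stationarity of equation (1).

```python
def bound_estimates(X, Y, Bl=0.0, Bu=1.0):
    """Lower and upper bound estimates for mu_0 with binary X_it
    (x_bar = 1, x = 0), averaging one term per individual."""
    n = len(X)
    lo_terms, hi_terms = [], []
    for xi, yi in zip(X, Y):
        ones = [y for x, y in zip(xi, yi) if x == 1]
        zeros = [y for x, y in zip(xi, yi) if x == 0]
        if ones and zeros:          # mu_k identified: point contribution
            d = sum(ones) / len(ones) - sum(zeros) / len(zeros)
            lo_terms.append(d)
            hi_terms.append(d)
        elif ones:                  # X_i all ones: only E[Y | X = 1] seen
            m = sum(ones) / len(ones)
            lo_terms.append(m - Bu)
            hi_terms.append(m - Bl)
        else:                       # X_i all zeros: only E[Y | X = 0] seen
            m = sum(zeros) / len(zeros)
            lo_terms.append(Bl - m)
            hi_terms.append(Bu - m)
    return sum(lo_terms) / n, sum(hi_terms) / n
```

Averaging these per-individual terms reproduces the weighting by the sample proportions $\hat{P}_k$.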

The bounds $\hat{\mu}_\ell$ and $\hat{\mu}_u$ will be jointly asymptotically normal with a variance matrix that can be estimated in the usual way, so that set inference can be carried out as described in Chernozhukov, Hong, and Tamer (2007) or Beresteanu and Molinari (2008).

As an example, consider the binary $X$ case where $X_{it} \in \{0,1\}$, $\bar{x} = 1$, and $x = 0$. Let $X^K$ denote a $T \times 1$ unit vector and $X^1$ the $T \times 1$ zero vector, assumed to lie in the support of $X_i$. Here the bounds will be

$$\hat{\mu}_\ell = \hat{P}_K(\hat{m}_{K\bar{t}} - B_u) + \hat{P}_1(B_\ell - \hat{m}_{1t}) + \hat{\mu}\,(n^*/n), \qquad \hat{\mu}_u = \hat{P}_K(\hat{m}_{K\bar{t}} - B_\ell) + \hat{P}_1(B_u - \hat{m}_{1t}) + \hat{\mu}\,(n^*/n).$$


This result gives conditions for identification as $T$ grows, generalizing a result of Chamberlain (1982) for binary $X_{it}$. In addition, it shows that the bounds derived above shrink to the marginal effect as $T$ grows. The rate at which the bounds converge in the general model is a complicated question. Here we will address it in an example and leave general treatment to another setting. The example we consider is that where $X_{it} \in \{0,1\}$.

Theorem 6: If equation (1) is satisfied, $B_\ell \leq m(x,\alpha)/D \leq B_u$, and $X_i$ is stationary and Markov of order $J$ conditional on $\alpha_i$, then for $p_{1i} = \Pr(X_{it} = 0 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 0, \alpha_i)$ and $p_{Ki} = \Pr(X_{it} = 1 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 1, \alpha_i)$,

$$\max\{|\mu_\ell - \mu_0|,\ |\mu_u - \mu_0|\} \leq (B_u - B_\ell)\,E[(p_{1i})^{T-J} + (p_{Ki})^{T-J}].$$

If there is $\epsilon > 0$ such that $p_{1i} \leq 1 - \epsilon$ and $p_{Ki} \leq 1 - \epsilon$ then

$$\max\{|\mu_\ell - \mu_0|,\ |\mu_u - \mu_0|\} \leq (B_u - B_\ell)\,2(1 - \epsilon)^{T-J}.$$

If there is a set $\mathcal{A}$ of $\alpha_i$ such that $\Pr(\mathcal{A}) > 0$ and either $\Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) > 0$ and $p_{1i} = 1$ for all $\alpha_i \in \mathcal{A}$, or $\Pr(X_{i1} = \cdots = X_{iJ} = 1 \mid \alpha_i) > 0$ and $p_{Ki} = 1$ for all $\alpha_i \in \mathcal{A}$, then $\mu_\ell \not\to \mu_0$ or $\mu_u \not\to \mu_0$.

When the conditional probabilities that $X_{it}$ is zero or one are bounded away from one, the bounds will converge at an exponential rate. We conjecture that an analogous result could be shown for general $X_{it}$. The conditions that imply that one of the bounds does not converge violate a hypothesis of Theorem 5, that the conditional support of $X_{it}$ equals the marginal support. Theorem 6 shows that in this case the bounds may not shrink to the marginal effect.

The bounds may converge, but not exponentially fast, depending on $\Pr(\cdot \mid \alpha_i)$ and the distribution of $\alpha_i$. For example, suppose that $X_{it} = 1(\alpha_i - \epsilon_{it} > 0)$, $\alpha_i \sim N(0,1)$, $\epsilon_{it} \sim N(0,1)$, with $\alpha_i$ i.i.d. over $i$, and $\epsilon_{it}$ i.i.d. over $t$ and independent of $\alpha_i$. Then

$$P_K = E[\Phi(\alpha_i)^T] = \int \Phi(\alpha)^T \phi(\alpha)\,d\alpha = \left[\frac{\Phi(\alpha)^{T+1}}{T+1}\right]_{-\infty}^{+\infty} = \frac{1}{T+1}.$$
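The value $1/(T+1)$ can be checked numerically. The sketch below is our own check (not from the paper), approximating $E[\Phi(\alpha_i)^T]$ by trapezoidal quadrature over a truncated range where the normal density is non-negligible.

```python
from math import erf, exp, pi, sqrt

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):  # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def P_K(T, lo=-10.0, hi=10.0, steps=40000):
    """Trapezoidal approximation of E[Phi(alpha)^T] for alpha ~ N(0,1)."""
    h = (hi - lo) / steps
    total = 0.5 * (Phi(lo) ** T * phi(lo) + Phi(hi) ** T * phi(hi))
    for i in range(1, steps):
        x = lo + i * h
        total += Phi(x) ** T * phi(x)
    return total * h
```

The same conclusion follows from the change of variables $U = \Phi(\alpha_i) \sim \mathrm{Uniform}(0,1)$, so that $E[U^T] = 1/(T+1)$.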

In this example the bounds will converge at the slow rate $1/T$. More generally, the convergence rate will depend on the distribution of $p_{1i}$ and $p_{Ki}$.

It is interesting to note that the convergence rates we have derived so far depend only on

the properties of the joint distribution of $(X_i, \alpha_i)$, and not on the properties of the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$. This feature of the problem is consistent with us placing no restrictions on $m(x,\alpha)$. In Section 5 we find that the bounds and rates may be improved when the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$ is restricted.



    4 Predetermined Regressors

The previous bound analysis can be extended to cases where the regressor $X_{it}$ is just predetermined instead of strictly exogenous. These cases cover, for example, dynamic panel models where $X_{it}$ includes lags of $Y_{it}$. To describe this extension let $X_i(t) = [X_{i1},\ldots,X_{it}]'$ and suppose that

$$E[Y_{it} \mid X_i(t), \alpha_i] = m(X_{it}, \alpha_i), \quad (t = 1,\ldots,T). \quad (7)$$

For example, this includes the heterogeneous, dynamic binary choice model of Browning and Carro (2007), where $Y_{it} \in \{0,1\}$ and $X_{it} = Y_{i,t-1}$.

As before, the marginal effect is given by $\mu_0 = \int [m(\bar{x},\alpha) - m(x,\alpha)]\,Q(d\alpha)/D$ for two different possible values $\bar{x}$ and $x$ of the regressors and a distance $D$. Also, as before, the marginal effect will have an identified component and an unidentified component. The key implication that is used to obtain the identified component is

$$E[Y_{it} \mid X_i(t) = X(t)] = \int m(X_t, \alpha)\,Q(d\alpha \mid X_i(t) = X(t)), \quad (8)$$

where $X(t) = [X_1,\ldots,X_t]'$.

Bounds are obtained by partitioning the set of possible $X_i$ into subsets that can make use of the above key implication and a subset where bounds on $m(x,\alpha)/D$ are applied. The key implication applies to subsets of the form $\mathcal{X}_t(x) = \{X : X_t = x,\ X_s \neq x\ \forall s < t\}$, that is, the set of possible $X_i$ vectors that have $x$ as the $t$th component and not as any previous component. The bound applies to the same subset as before, that where $x$ never appears, given by $\mathcal{X}(x) = \{X : X_t \neq x\ \forall t\}$. Together the union of $\mathcal{X}_t(x)$ over all $t$ and $\mathcal{X}(x)$ constitute a partition of possible $X$ vectors. Let $P(x) = \Pr(X_i \in \mathcal{X}(x))$ be the probability that none of the components of $X_i$ is equal to $x$, and

$$\delta_0 = E\Big[\sum_{t=1}^T \{1(X_i \in \mathcal{X}_t(\bar{x})) - 1(X_i \in \mathcal{X}_t(x))\}\,Y_{it}\Big]\Big/D.$$

Then the key implication and iterated expectations give

Theorem 7: If equation (7) is satisfied and $B_\ell \leq m(x,\alpha)/D \leq B_u$ then $\mu_\ell \leq \mu_0 \leq \mu_u$ for

$$\mu_\ell = \delta_0 + B_\ell P(\bar{x}) - B_u P(x), \qquad \mu_u = \delta_0 + B_u P(\bar{x}) - B_\ell P(x). \quad (9)$$

As previously, estimates of these bounds can be formed from sample analogs. Let $\hat{P}(x) = \sum_{i=1}^n 1(X_i \in \mathcal{X}(x))/n$ and

$$\hat{\delta} = \sum_{i=1}^n \sum_{t=1}^T [1(X_i \in \mathcal{X}_t(\bar{x})) - 1(X_i \in \mathcal{X}_t(x))]\,Y_{it}\big/(nD).$$



The estimates of the bounds are given by

$$\hat{\mu}_\ell = \hat{\delta} + B_\ell \hat{P}(\bar{x}) - B_u \hat{P}(x), \qquad \hat{\mu}_u = \hat{\delta} + B_u \hat{P}(\bar{x}) - B_\ell \hat{P}(x).$$

Inference using these bounds can be carried out analogously to the strictly exogenous case.

An important example is binary $Y_{it} \in \{0,1\}$ where $X_{it} = Y_{i,t-1}$. Here $B_u = 1$ and $B_\ell = 0$, so the marginal effect is

$$\mu_0 = \int [\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 1, \alpha) - \Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, \alpha)]\,Q(d\alpha),$$

i.e., the effect of the lagged $Y_{i,t-1}$ on the probability that $Y_{it} = 1$, holding $\alpha_i$ constant, averaged over $\alpha_i$. In this sense the bounds provide an approximate solution to the problem considered by Feller (1943) and Heckman (1981) of evaluating duration dependence in the presence of unobserved heterogeneity. In this example the bounds estimates are

$$\hat{\mu}_\ell = \hat{\delta} - \hat{P}(0), \qquad \hat{\mu}_u = \hat{\delta} + \hat{P}(1). \quad (10)$$

The width of the bounds is $\hat{P}(0) + \hat{P}(1)$, so although these bounds may not be very informative in short panels, in long panels, where $\hat{P}(0) + \hat{P}(1)$ is small, they will be.

Theorems 5 and 6 on convergence of the bounds as $T$ grows apply to $\mu_\ell$ and $\mu_u$ from equation (9), since the bounds have a similar structure and the convergence results explicitly allow for dependence over time of $X_{it}$ conditional on $\alpha_i$. For example, for $Y_{it} \in \{0,1\}$ and $X_{it} = Y_{i,t-1}$, equation (7) implies that $Y_{it}$ is Markov conditional on $\alpha_i$ with $J = 1$. Theorem 5 then shows that the bounds converge to the marginal effect as $T$ grows if $0 < \Pr(Y_{it} = 1 \mid \alpha_i) < 1$ with probability one. Theorem 6 also gives the rate at which the bounds converge, e.g., that will be exponential if $\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 1, \alpha_i)$ and $\Pr(Y_{it} = 0 \mid Y_{i,t-1} = 0, \alpha_i)$ are bounded away from one.

It appears that, unlike the strictly exogenous case, there is only one way to estimate the identified component $\delta_0$. In this sense the estimators given here for the bounds should be asymptotically efficient, so there should be no gain from trying to account for heteroskedasticity and autocorrelation over time. Also, it does not appear possible to obtain tighter bounds when monotonicity holds, because the partition is different for $\bar{x}$ and $x$.

    5 Semiparametric Multinomial Choice

The bounds for marginal effects derived in the previous sections did not use any functional form restrictions on the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$. If this distribution is restricted, one may be able to tighten the bounds. To illustrate, we consider a semiparametric multinomial choice model where the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$ is specified and the conditional distribution of $\alpha_i$ given $X_i$ is unknown.



We assume that the vector $Y_i$ of outcome variables can take $J$ possible values $Y^1,\ldots,Y^J$. As before, we also assume that $X_i$ has a discrete distribution and can take $K$ possible values $X^1,\ldots,X^K$. Suppose that the conditional probability of $Y_i$ given $(X_i, \alpha_i)$ is

$$\Pr(Y_i = Y^j \mid X_i = X^k, \alpha_i) = \mathcal{L}(Y^j \mid X^k, \alpha_i, \theta)$$

for some finite dimensional $\theta$ and some known function $\mathcal{L}$. Let $Q_k$ denote the unknown conditional distribution of $\alpha_i$ given $X_i = X^k$. Let $P_{jk}$ denote the conditional probability of $Y_i = Y^j$ given $X_i = X^k$. We then have

$$P_{jk} = \int \mathcal{L}(Y^j \mid X^k, \theta, \alpha)\,Q_k(d\alpha), \quad (j = 1,\ldots,J;\ k = 1,\ldots,K), \quad (11)$$

where $P_{jk}$ is identified from the data and the right hand side are the probabilities predicted by the model. This model is semiparametric in having a likelihood $\mathcal{L}$ that is parametric and conditional distributions $Q_k$ for the individual effect that are completely unspecified. In general the parameters of the model may be set identified, so the previous equation is satisfied by a set of values $B$ that includes $\theta$ and a set of distributions for $Q_k$ that includes $Q_k$ for $k = 1,\ldots,K$. We discuss identification of model parameters in more detail in the next section. Here we will focus on bounds for the marginal effect when this model holds.

For example, consider a binary choice model where Y_it ∈ {0, 1}, Y_i1, …, Y_iT are independent conditional on (X_i, α_i), and

Pr(Y_it = 1 | X_i, α_i, θ) = F(X'_it θ + α_i)        (12)

for a known CDF F(·). Then each Y^j consists of a T × 1 vector of zeros and ones, so there are J = 2^T possible values. Also,

L(Y_i | X_i, α_i, θ) = ∏_{t=1}^{T} F(X'_it θ + α_i)^{Y_it} [1 − F(X'_it θ + α_i)]^{1−Y_it}.

The observed conditional probabilities then satisfy

P_jk = ∫ ∏_{t=1}^{T} F(X^k'_t θ* + α)^{Y^j_t} [1 − F(X^k'_t θ* + α)]^{1−Y^j_t} Q*_k(dα),  (j = 1, …, 2^T; k = 1, …, K).
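To make the mapping from (θ*, Q*_k) to these choice probabilities concrete, here is a small sketch (our own illustration, not part of the paper's algorithms) that enumerates all 2^T outcome sequences for a logit CDF and a hypothetical two-point distribution for α:

```python
import itertools
import numpy as np

def logit_cdf(z):
    # logistic CDF, playing the role of F(.)
    return 1.0 / (1.0 + np.exp(-z))

def choice_probs(x_seq, theta, alpha_pts, alpha_wts):
    # P_jk for every outcome sequence Y^j given the covariate path X^k = x_seq:
    # integrate the product of Bernoulli likelihoods against a discrete Q_k.
    probs = {}
    for y_seq in itertools.product([0, 1], repeat=len(x_seq)):
        p = 0.0
        for a, w in zip(alpha_pts, alpha_wts):
            like = 1.0
            for y, x in zip(y_seq, x_seq):
                f = logit_cdf(x * theta + a)
                like *= f if y == 1 else 1.0 - f
            p += w * like
        probs[y_seq] = p
    return probs

# hypothetical values: T = 3, theta = 1, Q_k uniform on {-1, +1}
probs = choice_probs([0, 1, 1], theta=1.0, alpha_pts=[-1.0, 1.0], alpha_wts=[0.5, 0.5])
```

A quick sanity check on any such implementation is that the 2^T probabilities exhaust the sample space and so sum to one.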

As discussed above, for the binary choice model the marginal effect of a change in X_it from x̃ to x̄, conditional on X_i = X^k, is

μ_k = ∫ D^{-1} [F(x̄'θ* + α) − F(x̃'θ* + α)] Q*_k(dα),        (13)

for a distance D. This marginal effect is generally not identified. Bounds can be constructed using the results of Section 3 with B_ℓ = 0 and B_u = 1, since m(x, α) = F(x'θ* + α) ∈ [0, 1].


Moreover, in this model the sign of Δ(α) = D^{-1}[F(x̄'θ* + α) − F(x̃'θ* + α)] does not change with α, so we can apply the result in Lemma 3 to reduce the size of the bounds. These bounds, however, are not tight because they do not fully exploit the structure of the model. Sharper bounds are given by

μ̲_k = min_{θ∈B*, Q_k} ∫ D^{-1}[F(x̄'θ + α) − F(x̃'θ + α)] Q_k(dα)  s.t.  P_jk = ∫ L(Y^j | X^k, α, θ) Q_k(dα) ∀j,        (14)

and

μ̄_k = max_{θ∈B*, Q_k} ∫ D^{-1}[F(x̄'θ + α) − F(x̃'θ + α)] Q_k(dα)  s.t.  P_jk = ∫ L(Y^j | X^k, α, θ) Q_k(dα) ∀j.        (15)

In the next sections we discuss how these bounds can be computed and estimated. Here we consider how fast the bounds shrink as T grows. First, note that since this model is a special case of (more restricted than) the conditional mean model, the bounds here will be sharper than the bounds previously given, and will therefore converge at least as fast. Imposing the structure here does improve convergence rates: in some cases one can obtain fast rates without any restrictions on the joint distribution of X_i and α_i.

We consider the logit model carefully and leave other models to future work. The logit model is simpler than others because θ* is point identified; in other cases one would need to account for the bounds for θ*. To keep the notation simple we focus on the binary X case, X_it ∈ {0, 1}, where x̄ = 1 and x̃ = 0. We find that the bounds shrink at rate T^{-r} for any finite r, without any restriction on the joint distribution of X_i and α_i.

Theorem 8: For k = 1 or k = K and for any r > 0, as T → ∞, μ̄_k − μ̲_k = O(T^{-r}).

Fixed effects maximum likelihood estimators (FEMLEs) are a common approach to estimating model parameters and marginal effects in multinomial panel models. Here we compare the probability limit of these estimators to the identified sets for the corresponding parameters. The FEMLE treats the realizations of the individual effects as parameters to be estimated. The corresponding population problem can be expressed as

θ̄ = argmax_θ Σ_{k=1}^{K} P_k Σ_{j=1}^{J} P_jk log L(Y^j | X^k, ᾱ_jk(θ), θ),        (16)

where

ᾱ_jk(θ) = argmax_α log L(Y^j | X^k, α, θ),  ∀j, k.        (17)


Here, we first concentrate out the support points of the conditional distributions of α and then solve for the parameter θ. Fixed effects estimation therefore imposes that the estimate of Q_k has no more than J points of support. The distributions implicitly estimated by the FEMLE take the form

Q̄_k(α) = P_jk, for α = ᾱ_jk(θ̄);  0, otherwise.        (18)

The following example illustrates this point using a simple two-period model. Consider a two-period binary choice model with a binary regressor and a strictly increasing and symmetric CDF, i.e., F(−x) = 1 − F(x). In this case the estimands of the fixed effects estimators are

ᾱ_jk(θ) = −∞, if Y^j = (0, 0);
         −(X^k_1 + X^k_2)'θ/2, if Y^j = (1, 0) or Y^j = (0, 1);
         +∞, if Y^j = (1, 1);        (19)

and the corresponding distribution for α has the form

Q̄_k(α) = Pr{Y = (0, 0) | X^k}, if α = −∞;
         Pr{Y = (1, 0) | X^k} + Pr{Y = (0, 1) | X^k}, if α = −(X^k_1 + X^k_2)'θ̄/2;
         Pr{Y = (1, 1) | X^k}, if α = +∞.        (20)

This formulation of the problem is convenient for analyzing the properties of nonlinear fixed effects estimators of marginal effects. For example, the estimator of the marginal effect μ_k takes the form

μ̄_k(θ̄) = Σ_α D^{-1} [F(x̄'θ̄ + α) − F(x̃'θ̄ + α)] Q̄_k(α).        (21)
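As an illustration of (21) in the two-period logit case, the sketch below (hypothetical inputs; p10 and p01 stand for Pr{Y = (1,0) | X^k} and Pr{Y = (0,1) | X^k}) uses the fact that the support points at ±∞ contribute nothing, since the difference F(x̄'θ̄ + α) − F(x̃'θ̄ + α) vanishes there:

```python
import numpy as np

def logit_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def fe_marginal_effect(p10, p01, xk, theta_bar, x_bar=1.0, x_tilde=0.0):
    # Two-period fixed-effects plug-in: the alpha mass sits at -inf,
    # -(x_k1 + x_k2) * theta_bar / 2, and +inf; at the infinite points the
    # difference F(.) - F(.) is zero, so only the switchers' mass contributes.
    a_mid = -(xk[0] + xk[1]) * theta_bar / 2.0
    diff = logit_cdf(x_bar * theta_bar + a_mid) - logit_cdf(x_tilde * theta_bar + a_mid)
    return diff * (p10 + p01)
```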

The average of these estimates across individuals with identified effects is consistent for the identified effect μ_I when X is binary. This result is shown here analytically for the two-period case and through numerical examples for T ≥ 3.

Theorem 9: If F'(x) > 0, F(−x) = 1 − F(x), and Σ_{k=2}^{K−1} P_k > 0, then, for P̃_k = P_k / Σ_{k=2}^{K−1} P_k,

μ̄_I = Σ_{k=2}^{K−1} P̃_k μ̄_k(θ̄) = μ_I.

For effects that are not identified, the nonlinear fixed effects estimators are usually biased toward zero, which introduces a bias in the same direction in the fixed effects estimator of the average effect μ_0 whenever there are individuals with unidentified effects in the population. To see this, consider a logit


model with a binary regressor, X^k = (0, 0), x̃ = 0 and x̄ = 1. Using that θ̄ = 2θ* (Andersen, 1973) and F'(x) = F(x)[1 − F(x)] ≤ 1/4, we have

μ̄_k(θ̄) = [F(θ̄) − F(0)] [P{Y = (1, 0) | X^k} + P{Y = (0, 1) | X^k}] ≤ (θ̄/2) ∫ F(α)[1 − F(α)] Q_k(dα) = E[θ* F'(x̃'θ* + α) | X = X^k] ≤ |μ_k|.

This conjecture is further explored numerically in the next section.

6 Characterization and Computation of Population Bounds

6.1 Identification Sets and Extremal Distributions

We begin our discussion of calculating bounds by considering bounds for the parameter θ. Let L_jk(θ, Q_k) := ∫ L(Y^j | X^k, α, θ) Q_k(dα) and Q := (Q_1, …, Q_K). For the subsequent inferential analysis, it is convenient to consider a quadratic loss function

T(θ, Q; P) = Σ_{j,k} ω_jk(P) (P_jk − L_jk(θ, Q_k))²,        (22)

where ω_jk(P) are positive weights. By the definition of the model in (11), we can see that (θ*, Q*) is such that

T(θ, Q; P) ≥ T(θ*, Q*; P) = 0,

for every (θ, Q). For T(θ; P) := inf_Q T(θ, Q; P), this implies that

T(θ; P) ≥ T(θ*; P) = 0,

for every θ. Let B* be the set of θ's that minimize T(θ; P), i.e.,

B* := {θ : T(θ; P) = 0}.

Then we can see that θ* ∈ B*. In other words, θ* is set identified by the set B*.

It follows from the next lemma that one needs only to search over discrete distributions for Q to find B*.

Lemma 10: If the support C of α_i is compact and L(Y^j | X^k, α, θ) is continuous in α for each θ, j, and k, then, for each θ ∈ B* and k, a solution to

Q*_k ∈ argmin_{Q_k} Σ_{j=1}^{J} ω_jk(P) (P_jk − L_jk(θ, Q_k))²

exists that is a discrete distribution with at most J points of support, and L_jk(θ, Q*_k) = P_jk, ∀j, k.


Another important result is that the bounds for marginal effects can also be found by searching over discrete distributions with few points of support. We focus on the upper bound μ̄_k defined in (15); an analogous result holds for the lower bound μ̲_k in (14).

Lemma 11: If the support C of α_i is compact and L(Y^j | X^k, α, θ) is continuous in α for each θ, j, and k, then, for each θ ∈ B* and k, a solution to

Q̄_k ∈ argmax_{Q_k} ∫ D^{-1}[F(x̄'θ + α) − F(x̃'θ + α)] Q_k(dα)  s.t.  L_jk(θ, Q_k) = P_jk, ∀j

can be obtained from a discrete distribution with at most J points of support.

6.2 Numerical Examples

We carry out some numerical calculations to illustrate and complement the previous analytical results. We use the following binary choice model

Y_it = 1{X_it θ + α_i + ε_it ≥ 0},        (23)

with ε_it i.i.d. over t, normal or logistic with zero mean and unit variance. The explanatory variable X_it is binary and i.i.d. over t with p_X = Pr{X_it = 1} = 0.5. The unobserved individual effect α_i is correlated with the explanatory variable for each individual. In particular, we generate this effect as a mixture of a random component and the standardized individual sample mean of the regressor. The random part is independent of the regressors and follows a discretized standard normal distribution, as in Honoré and Tamer (2006). Thus, we have

α_i = α_1i + α_2i,

where, for Φ the standard normal CDF and the grid ā_m = −3.0, −2.8, …, 2.8, 3.0,

Pr{α_1i = ā_m} = Φ((ā_{m+1} + ā_m)/2), for ā_m = −3.0;
               Φ((ā_{m+1} + ā_m)/2) − Φ((ā_m + ā_{m−1})/2), for ā_m = −2.8, −2.6, …, 2.8;
               1 − Φ((ā_m + ā_{m−1})/2), for ā_m = 3.0;

and α_2i = √T (X̄_i − p_X)/√(p_X(1 − p_X)), with X̄_i = Σ_{t=1}^{T} X_it / T.
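This design can be simulated directly; the sketch below (using numpy and scipy, with the grid and standardization described above) is one way to generate the individual effects:

```python
import numpy as np
from scipy.stats import norm

def discretized_normal(grid):
    # midpoint-rule discretization of N(0, 1): interior points get the CDF mass
    # between adjacent midpoints; the two end points absorb the tails
    mids = (grid[1:] + grid[:-1]) / 2.0
    cdf = np.concatenate(([0.0], norm.cdf(mids), [1.0]))
    return np.diff(cdf)

def simulate_effects(n, T, p_x=0.5, seed=0):
    # alpha_i = discretized-normal draw + standardized sample mean of X_i
    rng = np.random.default_rng(seed)
    grid = np.round(np.arange(-3.0, 3.2, 0.2), 1)  # -3.0, -2.8, ..., 3.0
    a1 = rng.choice(grid, size=n, p=discretized_normal(grid))
    X = rng.binomial(1, p_x, size=(n, T))
    a2 = np.sqrt(T) * (X.mean(axis=1) - p_x) / np.sqrt(p_x * (1 - p_x))
    return X, a1 + a2
```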

Identified sets for parameters and marginal effects are calculated for panels with 2, 3, and 4 periods, based on the conditional mean model of Section 2 and on semiparametric logit and probit models. For the logit and probit models the sets are obtained using a linear programming algorithm for discrete regressors, as in Honoré and Tamer (2006). Thus, for the parameter θ we have that B* = {θ : L(θ) = 0}, where

L(θ) = min_{w_k, v_jk, π_km} Σ_{k=1}^{K} w_k + Σ_{j=1}^{J} Σ_{k=1}^{K} v_jk        (24)

s.t. v_jk + Σ_{m=1}^{M} π_km L(Y^j | X^k, ᾱ_m, θ) = P_jk ∀j, k;  w_k + Σ_{m=1}^{M} π_km = 1 ∀k;  v_jk ≥ 0, w_k ≥ 0, π_km ≥ 0 ∀j, k, m.

For marginal effects (see also Chernozhukov, Hahn, and Newey, 2004), we solve

μ̄_k / μ̲_k = max / min_{π_km, θ∈B*} Σ_{m=1}^{M} π_km [F(x̄'θ + ᾱ_m) − F(x̃'θ + ᾱ_m)]        (25)

s.t. Σ_{m=1}^{M} π_km L(Y^j | X^k, ᾱ_m, θ) = P_jk ∀j;  Σ_{m=1}^{M} π_km = 1;  π_km ≥ 0 ∀m.
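For a fixed θ, each program in (25) is an ordinary linear program in the weights π_km, so it can be solved with an off-the-shelf LP solver. The sketch below (a hypothetical logit example with an assumed support grid for α) uses scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def logit_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def me_bounds(L, p_k, delta):
    # L:     (J x M) matrix of L(Y^j | X^k, alpha_m, theta) at a fixed theta
    # p_k:   (J,) observed choice probabilities P_jk for cell k
    # delta: (M,) effect F(xbar*theta + alpha_m) - F(xtilde*theta + alpha_m)
    # Feasible weights pi >= 0 must reproduce p_k and sum to one.
    A_eq = np.vstack([L, np.ones(L.shape[1])])
    b_eq = np.concatenate([p_k, [1.0]])
    lo = linprog(delta, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    hi = linprog(-delta, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return lo.fun, -hi.fun
```

In this discrete setting, when the constraint matrix has full column rank the two bounds collapse to a point; otherwise their gap is the identification gap for the cell.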

The identified sets are compared to the probability limits of linear and nonlinear fixed effects estimators. Figure 2 shows identified sets for the slope coefficient in the logit model. The figures agree with the well-known result that the model parameter is point identified when T ≥ 2 (e.g., Andersen, 1973). The fixed effects estimator is inconsistent and has a probability limit that is biased away from zero. For example, for T = 2 it coincides with the value 2θ* obtained by Andersen (1973). For T > 2, the proportionality θ̄ = cθ* for some constant c breaks down.

Identified sets for marginal effects are plotted in Figures 3 to 7, together with the probability limits of fixed effects maximum likelihood estimators (Figures 4 to 6) and linear probability model estimators (Figure 7).¹ Figure 3 shows identified sets based on the general conditional mean model. The bounds of these sets are obtained using the general bounds (G-bound) for binary regressors in (6), and imposing the monotonicity restriction on Δ(α) in Lemma 3 (GM-bound). In this example the monotonicity restriction has important identification content in reducing the size of the bounds.

Figures 4 to 6 show that marginal effects are point identified for individuals with switches in the value of the regressor, and nonlinear fixed effects estimators are consistent for these effects. This numerical finding suggests that the consistency result for nonlinear fixed effects estimators extends to more than two periods. Unless θ* = 0, marginal effects for individuals without switches in the regressor are not point identified, which also precludes point identification of the average effect. Nonlinear fixed effects estimators are biased toward zero for the unidentified effects, and have probability limits that usually lie outside of the identified set. However, both

¹ We consider the version of the linear probability model that allows for individual-specific slopes in addition to the fixed effects.


the size of the identified sets and the asymptotic biases of these estimators shrink very fast with the number of time periods. In Figure 7 we see that linear probability model estimators have probability limits that usually fall outside the identified set for the marginal effect.

For the probit model, Figure 8 shows that the model parameter is not point identified, but the size of the identified set shrinks very fast with the number of time periods. The identified sets and limits of fixed effects estimators in Figures 9 to 13 are analogous to the results for the logit model.

    7 Estimation

    7.1 Minimum Distance Estimator

In multinomial models with discrete regressors the complete description of the DGP is provided by the parameter vector

γ = (γ_Y', γ_X')',  γ_Y = (γ_jk, j = 1, …, J, k = 1, …, K)',  γ_X = (γ_k, k = 1, …, K)',

where γ_jk = Pr(Y = Y^j | X = X^k) and γ_k = Pr(X = X^k). We denote the true value of this parameter vector by (P', P_X')', and the nonparametric empirical estimates by (P̂', P̂_X')'. As is common in regression analysis, we condition on the observed distribution of X by setting the true value of the probabilities of X to the empirical ones, that is, γ_X = P_X = P̂_X. Having fixed the distribution of X, the DGP is completely described by the conditional choice

probabilities γ_Y.

Our minimum distance estimator is the solution to the following quadratic problem:

B̂_n = {θ ∈ B : T̂(θ; P̂) ≤ min_{θ'∈B} T̂(θ'; P̂) + ε_n},

where B is the parameter space, ε_n is a positive cut-off parameter that shrinks to zero with the sample size, as in Chernozhukov, Hong, and Tamer (2007), and

T̂(θ; P̂) = min_{Q=(Q_1,…,Q_K)∈Q} Σ_{j,k} ω_jk(P̂) (P̂_jk − ∫_C L(Y^j | X^k, α, θ) Q_k(dα))²,

where Q is the set of conditional distributions for α with J points of support for each covariate value index k; that is, for S the unit simplex in R^J and δ_{α_km} the Dirac delta function at α_km,

Q = {Q := (Q_1, …, Q_K) : Q_k(dα) = Σ_{m=1}^{J} π_km δ_{α_km}(α) dα, (α_k1, …, α_kJ) ∈ C^J, (π_k1, …, π_kJ) ∈ S, ∀k}.


Here we make use of Lemma 10, which tells us that we can obtain a minimizing solution for Q_k as a discrete distribution with at most J points of support for each k. Alternatively, we can write more explicitly

T̂(θ; P̂) = min_{α_k=(α_k1,…,α_kJ)∈C^J, π_k=(π_k1,…,π_kJ)∈S, ∀k} Σ_{j,k} ω_jk(P̂) (P̂_jk − Σ_{m=1}^{J} π_km L(Y^j | X^k, α_km, θ))².        (26)
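A minimal sketch of the inner minimization in (26) for a single cell k, under the binary-choice logit of (12): the simplex constraint on the weights is handled with a softmax reparametrization so that a generic unconstrained optimizer can be used. The starting values and tolerances are hypothetical choices, not the paper's algorithm (which is given in the appendix).

```python
import numpy as np
from scipy.optimize import minimize

def logit_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def seq_likelihood(y_seq, x_seq, alpha, theta):
    # L(Y^j | X^k, alpha, theta) for the binary-choice logit of (12)
    f = logit_cdf(np.asarray(x_seq, float) * theta + alpha)
    return float(np.prod(np.where(np.asarray(y_seq) == 1, f, 1.0 - f)))

def that_cell(z, theta, p_hat, y_seqs, x_seq, weights):
    # z packs J support points followed by J softmax scores for the weights
    J = len(p_hat)
    alphas, scores = z[:J], z[J:]
    pi = np.exp(scores - scores.max())
    pi /= pi.sum()
    fitted = np.array([sum(pi[m] * seq_likelihood(y, x_seq, alphas[m], theta)
                           for m in range(J)) for y in y_seqs])
    return float(np.sum(weights * (p_hat - fitted) ** 2))

def minimize_cell(theta, p_hat, y_seqs, x_seq, weights, z0):
    res = minimize(that_cell, z0, args=(theta, p_hat, y_seqs, x_seq, weights),
                   method="Nelder-Mead", options={"maxiter": 5000})
    return res.fun
```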

In the appendix we give a computational algorithm to solve this problem.

For estimation and inference it is important to allow for the possibility that the postulated model is not perfectly specified, but still provides a good approximation to the true DGP. In this case, when the conditional choice probabilities are misspecified, B̂_n estimates the identified set for the parameter of the best approximating model to the true DGP with respect to a chi-square distance. This model is obtained by projecting the true DGP P onto Γ, the space of conditional choice probabilities that are compatible with the model. In particular, the projection P* corresponds to the solution of the minimum distance problem:

P* = γ*(P) ∈ argmin_{γ_Y∈Γ} W(γ_Y, P),  W(γ_Y, P) = Σ_{j,k} w_jk(P) (P_jk − γ_jk)²,        (27)

where

Γ = {γ_Y : γ_jk = Σ_{m=1}^{J} π_km L(Y^j | X^k, α_km, θ), (α_k1, …, α_kJ) ∈ C^J, (π_k1, …, π_kJ) ∈ S, θ ∈ B, ∀(j, k)}.

To simplify the exposition, we assume throughout that P* is unique. Of course, when P ∈ Γ, then P* = P and the assumption holds trivially.² The identified set for the parameter of the best approximating model is

B* = {θ ∈ B : ∃Q ∈ Q s.t. ∫ L(Y^j | X^k, α, θ) dQ_k(α) = P*_jk, ∀(j, k)},

i.e., the values of the parameter that are compatible with the projected DGP P* = (P*_jk, j = 1, …, J, k = 1, …, K). Under correct specification of the semiparametric model, P* = P and this set coincides with the identified set B* of the previous section.

We shall use the following assumptions.

² Otherwise, the assumption can be justified using a genericity argument similar to that presented in Newey (1986); see the Appendix. For non-generic values, we can simply select one element of the projection using an additional complete ordering criterion and work with the resulting approximating model. In practice, we never encountered a non-generic value.


Assumption 1: (i) The function F defined in (12) is continuous in (α, θ), so that the conditional choice probabilities L_jk(θ, α) = L(Y^j | X^k, α, θ) are also continuous for all (j, k); (ii) θ* ∈ B for some compact set B; (iii) α_i has support contained in a compact set C; and (iv) the weights ω_jk(P) are continuous in P at the true P, and 0 < ω_jk(P) < ∞ for all (j, k).

Assumption 1(i) holds for commonly used semiparametric models such as the logit and probit models. Condition 1(iv) on the weights is satisfied by the chi-square weights ω_jk(P) = P_k / P_jk if P_jk > 0, ∀(j, k).

In some results, we also employ the following assumption.

Assumption 2: Every θ ∈ B* is regular at P in the sense that, for any sequence γ_n → P, there exists a sequence θ_n ∈ argmin_{θ'∈B} T(θ', γ_n) such that θ_n → θ.

In a variety of cases the assumption of regularity appears to be a good one. First of all, the assumption holds under point identification, as in the logit model, by the standard consistency argument for maximum likelihood and minimum distance estimators. Second, for the probit and other similar models, we can argue that this assumption can also be expected to hold when the true distribution of the individual effect α_i is absolutely continuous, with the exception perhaps of very non-regular parameter spaces and non-generic situations.

To explain the last point, it is convenient to consider a correctly specified model for simplicity. Let the vector of model conditional choice probabilities for (Y^1, …, Y^J) be L_k(θ, α) ≡ (L_1k(θ, α), …, L_Jk(θ, α))'. Let ℒ_k(θ) ≡ {L_k(θ, α) : α ∈ C} and let M_k(θ) be the convex hull of ℒ_k(θ). In the case of the probit the specification is non-trivial in the sense that M_k(θ) possesses a non-empty interior with respect to the J-dimensional simplex. For every θ ∈ B* and some Q_k, we have that L_jk(θ, Q_k) = P_jk for all (j, k), that is, (P_1k, …, P_Jk) ∈ M_k(θ) for all k. Moreover, under absolute continuity of the true Q* we must have (P_1k, …, P_Jk) ∈ interior M_k(θ_0) for all k, where θ_0 ∈ B* is the true value of θ. Next, for any θ in the neighborhood of θ_0, we must have (P_1k, …, P_Jk) ∈ interior M_k(θ) for all k, and so on. In order for a point θ to be located on the boundary of B* we must have that (P_1k, …, P_Jk) lies on the boundary of M_k(θ) for some k. Thus, if the identified set has a dense interior, which we verified numerically in a variety of examples for the probit model, then each point in the identified set must be regular. Indeed, take first a point θ in the interior of B*. Then, for any sequence γ_n → P, we must have (γ_n,1k, …, γ_n,Jk) ∈ M_k(θ) for all k for large n, so that T(θ; γ_n) = 0 for large n. Thus, there is a sequence of points θ_n ∈ argmin_{θ'∈B} T(θ'; γ_n) converging to θ. Now take a point θ on the boundary of B*; then for each ε > 0, there is a θ' in the interior such that ‖θ − θ'‖ ≤ ε/2 and such that there is a sequence of points θ'_n ∈ argmin_{θ''∈B} T(θ''; γ_n) and a finite number n(ε) such that, for all n ≥ n(ε), ‖θ'_n − θ'‖ ≤ ε/2. Thus, for all n ≥ n(ε), ‖θ'_n − θ‖ ≤ ε. Since ε > 0 is arbitrary, it follows that θ is regular.


We can now give a consistency result for the quadratic estimator.

Theorem 12: If Assumption 1 holds and ε_n ∝ log n / n, then

d_H(B̂_n, B*) = o_P(1),

where d_H is the Hausdorff distance between sets,

d_H(B̂_n, B*) = max{ sup_{θ̂∈B̂_n} inf_{θ∈B*} |θ̂ − θ|, sup_{θ∈B*} inf_{θ̂∈B̂_n} |θ̂ − θ| }.

Under Assumption 2 the result holds for ε_n = 0.
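When B̂_n and B* are approximated by finite grids of points, the Hausdorff distance in Theorem 12 can be evaluated directly; a small sketch for a scalar parameter:

```python
import numpy as np

def hausdorff(A, B):
    # max of the two directed distances: sup_a inf_b |a-b| and sup_b inf_a |a-b|
    d = np.abs(np.asarray(A, float)[:, None] - np.asarray(B, float)[None, :])
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```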

Moreover, under Assumption 1 the model-predicted probabilities are consistent: for any θ_n ∈ B̂_n and each j and k,

P̂*_jk = Σ_{m=1}^{J} π_km(θ_n) L(Y^j | X^k, α_km(θ_n), θ_n) →_p P*_jk,        (28)

where {α_km(θ_n), π_km(θ_n), ∀k, m} is a solution to the minimum distance problem (26) at θ = θ_n, for any ε_n → 0, and where we assume that P* is unique.

7.2 Marginal Effects

We next consider the estimation of marginal effects, which is of prime interest here. An immediate issue that arises is that we cannot directly use the solution to the minimum distance problem to estimate the marginal effects. Indeed, the constraints of the linear programming problems for these effects in (25) may not hold for any θ ∈ B̂_n when P is replaced by P̂, due to sampling variation or under misspecification. In order to resolve this infeasibility issue, we replace the nonparametric estimates P̂_jk by the probabilities predicted by the model, P̂*_jk, as defined in (28), and we re-target our estimands to the marginal effects defined in the best approximating model.

To describe the estimator of the bounds for the marginal effects, it is convenient to introduce some notation. Let

μ̲_k(θ, γ_Y) = min_{α_k, π_k} D^{-1} Σ_{m=1}^{J} [F(x̄'θ + α_km) − F(x̃'θ + α_km)] π_km
s.t. γ*_jk = Σ_{m=1}^{J} L(Y^j | X^k, α_km, θ) π_km ∀j,  α_k = (α_k1, …, α_kJ) ∈ C^J,  π_k = (π_k1, …, π_kJ) ∈ S,        (29)

and

μ̄_k(θ, γ_Y) = max_{α_k, π_k} D^{-1} Σ_{m=1}^{J} [F(x̄'θ + α_km) − F(x̃'θ + α_km)] π_km
s.t. γ*_jk = Σ_{m=1}^{J} L(Y^j | X^k, α_km, θ) π_km ∀j,  α_k = (α_k1, …, α_kJ) ∈ C^J,  π_k = (π_k1, …, π_kJ) ∈ S,        (30)


where γ*_Y = (γ*_jk, j = 1, …, J, k = 1, …, K) denotes the projection of γ_Y onto Γ, i.e., γ*_Y = γ*(γ_Y) as defined in (27). Thus, the lower and upper bounds on the true marginal effects of the best approximating model take the form

μ̲*_k = min_{θ∈B*} μ̲_k(θ, P),  μ̄*_k = max_{θ∈B*} μ̄_k(θ, P).

Under correct specification, these correspond to the lower and upper bounds on the marginal effects in (14) and (15). We estimate the bounds by

μ̲̂_k = min_{θ∈B̂_n} μ̲_k(θ, P̂),  μ̄̂_k = max_{θ∈B̂_n} μ̄_k(θ, P̂).

Theorem 13: If Assumption 1 is satisfied and ε_n ∝ log n / n, then

μ̲̂_k = μ̲*_k + o_p(1),  μ̄̂_k = μ̄*_k + o_p(1).

Under Assumption 2 the result holds for ε_n = 0.

8 Inference

8.1 Modified Projection Method

The following method projects a confidence region for the conditional choice probabilities onto a simultaneous confidence region for all possible marginal effects and other structural parameters. If a single marginal effect is of interest, then this approach is conservative; if all (or many) marginal effects are of interest, then this approach is sharp (or close to sharp). In the next section, we present an approach that appears to be sharp, at least in large samples, when a particular single marginal effect is of interest.

It is convenient to describe the approach in two stages.

Stage 1. The nonparametric space N of conditional choice probabilities is the product of K simplex sets S of dimension J, that is, N = S^K. Thus we can begin by constructing a confidence region for the true choice probabilities P by collecting all probabilities γ_Y ∈ N that pass a goodness-of-fit test:

CR_{1−α}(P) = {γ_Y ∈ N : W(γ_Y, P̂) ≤ c_{1−α}(χ²_{K(J−1)})},

where c_{1−α}(χ²_{K(J−1)}) is the (1 − α)-quantile of the χ²_{K(J−1)} distribution and W is the goodness-of-fit statistic

W(γ_Y, P̂) = n Σ_{j,k} P̂_k (P̂_jk − γ_jk)² / γ_jk.
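The Stage 1 membership test is straightforward to compute; a sketch (with hypothetical inputs) using scipy's chi-square quantile:

```python
import numpy as np
from scipy.stats import chi2

def gof_stat(n, P_hat, pk_hat, gamma):
    # W(gamma; P-hat) = n * sum_{j,k} P_k (P_jk - gamma_jk)^2 / gamma_jk
    return float(n * np.sum(pk_hat[None, :] * (P_hat - gamma) ** 2 / gamma))

def in_confidence_region(n, P_hat, pk_hat, gamma, alpha=0.05):
    # gamma is in CR_{1-alpha}(P) when W is below the chi2_{K(J-1)} critical value
    J, K = P_hat.shape
    return gof_stat(n, P_hat, pk_hat, gamma) <= chi2.ppf(1 - alpha, K * (J - 1))
```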


Stage 2. To construct confidence regions for marginal effects and any other structural parameters, we project each γ_Y ∈ CR_{1−α}(P) onto Γ, the space of conditional choice probabilities that are compatible with the model. We obtain this projection γ*(γ_Y) by solving the minimum distance problem

γ*(γ_Y) ∈ argmin_{γ'_Y∈Γ} W(γ'_Y, γ_Y),  W(γ'_Y, γ_Y) = n Σ_{j,k} P̂_k (γ_jk − γ'_jk)² / γ'_jk.        (31)

The confidence regions are then constructed from the projections of all the choice probabilities in CR_{1−α}(P). For the identified set of the model parameter, for example, for each γ_Y ∈ CR_{1−α}(P) we solve

B*(γ_Y) = {θ ∈ B : ∃Q ∈ Q s.t. ∫ L(Y^j | X^k, α, θ) dQ_k(α) = γ*_jk, ∀(j, k), γ*_Y = γ*(γ_Y)}.        (32)

Denote the resulting confidence region by

CR_{1−α}(B*) = {B*(γ_Y) : γ_Y ∈ CR_{1−α}(P)}.

We may interpret this set as a confidence region for the set B* collecting all values of θ that are compatible with the best approximating model P*. Under correct specification, this is just the identified set B*.

If we are interested in bounds on marginal effects, for each γ_Y ∈ CR_{1−α}(P) we compute

μ̲_k(γ_Y) = min_{θ∈B*(γ_Y)} μ̲_k(θ, γ_Y),  μ̄_k(γ_Y) = max_{θ∈B*(γ_Y)} μ̄_k(θ, γ_Y),  k = 1, …, K.

Denote the resulting confidence regions by

CR_{1−α}[μ̲_k, μ̄_k] = {[μ̲_k(γ_Y), μ̄_k(γ_Y)] : γ_Y ∈ CR_{1−α}(P)}.

These sets are confidence regions for the sets [μ̲*_k, μ̄*_k], where μ̲*_k and μ̄*_k are the lower and upper bounds on the marginal effects induced by any best approximating model in (B*, P*). Under correct specification, they will include the lower and upper bounds on the marginal effects [μ̲_k, μ̄_k] induced by any true model in (B*, P).

In a canonical projection method we would implement the second stage by simply intersecting CR_{1−α}(P) with Γ, but this may give an empty intersection either in finite samples or under misspecification. We avoid this problem by using the projection step instead of the intersection, and also by re-targeting our confidence regions onto the best approximating model. In order to state the validity of our modified projection method in large samples, let N be restricted to the set of γ_Y vectors with all components bounded away from zero by some ε > 0.


Theorem 14: Suppose Assumption 1 holds. Then, for any sequence of true parameter values γ_0 = (P', P_X')',

lim_{n→∞} Pr_{γ_0}{ P ∈ CR_{1−α}(P), B* ⊆ CR_{1−α}(B*), [μ̲*_k, μ̄*_k] ⊆ CR_{1−α}[μ̲_k, μ̄_k] ∀k } = 1 − α.

8.2 Perturbed Bootstrap

In this section we present an approach that appears to be sharper than the projection method, at least in large samples, when a particular single marginal effect is of interest. The estimators for parameters and marginal effects are obtained by nonlinear programming subject to data-dependent constraints that are modified to respect the constraints of the model. The distributions of these highly complex estimators are not tractable, and are also non-regular in the sense that the limit versions of these distributions do not vary with perturbations of the DGP in a continuous fashion. This implies that the usual bootstrap is not consistent. To overcome these difficulties we rely on a variation of the bootstrap, which we call the perturbed bootstrap.

The usual bootstrap computes the critical value, the α-quantile of the distribution of a test statistic, given a consistently estimated data generating process (DGP). If this critical value is not a continuous function of the DGP, the usual bootstrap fails to consistently estimate it. We instead consider the perturbed bootstrap, where we compute a set of critical values generated by suitable perturbations of the estimated DGP and then take the most conservative critical value in the set. If the perturbations cover at least one DGP that gives a more conservative critical value than the true DGP does, then this approach yields a valid inference procedure.

The approach outlined above is most closely related to the Monte Carlo inference approach of Dufour (2006); see also Romano and Wolf (2000) for a finite-sample inference procedure for the mean that has a similar spirit. In the set-identified context, this approach was first applied in the MIT thesis work of Rytchkov (2007); see also Chernozhukov (2007).

Recall that the complete description of the DGP is provided by the parameter vector γ = (γ_Y', γ_X')', where γ_Y = (γ_jk, j = 1, …, J, k = 1, …, K)', γ_X = (γ_k, k = 1, …, K)', γ_jk = Pr(Y = Y^j | X = X^k), and γ_k = Pr(X = X^k). The true value of the parameter vector is (P', P_X')' and the nonparametric empirical estimate is (P̂', P̂_X')'. As before, we condition on the observed distribution of X and thus set γ_X = P_X = P̂_X.

We consider the problem of performing inference on a real parameter ψ. For example, ψ


can be an upper (or lower) bound on the marginal effect μ_k, such as

ψ(γ_Y) = max_{θ∈B*(γ_Y), Q∈Q} ∫ D^{-1}[F(x̄'θ + α) − F(x̃'θ + α)] Q_k(dα)  s.t.  L_jk(θ, Q_k) = γ*_jk ∀j,

where γ*_Y = (γ*_jk, j = 1, …, J, k = 1, …, K) denotes the projection of γ_Y onto the model space, as defined in (31), and B*(γ_Y) is the corresponding projection for the identified set of the parameter θ, defined as in (32). Alternatively, ψ can be an upper (or lower) bound on a scalar functional c'θ of the parameter θ. Then we define

ψ(γ_Y) = max_{θ∈B*(γ_Y)} c'θ.

As before, we project onto the model space in order to address the infeasibility of the constraints defining the parameters of interest under misspecification or sampling error. Under misspecification, we interpret our inference as targeting the parameters of interest in the best approximating model.

In order to perform inference on the true value ψ = ψ(P) of the parameter, we use the statistic

S_n = ψ̂ − ψ,

where ψ̂ = ψ(P̂). Let G_n(s, γ_Y) denote the distribution function of S_n(γ_Y) = ψ̂ − ψ(γ_Y) when the data follow the DGP γ_Y. The goal is to estimate the distribution of the statistic S_n under the

true DGP γ_Y = P, that is, to estimate G_n(s, P).

The method proceeds by constructing a confidence region CR^{1−β}(P) that contains the true DGP P with probability 1 − β, close to one. For efficiency purposes, we also want the confidence region to be an efficient estimator of P, in the sense that, as n → ∞, d_H(CR^{1−β}(P), P) = O_p(n^{-1/2}), where d_H is the Hausdorff distance between sets. Specifically, in our case we use

CR^{1−β}(P) = {γ_Y ∈ N : W(γ_Y, P̂) ≤ c_{1−β}(χ²_{K(J−1)})},

where c_{1−β}(χ²_{K(J−1)}) is the (1 − β)-quantile of the χ²_{K(J−1)} distribution and W is the goodness-of-fit statistic

W(γ_Y, P̂) = n Σ_{j,k} P̂_k (P̂_jk − γ_jk)² / γ_jk.

Then we define estimates of the lower and upper bounds on the quantiles of G_n(s, P) as

G̲_n^{-1}(α, P̂) / Ḡ_n^{-1}(α, P̂) = inf / sup_{γ_Y ∈ CR^{1−β}(P)} G_n^{-1}(α, γ_Y),        (33)

where G_n^{-1}(α, γ_Y) = inf{s : G_n(s, γ_Y) ≥ α} is the α-quantile of the distribution function G_n(s, γ_Y). We then construct a (1 − τ)·100% confidence region for the parameter of interest as

CR_{1−τ}(ψ) = [ℓ, u],

where, for τ = τ_1 + τ_2,

ℓ = ψ̂ − Ḡ_n^{-1}(1 − τ_1, P̂),  u = ψ̂ − G̲_n^{-1}(τ_2, P̂).

This formulation allows for both one-sided intervals (either τ_1 = 0 or τ_2 = 0) and two-sided intervals (τ_1 = τ_2 = τ/2). The following theorem shows that this method delivers (uniformly) valid inference on the parameter of interest.

Theorem 15: Suppose Assumption 1 holds. Then, for any sequence of true parameter values γ_0 = (P', P_X')',

lim inf_{n→∞} Pr_{γ_0}{ψ ∈ [ℓ, u]} ≥ 1 − τ − β.

In practice, we use the following computational approximation to the procedure described above:

1. Draw a potential DGP γ_r = (γ_r1, …, γ_rK), where γ_rk ∼ M(nP̂_k, (P̂_1k, …, P̂_Jk)) / (nP̂_k) and M denotes the multinomial distribution.

2. Keep γ_r if it passes the chi-square goodness-of-fit test at the β level, using K(J − 1) degrees of freedom, and proceed to the next step. Otherwise reject, and repeat step 1.

3. Estimate the distribution G_n(s, γ_r) of S_n(γ_r) by simulation under the DGP γ_r.

4. Repeat steps 1 to 3 for r = 1, …, R, obtaining G_n(s, γ_r), r = 1, …, R.

5. Let G̲_n^{-1}(α, P̂) / Ḡ_n^{-1}(α, P̂) = min / max {G_n^{-1}(α, γ_1), …, G_n^{-1}(α, γ_R)}, and construct a (1 − τ) confidence region for the parameter of interest as CR_{1−τ}(ψ) = [ℓ, u], where ℓ = ψ̂ − Ḡ_n^{-1}(1 − τ_1, P̂), u = ψ̂ − G̲_n^{-1}(τ_2, P̂), and τ_1 + τ_2 = τ.

The computational approximation algorithm necessarily succeeds if it generates at least one draw of the DGP γ_r that gives more conservative estimates of the tail quantiles than the true DGP does, namely

[G_n^{-1}(τ_2, P), G_n^{-1}(1 − τ_1, P)] ⊆ [G_n^{-1}(τ_2, γ_r), G_n^{-1}(1 − τ_1, γ_r)].
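Steps 1 and 2 of this algorithm can be sketched as follows (R, β, and the inputs are hypothetical tuning choices; each covariate cell is resampled from a multinomial and the draw is screened with the chi-square statistic):

```python
import numpy as np
from scipy.stats import chi2

def draw_screened_dgps(P_hat, pk_hat, n, R=100, beta=0.01, seed=0):
    # Step 1: gamma_rk ~ M(n * P_k, (P_1k, ..., P_Jk)) / (n * P_k)
    # Step 2: keep the draw if it passes the chi-square screen with K(J-1) df
    rng = np.random.default_rng(seed)
    J, K = P_hat.shape
    crit = chi2.ppf(1 - beta, K * (J - 1))
    kept = []
    while len(kept) < R:
        g = np.column_stack(
            [rng.multinomial(int(n * pk_hat[k]), P_hat[:, k]) / int(n * pk_hat[k])
             for k in range(K)])
        W = n * np.sum(pk_hat[None, :] * (P_hat - g) ** 2 / np.clip(g, 1e-12, None))
        if W <= crit:
            kept.append(g)
    return kept
```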

    9 Empirical Example

We now turn to an empirical application of our methods to a binary choice panel model of female labor force participation. It is based on a sample of married women in the National Longitudinal Survey of Youth 1979 (NLSY79). We focus on the relationship between participation and the presence of young children in the years 1990, 1992, and 1994. The NLSY79 data set is convenient for applying our methods because it provides a relatively homogeneous sample of women between 25


and 33 years old in 1990, which reduces the extent of other potential confounding factors that may affect the participation decision, such as the age profile, and that are more difficult to incorporate

    in our methods. Other studies that estimate similar models of participation in panel data include

    Heckman and MaCurdy (1980), Heckman and MaCurdy (1982), Chamberlain (1984), Hyslop

(1999), Chay and Hyslop (2000), Carrasco (2001), Carro (2007), and Fernández-Val (2008).

The sample consists of 1,587 married women. Only women who were continuously married, not students or in the active forces, and with complete information on the relevant variables in the entire

sample period are selected from the survey. Descriptive statistics for the sample are shown in Table 2. The labor force participation variable ($LFP$) is an indicator that takes the value one if the woman's employment status is in the labor force according to the CPS definition, and zero otherwise. The fertility variable ($kids$) indicates whether the woman has any child less than 3 years old. We focus on very young preschool children because most empirical studies find that their presence has the strongest impact on the mother's participation decision. $LFP$ is stable across the years considered, whereas $kids$ is increasing. The proportion of women who change fertility status grows steadily with the number of time periods of the panel, but there are still 49% of the women in the sample for whom the effect of fertility is not identified after 3 periods.

The empirical specification we use is similar to Chamberlain (1984). In particular, we estimate the following equation

$$LFP_{it} = 1\{\beta \, kids_{it} + \alpha_i + \epsilon_{it} \geq 0\}, \qquad (34)$$

where $\alpha_i$ is an individual specific effect. The parameters of interest are the marginal effects of fertility on participation for different groups of individuals including the entire population.

These effects are estimated using the general conditional mean model and semiparametric logit and probit models described in Sections 2 and 5, together with linear and nonlinear fixed effects estimators. Analytical and Jackknife large-$T$ bias corrections are also considered, and conditional fixed effects estimates are reported for the logit model.$^3$ The estimates from the general model impose monotonicity of the effects. For the semiparametric estimators, we use the algorithm described in the appendix with penalty $\epsilon_n = 1/(n \log n)$ and iterate the quadratic program 3 times with initial weights $w_{jk} = n \widehat{P}_k$. This iteration makes the estimates insensitive to the penalty and weighting. We search over discrete distributions with 23 support points at $\{-\infty, -4, -3.6, \ldots, 3.6, 4, \infty\}$ in the quadratic problem, and with 163 support points at $\{-\infty, -8, -7.9, \ldots, 7.9, 8, \infty\}$ in the linear programming problems. The estimates are based on panels of 2 and 3 time periods, both of them starting in 1990.
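As a quick check of the grid sizes quoted above, the two support sets can be constructed as follows (a sketch; the `np.arange` endpoints are padded slightly to avoid floating-point truncation of the last grid point):

```python
import numpy as np

# 23-point grid for the quadratic problem: {-inf, -4, -3.6, ..., 3.6, 4, inf}
quad_grid = np.concatenate(([-np.inf], np.arange(-4.0, 4.0 + 1e-9, 0.4), [np.inf]))

# 163-point grid for the linear programs: {-inf, -8, -7.9, ..., 7.9, 8, inf}
lin_grid = np.concatenate(([-np.inf], np.arange(-8.0, 8.0 + 1e-9, 0.1), [np.inf]))

assert len(quad_grid) == 23 and len(lin_grid) == 163
```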

Tables 3 and 4 report estimates of the model parameters and marginal effects for 2 and 3

$^3$ The analytical corrections use the estimators of the bias based on expected quantities in Fernández-Val (2008). The Jackknife bias correction uses the procedure described in Hahn and Newey (2004).


period panels, together with 95% confidence regions obtained using the procedures described in the previous section. For the general model these regions are constructed using the normal approximation (95% N) and nonparametric bootstrap with 200 repetitions (95% B). For the logit and probit models, the confidence regions are obtained by the modified projection method (95% MP), where the confidence interval for $P$ in the first stage is approximated by 50,000 DGPs drawn from the empirical multinomial distributions that pass the goodness of fit test; and the perturbed bootstrap method (95% PB) with $R = 100$, $\gamma = .01$, $\alpha_1 = \alpha_2 = .02$, and 200 simulations from each DGP to approximate the distribution of the statistic. We also include confidence intervals obtained by a canonical projection method that intersects the nonparametric confidence interval for $P$ with the space of probabilities compatible with the semiparametric model:

$$CR_{1-\alpha}(P) = \left\{ \eta : W(\eta, \widehat{P}) \leq c_{1-\alpha}\left(\chi^2_{K(J-1)}\right) \right\}.$$

For the fixed effects estimators, the confidence regions are based on the asymptotic normal approximation. The semiparametric estimates are shown for $\epsilon_n = 0$, i.e., for the solution that

gives the minimum value in the quadratic problem.

Overall, we find that the estimates and confidence regions based on the general conditional mean model are too wide to provide informative evidence about the relationship between participation and fertility for the entire population. The semiparametric estimates seem to offer a good compromise, producing more accurate results without adding too much structure to the model. Thus, these estimates are always inside the confidence regions of the general model and do not suffer important efficiency losses relative to the more restrictive fixed effects estimates. Another salient feature of the results is that the misspecification problem of the canonical projection method clearly arises in this application. Thus, this procedure gives empty confidence regions for the panel with 3 periods. The modified projection and perturbed bootstrap methods produce similar (non-empty) confidence regions for the model parameters and marginal effects.

    10 Possible Extensions

Our analysis is so far confined to models with only discrete explanatory variables. It would be interesting to extend the analysis to models with continuous explanatory variables. It may be possible to come up with a sieve-type modification. We expect to obtain a consistent estimator of the bound by applying the semiparametric method combined with an increasing number of partitions of the support of the explanatory variables, but we do not yet have a proof. Empirical likelihood based methods should work in a straightforward manner if the panel model of interest


is characterized by a set of moment restrictions instead of a likelihood. We may be able to improve the finite-sample properties of our confidence region by using Bartlett-type corrections.

    11 Appendix

    11.1 Proofs

Proof of Theorem 1: By eq. (3),

$$\sum_t (X_{kt} - r_k) E[Y_{it} \mid X_i = X_k] = T r_k (1 - r_k) \int m(1, \alpha) Q_k(d\alpha) + T (1 - r_k)(-r_k) \int m(0, \alpha) Q_k(d\alpha) = T \sigma_k^2 \mu_k. \qquad (35)$$

Note also that $\bar{X}_i = r_k$ when $X_i = X_k$. Then by the law of large numbers,

$$\sum_{i,t} (X_{it} - \bar{X}_i)^2 / n \xrightarrow{p} E\left[\sum_t (X_{it} - \bar{X}_i)^2\right] = \sum_k P_k \sum_t (X_{kt} - r_k)^2 = \sum_k P_k T \sigma_k^2,$$

$$\sum_{i,t} (X_{it} - \bar{X}_i) Y_{it} / n \xrightarrow{p} E\left[\sum_t (X_{it} - \bar{X}_i) Y_{it}\right] = \sum_k P_k \sum_t (X_{kt} - r_k) E[Y_{it} \mid X_i = X_k] = \sum_k P_k T \sigma_k^2 \mu_k.$$

Dividing and applying the continuous mapping theorem gives the result. Q.E.D.
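The limit obtained in this proof, $\sum_k P_k T \sigma_k^2 \mu_k / \sum_k P_k T \sigma_k^2$, is a variance-weighted average of the cell effects $\mu_k$, so cells of stayers ($\sigma_k^2 = 0$) receive zero weight. A toy computation with hypothetical numbers makes the point:

```python
# Hypothetical cells: probabilities P_k, within-cell regressor variances
# sigma_k^2 = r_k(1 - r_k), and cell-specific effects mu_k; T cancels.
P = [0.5, 0.3, 0.2]
sigma2 = [0.25, 0.25, 0.0]   # third cell consists of stayers
mu = [0.4, -0.1, 0.7]

beta_tilde = sum(p * s * m for p, s, m in zip(P, sigma2, mu)) / \
             sum(p * s for p, s in zip(P, sigma2))
# The stayers' effect mu[2] = 0.7 never enters the limit, illustrating why
# the linear fixed effects estimator is inconsistent for the overall effect.
assert abs(beta_tilde - 0.2125) < 1e-12
```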

Proof of Theorem 2: The set of $X_i$ where $r_i > 0$ and $\tilde{r}_i > 0$ coincides with the set for which $X_i = X_k$ for $k \in \bar{\mathcal{K}}$. On this set it will be the case that $r_i$ and $\tilde{r}_i$ are bounded away from zero. Note also that for $t$ such that $X_{kt} = \bar{x}$ we have $E[Y_{it} \mid X_i = X_k] = \int m(\bar{x}, \alpha) Q_k(d\alpha)$. Therefore, for $r_k = \#\{t : X_{kt} = \bar{x}\}/T$ and $\tilde{r}_k = \#\{t : X_{kt} = \tilde{x}\}/T$, by the law of large numbers,

$$\frac{1}{n} \sum_{i=1}^n 1(r_i > 0) 1(\tilde{r}_i > 0) \left\{ \frac{\sum_{t=1}^T d_{it} Y_{it}}{T r_i} - \frac{\sum_{t=1}^T \tilde{d}_{it} Y_{it}}{T \tilde{r}_i} \right\} / D \xrightarrow{p} E\left[ 1(r_i > 0) 1(\tilde{r}_i > 0) \left\{ \frac{\sum_{t=1}^T d_{it} Y_{it}}{T r_i} - \frac{\sum_{t=1}^T \tilde{d}_{it} Y_{it}}{T \tilde{r}_i} \right\} \right] / D$$
$$= E\left[ 1(r_i > 0) 1(\tilde{r}_i > 0) \left\{ \frac{\sum_{t=1}^T d_{it} E[Y_{it} \mid X_i]}{T r_i} - \frac{\sum_{t=1}^T \tilde{d}_{it} E[Y_{it} \mid X_i]}{T \tilde{r}_i} \right\} \right] / D$$
$$= \sum_{k \in \bar{\mathcal{K}}} P_k \left\{ \frac{T r_k \int m(\bar{x}, \alpha) Q_k(d\alpha)}{T r_k} - \frac{T \tilde{r}_k \int m(\tilde{x}, \alpha) Q_k(d\alpha)}{T \tilde{r}_k} \right\} / D = \sum_{k \in \bar{\mathcal{K}}} P_k \Delta_k,$$

$$\frac{1}{n} \sum_{i=1}^n 1(r_i > 0) 1(\tilde{r}_i > 0) \xrightarrow{p} E[1(r_i > 0) 1(\tilde{r}_i > 0)] = \sum_{k \in \bar{\mathcal{K}}} P_k.$$

Dividing and applying the continuous mapping theorem gives the result. Q.E.D.


Proof of Lemma 3: As before let $Q_k(\alpha)$ denote the conditional CDF of $\alpha$ given $X_i = X_k$. Note that

$$m_{kt} = \frac{E[Y_{it} \mid X_i = X_k]}{D} = \frac{\int m(X_{kt}, \alpha) Q_k(d\alpha)}{D}.$$

Also we have

$$\Delta_k = \int \Delta(\alpha) Q_k(d\alpha) = \frac{\int m(\bar{x}, \alpha) Q_k(d\alpha)}{D} - \frac{\int m(\tilde{x}, \alpha) Q_k(d\alpha)}{D}.$$

Then if there are $t$ and $\tilde{t}$ such that $X_{kt} = \bar{x}$ and $X_{k\tilde{t}} = \tilde{x}$,

$$m_{kt} - m_{k\tilde{t}} = \frac{\int m(\bar{x}, \alpha) Q_k(d\alpha)}{D} - \frac{\int m(\tilde{x}, \alpha) Q_k(d\alpha)}{D} = \Delta_k.$$

Also, if $B_\ell \leq m(x, \alpha)/D \leq B_u$, then for each $k$,

$$B_\ell \leq \frac{\int m(\bar{x}, \alpha) Q_k(d\alpha)}{D} \leq B_u, \qquad -B_u \leq -\frac{\int m(\tilde{x}, \alpha) Q_k(d\alpha)}{D} \leq -B_\ell.$$

Then if there is $t$ such that $X_{kt} = \bar{x}$ we have

$$m_{kt} - B_u = \frac{\int m(\bar{x}, \alpha) Q_k(d\alpha)}{D} - B_u \leq \Delta_k \leq \frac{\int m(\bar{x}, \alpha) Q_k(d\alpha)}{D} - B_\ell = m_{kt} - B_\ell.$$

The second inequality in the statement of the theorem follows similarly.

Next, if $\Delta(\alpha)$ has the same sign for all $\alpha$ and if for some $\bar{k}$ there are $t$ and $\tilde{t}$ such that $X_{\bar{k}t} = \bar{x}$ and $X_{\bar{k}\tilde{t}} = \tilde{x}$, then $\mathrm{sgn}(\Delta(\alpha)) = \mathrm{sgn}(\Delta_{\bar{k}})$. Furthermore, since $\mathrm{sgn}(\Delta_k) = \mathrm{sgn}(\Delta_{\bar{k}})$ is then known for all $k$, if it is positive the lower bounds, which are nonpositive, can be replaced by zero, while if it is negative the upper bounds, which are nonnegative, can be replaced by zero. Q.E.D.

    Proof of Theorem 4: See text.

Proof of Theorem 5: Let $Z_{iT} = \min\{\sum_{t=1}^T 1(X_{it} = \bar{x})/T, \sum_{t=1}^T 1(X_{it} = \tilde{x})/T\}$. Note that if $Z_{iT} > 0$ then $1(A_{iT}) = 1$ for the event $A_{iT}$ that there exist $\bar{t}$ and $\tilde{t}$ such that $X_{i\bar{t}} = \bar{x}$ and $X_{i\tilde{t}} = \tilde{x}$. By the ergodic theorem and continuity of the minimum, conditional on $\alpha_i$ we have

$$Z_{iT} \xrightarrow{as} b(\alpha_i) = \min\{\Pr(X_{it} = \bar{x} \mid \alpha_i), \Pr(X_{it} = \tilde{x} \mid \alpha_i)\} > 0.$$

Therefore $\Pr(A_{iT} \mid \alpha_i) \geq \Pr(Z_{iT} > 0 \mid \alpha_i) \to 1$ for almost all $\alpha_i$. It then follows by the dominated convergence theorem that

$$\Pr(A_{iT}) = E[\Pr(A_{iT} \mid \alpha_i)] \to 1.$$

Also note that $\Pr(A_{iT}) = 1 - P_0 - \sum_{k \in \mathcal{K}_{\bar{x}}} P_k - \sum_{k \in \mathcal{K}_{\tilde{x}}} P_k$, the two sums running over cells $X_k$ that contain only one of $\bar{x}$, $\tilde{x}$, so that

$$|\mu - \mu_0| \leq (B_u - B_\ell) \left( P_0 + \sum_{k \in \mathcal{K}_{\bar{x}}} P_k + \sum_{k \in \mathcal{K}_{\tilde{x}}} P_k \right) \to 0.$$

Q.E.D.


Proof of Theorem 6: Let $P_1$ and $P_K$ be as in equation (6). By the Markov assumption,

$$P_1 = \Pr(X_{i1} = \cdots = X_{iT} = 0) = E[\Pr(X_{i1} = \cdots = X_{iT} = 0 \mid \alpha_i)]$$
$$= E\left[ \prod_{t=J+1}^T \Pr(X_{it} = 0 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 0, \alpha_i) \Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) \right] \leq E[(p_{1i})^{T-J}].$$

Similarly, $P_K \leq E[(p_{Ki})^{T-J}]$. The first bound then follows as in (6). The second bound then follows from the condition $p_{ki} \leq 1$ for $k \in \{1, K\}$. Now suppose that there is a set $A$ of possible $\alpha_i$ such that $\Pr(A) > 0$, $Q_i = \Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) > 0$, and $p_{1i} = 1$. Then

$$P_1 = E[(p_{1i})^{T-J} Q_i] \geq E[1(\alpha_i \in A)(p_{1i})^{T-J} Q_i] = E[1(\alpha_i \in A) Q_i] > 0.$$

Therefore, for all $T$ the probability $P_1$ is bounded away from zero, and hence $\mu_\ell \nrightarrow \mu_0$ or $\mu_u \nrightarrow \mu_0$. Q.E.D.
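The mechanism in the last step can be illustrated numerically: with a positive mass of "staying" types ($p_{1i} = 1$), $P_1$ stays bounded below no matter how large $T$ is. All numbers below are hypothetical:

```python
# Mixture of two types: movers with p1 = 0.6 and stayers with p1 = 1.
w_stay = 0.2   # Pr(alpha_i in A), the mass of staying types
Q = 0.9        # Pr(X_i1 = ... = X_iJ = 0 | alpha_i), taken equal across types
J = 1
for T in (5, 10, 50, 500):
    P1 = (1 - w_stay) * 0.6 ** (T - J) * Q + w_stay * 1.0 ** (T - J) * Q
    # P1 never falls below the stayers' contribution, for every T.
    assert P1 >= w_stay * Q
```

As $T$ grows, the movers' term vanishes geometrically while the stayers' term $w_{stay} \cdot Q$ persists, which is why the bounds do not shrink to a point in this case.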

Proof of Theorem 7: Note that every $X_i \in \mathcal{X}_t(x)$ has $X_{it} = x$. Also, the $X_{is}$ for $s > t$ are completely unrestricted by $X_i \in \mathcal{X}_t(x)$. Therefore, it follows by the key implication that

$$E[Y_{it} \mid X_i \in \mathcal{X}_t(x)] = \int m(x, \alpha) Q(d\alpha \mid X_i \in \mathcal{X}_t(x)).$$

Then by iterated expectations,

$$\int m(x, \alpha) Q(d\alpha) = P(x) \int m(x, \alpha) Q(d\alpha \mid X_i \in \mathcal{X}(x)) + \sum_{t=1}^T \Pr(X_i \in \mathcal{X}_t(x)) \int m(x, \alpha) Q(d\alpha \mid X_i \in \mathcal{X}_t(x))$$
$$= P(x) \int m(x, \alpha) Q(d\alpha \mid X_i \in \mathcal{X}(x)) + E\left[ \sum_{t=1}^T 1(X_i \in \mathcal{X}_t(x)) Y_{it} \right].$$

Using the bound and dividing by $D$ then gives

$$E\left[ \sum_{t=1}^T 1(X_i \in \mathcal{X}_t(x)) Y_{it} \right] / D + P(x) B_\ell \leq \int m(x, \alpha) Q(d\alpha) / D \leq E\left[ \sum_{t=1}^T 1(X_i \in \mathcal{X}_t(x)) Y_{it} \right] / D + P(x) B_u.$$

Differencing this bound for $x = \bar{x}$ and $x = \tilde{x}$ gives the result. Q.E.D.

Proof of Theorem 8: The size of the identified set for the marginal effect is

$$\bar{\Delta}_k - \underline{\Delta}_k = \max_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int [F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha)] Q_k(d\alpha) - \min_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int [F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha)] Q_k(d\alpha),$$


where $\mathcal{Q}_k = \{Q_k : \int L(Y^j \mid X_k, \beta, \alpha) Q_k(d\alpha) = P_{jk}, \ j = 1, \ldots, J\}$. The feasible set of distributions $\mathcal{Q}_k$ can be further characterized in this case. Let $F_T(\alpha, \beta) := (1, F(X_{k1}\beta + \alpha), \ldots, F(X_{kT}\beta + \alpha))$ and let $F_J(\alpha, \beta)$ denote the $J \times 1$ power vector of $F_T(\alpha, \beta)$ including all the different products of the elements of $F_T(\alpha, \beta)$, i.e.,

$$F_J(\alpha, \beta) = \left(1, \ldots, F(X_{kT}\beta + \alpha), F(X_{k1}\beta + \alpha) F(X_{k2}\beta + \alpha), \ldots, \prod_{t=1}^T F(X_{kt}\beta + \alpha)\right).$$

Note that $L(Y^j \mid X_k, \beta, \alpha) = \prod_{t=1}^T F(X_{kt}\beta + \alpha)^{Y_t^j} \{1 - F(X_{kt}\beta + \alpha)\}^{1 - Y_t^j}$, so the model probabilities are linear combinations of the elements of $F_J(\alpha, \beta)$. Therefore, for $\Pi_k = (P_{1k}, \ldots, P_{Jk})$ we have $\mathcal{Q}_k = \{Q_k : A_J \int F_J(\alpha, \beta) Q_k(d\alpha) = \Pi_k\}$, where $A_J$ is a $J \times J$ matrix of known constants. The matrix $A_J$ is nonsingular, so we have:

$$\mathcal{Q}_k = \left\{ Q_k : \int F_J(\alpha, \beta) Q_k(d\alpha) = M_k \right\},$$

where the $J \times 1$ vector $M_k = A_J^{-1} \Pi_k$ is identified from the data.

Now we turn to the analysis of the size of the identified sets. We focus on the case where $k = 1$, i.e., $X_k$ is a vector of zeros, and a similar argument applies to $k = K$. For $k = 1$ we have that $F(X_{kt}\beta + \alpha) = F(\alpha)$ for all $t$, so the power vector only has $T + 1$ different elements given by $(1, F(\alpha), \ldots, F(\alpha)^T)$. The feasible set simplifies to:

$$\mathcal{Q}_k = \left\{ Q_k : \int F(\alpha)^t Q_k(d\alpha) = M_{kt}, \ t = 0, \ldots, T \right\},$$

where the moments $M_{kt}$ are identified by the data. Here $\int F(\alpha) Q_k(d\alpha) = M_{k1}$ is fixed in $\mathcal{Q}_k$, so the size of the identified set is given by:

$$\bar{\Delta}_k - \underline{\Delta}_k = \max_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int F(\beta + \alpha) Q_k(d\alpha) - \min_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int F(\beta + \alpha) Q_k(d\alpha).$$

By a change of variable, $Z = F(\alpha)$, we can express the previous problem in a form that is related to a Hausdorff truncated moment problem:

$$\bar{\Delta}_k - \underline{\Delta}_k = \max_{G_k \in \mathcal{G}_k, \beta \in B} D^{-1} \int_0^1 h_\beta(z) G_k(dz) - \min_{G_k \in \mathcal{G}_k, \beta \in B} D^{-1} \int_0^1 h_\beta(z) G_k(dz), \qquad (36)$$

where $\mathcal{G}_k = \{G_k : \int_0^1 z^t G_k(dz) = M_{kt}, \ t = 0, \ldots, T\}$, $h_\beta(z) = F(\beta + F^{-1}(z))$, and $F^{-1}$ is the inverse of $F$.

If the objective function is $r$ times continuously differentiable, $h_\beta \in \mathcal{C}^r[0, 1]$, with uniformly bounded $r$-th derivative, $\|h_\beta^{(r)}\|_\infty \leq \bar{h}_r$, then we can decompose $h_\beta$ using standard approximation theory techniques as

$$h_\beta(z) = P_\beta(z, T) + R_\beta(z, T), \qquad (37)$$


where $P_\beta(z, T)$ is the $T$-degree best polynomial approximation to $h_\beta$ and $R_\beta(z, T)$ is the remainder term of the approximation; see, e.g., Judd (1998), Chap. 3. By Jackson's Theorem the remainder term is uniformly bounded by

$$\|R_\beta(\cdot, T)\|_\infty \leq \frac{(T - r)!}{T!} \left(\frac{\pi}{4}\right)^r \bar{h}_r = O(T^{-r}), \qquad (38)$$

as $T \to \infty$, and this is the best possible uniform rate of approximation by a $T$-degree polynomial. Next, note that for any $G_k \in \mathcal{G}_k$ we have that $\int_0^1 P_\beta(z, T) G_k(dz)$ is fixed, since the first $T$ moments of $Z$ are fixed at $\mathcal{G}_k$. Moreover, $\int_0^1 P_\beta(z, T) G_k(dz)$ is fixed at $\beta$ if the parameter is point identified, $B = \{\beta\}$. Then, we have

$$\bar{\Delta}_k - \underline{\Delta}_k = \max_{G_k \in \mathcal{G}_k} \int_0^1 R_\beta(z, T) G_k(dz) - \min_{G_k \in \mathcal{G}_k} \int_0^1 R_\beta(z, T) G_k(dz) \leq 2 \frac{(T-r)!}{T!} \left(\frac{\pi}{4}\right)^r \bar{h}_r = O(T^{-r}). \qquad (39)$$

To complete the proof, we need to check the continuous differentiability condition and the point identification of the parameter for the logit model. Point identification follows from Chamberlain (1992). For differentiability, note that for the logit model

$$h_\beta(z) = \frac{z e^\beta}{1 - (1 - e^\beta) z}, \qquad (40)$$

with derivatives

$$h_\beta^{(r)}(z) = \frac{r! \, e^\beta (1 - e^\beta)^{r-1}}{[1 - (1 - e^\beta) z]^{r+1}}. \qquad (41)$$

These derivatives are uniformly bounded by $\bar{h}_r = r! \, e^{|\beta|} (e^{|\beta|} - 1)^{r-1} < \infty$ for any finite $r$. Q.E.D.
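As a numerical sanity check on formulas (40) and (41) as reconstructed above, the first derivative implied by (41) matches a finite-difference derivative of (40):

```python
import math

beta = 0.7

def h(z):
    # h_beta(z) = z * e^beta / (1 - (1 - e^beta) z), eq. (40)
    return z * math.exp(beta) / (1.0 - (1.0 - math.exp(beta)) * z)

def h_deriv(z, r):
    # r-th derivative from eq. (41)
    c = 1.0 - math.exp(beta)
    return math.factorial(r) * math.exp(beta) * c ** (r - 1) / (1.0 - c * z) ** (r + 1)

z, eps = 0.3, 1e-6
fd = (h(z + eps) - h(z - eps)) / (2 * eps)  # central difference
assert abs(fd - h_deriv(z, 1)) < 1e-5
```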

Proof of Theorem 9: Note that for $T = 2$ and $X$ binary, we have that $K = 4$. Let $X_1 = (0, 0)$, $X_2 = (0, 1)$, $X_3 = (1, 0)$, and $X_4 = (1, 1)$. By Lemma 3, $\mu_I$ is identified by

$$\mu_I = P_2 \left[ \Pr\{Y = (0, 1) \mid X_2\} - \Pr\{Y = (1, 0) \mid X_2\} \right] + P_3 \left[ \Pr\{Y = (1, 0) \mid X_3\} - \Pr\{Y = (0, 1) \mid X_3\} \right].$$

The probability limit of the fixed effects estimator for this effect is

$$\tilde{\mu}_I = \sum_{k=2}^3 P_k \left[ \Pr\{Y = (0, 1) \mid X_k\} + \Pr\{Y = (1, 0) \mid X_k\} \right] \left[ F(\tilde{\beta}/2) - F(-\tilde{\beta}/2) \right].$$

The condition for consistency, $\tilde{\mu}_I = \mu_I$, can be written as

$$F(\tilde{\beta}/2) = \frac{P_2 \Pr\{Y = (0, 1) \mid X_2\} + P_3 \Pr\{Y = (1, 0) \mid X_3\}}{\sum_{k=2}^3 P_k \left[ \Pr\{Y = (0, 1) \mid X_k\} + \Pr\{Y = (1, 0) \mid X_k\} \right]},$$

but this is precisely the first order condition of the program (16). This result follows, after some algebra and using the symmetry property of $F$, by solving the profile problem

$$\tilde{\beta} = \operatorname*{argmax}_\beta \sum_{k=1}^K P_k \left[ \Pr\{Y = (0, 1) \mid X_k\} \log F(\Delta X_k \beta / 2) + \Pr\{Y = (1, 0) \mid X_k\} \log F(-\Delta X_k \beta / 2) \right],$$


where $\Delta X_k = X_{k2} - X_{k1}$. Q.E.D.

Proof of Lemma 10: First, by $\beta \in B$, we have that $T(\beta; P) = 0$ and therefore any $Q_k \in \arg\min_{Q_k} \sum_{j=1}^J \omega_{jk}(P) (P_{jk} - L_{jk}(\beta, Q_k))^2$ satisfies $L_{jk}(\beta, Q_k) = P_{jk} \ \forall j$, for each $k$. Let the vector of conditional choice probabilities for $(Y^1, \ldots, Y^J)$ be

$$L_k(\beta, \alpha) \equiv \left( L(Y^1 \mid X_k, \beta, \alpha), \ldots, L(Y^J \mid X_k, \beta, \alpha) \right).$$

Let $\mathcal{L}_k(\beta) \equiv \{L_k(\beta, \alpha) : \alpha \in C\}$. Note that, for each $\beta \in B$, $\mathcal{L}_k(\beta)$ is a closed and bounded set due to compactness of $C$, and has at most dimension $J - 1$ since the sum of the elements of $L_k(\beta, \alpha)$ is one. Now, let $\mathcal{M}_k(\beta)$ denote the convex hull of $\mathcal{L}_k(\beta)$. For any $\beta \in B$ we have that there is at least one $Q_k$ such that $L_{jk}(\beta, Q_k) = P_{jk} \ \forall j$, i.e.,

$$(P_{1k}, \ldots, P_{Jk}) \in \mathcal{M}_k(\beta).$$

By Caratheodory's Theorem any point in $\mathcal{M}_k(\beta)$ can be written as a convex combination of at most $J$ vectors located in $\mathcal{L}_k(\beta)$. Then, we can write

$$(P_{1k}, \ldots, P_{Jk}) = \sum_{m=1}^J \pi_{km} L_k(\alpha_{km}, \beta),$$

where $(\pi_{k1}, \ldots, \pi_{kJ})$ is on the unit simplex $\mathcal{S}$ of dimension $J$. Thus, the discrete distribution with $J$ support points at $(\alpha_{k1}, \ldots, \alpha_{kJ})$ and probabilities $(\pi_{k1}, \ldots, \pi_{kJ})$ solves the population problem for $Q_k$. The result also follows from Lindsay (1995, Theorem 18, p. 112, and Theorem 21, p. 116), though Lindsay does not provide proofs for his theorems. Q.E.D.

Proof of Lemma 11: For $\beta \in B$, let $\mathcal{Q}_k = \{Q_k : L_{jk}(\beta, Q_k) = P_{jk}, \ j = 1, \ldots, J\}$. Let $Q_k^* \in \mathcal{Q}_k$ denote some maximizing value such that

$$\bar{\Delta}_k = D^{-1} \int_C \left[ F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha) \right] Q_k^*(d\alpha).$$

Note that, for any $\epsilon > 0$ we can find a distribution $Q_k^M \in \mathcal{Q}_k$ with a large number $M \geq J$ of support points $(\alpha_1, \ldots, \alpha_M)$ such that

$$\bar{\Delta}_k - \epsilon < D^{-1} \int_C \left[ F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha) \right] Q_k^M(d\alpha) \leq \bar{\Delta}_k.$$

Our goal is to show that given such $Q_k^M$ it suffices to allocate its mass over only at most $J$ support points. Indeed, consider the problem of allocating $(\pi_{k1}, \ldots, \pi_{kM})$ among $(\alpha_1, \ldots, \alpha_M)$ in order to solve

$$\max_{(\pi_{k1}, \ldots, \pi_{kM})} \sum_{m=1}^M \left[ F(\bar{x}\beta + \alpha_m) - F(\tilde{x}\beta + \alpha_m) \right] \pi_{km}$$


subject to the constraints:

$$\pi_{km} \geq 0, \ m = 1, \ldots, M, \qquad \sum_{m=1}^M \pi_{km} L(Y^j \mid X_k, \alpha_m, \beta) = P_{jk}, \ j = 1, \ldots, J, \qquad \sum_{m=1}^M \pi_{km} = 1.$$

This is a linear program of the form

$$\max_{\pi \in \mathbb{R}^M} c'\pi \ \text{ such that } \ \pi \geq 0, \ A\pi = b, \ 1'\pi = 1,$$

and any basic feasible solution to this program has $M$ active constraints, of which at most $\mathrm{rank}(A) + 1$ can be equality constraints. This means that at least $M - \mathrm{rank}(A) - 1$ of the active constraints are of the form $\pi_{km} = 0$; see, e.g., Theorem 2.3 and Definition 2.9 (ii) in Bertsimas and Tsitsiklis (1997). Hence a basic solution to this linear programming problem will have at least $M - J$ zeroes, that is, at most $J$ strictly positive $\pi_{km}$'s.$^4$ Thus, we have shown that given the original $Q_k^M$ with $M \geq J$ points of support there exists a distribution $Q_k^L \in \mathcal{Q}_k$ with just $J$ points of support such that

$$\bar{\Delta}_k - \epsilon < D^{-1} \int_C \left[ F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha) \right] Q_k^M(d\alpha) \leq D^{-1} \int_C \left[ F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha) \right] Q_k^L(d\alpha) \leq \bar{\Delta}_k.$$

This construction works for every $\epsilon > 0$.

The final claim is that there exists a distribution $Q_k^L \in \mathcal{Q}_k$ with $J$ points of support $(\alpha_{k1}, \ldots, \alpha_{kJ})$ such that

$$\bar{\Delta}_k = D^{-1} \int_C \left[ F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha) \right] Q_k^L(d\alpha).$$

Suppose otherwise; then it must be that

$$\bar{\Delta}_k > \bar{\Delta}_k - \epsilon \geq D^{-1} \int_C \left[ F(\bar{x}\beta + \alpha) - F(\tilde{x}\beta + \alpha) \right] Q_k^L(d\alpha),$$

for some $\epsilon > 0$ and for all $Q_k^L$ with $J$ points of support. This immediately gives a contradiction to the previous step, where we have shown that, for any $\epsilon > 0$, $\bar{\Delta}_k$ and the right hand side can be brought within strictly less than $\epsilon$ of each other. Q.E.D.
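The linear-programming step can be illustrated numerically: a basic optimal solution of a program with $J$ equality rows puts mass on at most $J$ support points. The sketch below uses hypothetical logistic moment rows (stand-ins for the choice probabilities) and `scipy.optimize.linprog` with a simplex-type solver, which returns a basic (vertex) solution:

```python
import numpy as np
from scipy.optimize import linprog

M, J = 8, 3                       # M candidate support points, J equality rows
alpha = np.linspace(-2.0, 2.0, M)

# Two hypothetical "choice probability" moment rows:
rows = np.vstack([1.0 / (1.0 + np.exp(-alpha)),
                  1.0 / (1.0 + np.exp(-(alpha - 1.0)))])
pi0 = np.full(M, 1.0 / M)         # a known feasible allocation
A_eq = np.vstack([rows, np.ones(M)])
b_eq = np.append(rows @ pi0, 1.0)

c = np.cos(alpha)                 # stand-in objective weights
res = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * M,
              method="highs-ds")  # dual simplex -> basic (vertex) solution
assert res.success
assert np.sum(res.x > 1e-8) <= J  # mass on at most J support points
```

The feasible point `pi0` guarantees the program has a solution, and the simplex constraint keeps it bounded, mirroring the structure of the program in the proof.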

$^4$ Note that $\mathrm{rank}(A) \leq J - 1$, since $\sum_{j=1}^J L(Y^j \mid X_k, \alpha, \beta) = 1$. The exact rank of $A$ depends on the sequence $X_k$, the parameter $\beta$, the function $F$, and $T$. For $T = 2$ and $X$ binary, for example, $\mathrm{rank}(A) = J - 2 = 2$ when $X_{k1} = X_{k2}$, $\beta = 0$, or $F$ is the logistic distribution; whereas $\mathrm{rank}(A) = J - 1 = 3$ for $X_{k1} \neq X_{k2}$, $\beta \neq 0$, and $F$ any continuous distribution different from the logistic.

Some lemmas are useful for proving Theorem 12.


Lemma A1: Let $T(\beta, Q; \xi) = \sum_{j,k} \omega_{jk}(\xi) (\xi_{jk} - L_{jk}(\beta, Q_k))^2$. If Assumption 1 is satisfied then, for $\mathcal{Q}$ equal to the collection of distributions with support contained in a compact set $C$,

$$\sup_{\beta \in B, Q \in \mathcal{Q}} |T(\beta, Q; \widehat{P}) - T(\beta, Q; P)| = o_P(1).$$

Proof: Note that we can write

$$T(\beta, Q; \widehat{P}) - T(\beta, Q; P) = \sum_{j,k} \omega_{jk}(\widehat{P}) (\widehat{P}_{jk} - P_{jk})^2 + 2 \sum_{j,k} \omega_{jk}(\widehat{P}) (\widehat{P}_{jk} - P_{jk}) (P_{jk} - L_{jk}(\beta, Q_k)) + \sum_{j,k} \left( \omega_{jk}(\widehat{P}) - \omega_{jk}(P) \right) (P_{jk} - L_{jk}(\beta, Q_k))^2.$$

The result then follows from $\widehat{P}_{jk} - P_{jk} = o_P(1)$ and $\omega_{jk}(\widehat{P}) - \omega_{jk}(P) = o_P(1)$ by the continuous mapping theorem. Q.E.D.

From Lemma A1, we obtain one-sided uniform convergence:

Lemma A2: Let $T(\beta; \xi) = \inf_{Q \in \mathcal{Q}} T(\beta, Q; \xi)$. If Assumption 1 is satisfied then

$$\sup_{\beta \in B} |T(\beta; \widehat{P}) - T(\beta; P)| = o_P(1).$$

Proof: Let $\widehat{Q} \in \arg\inf_{Q \in \mathcal{Q}} T(\beta, Q; \widehat{P})$ and $\bar{Q} \in \arg\inf_{Q \in \mathcal{Q}} T(\beta, Q; P)$. By definition of $\widehat{Q}$ and $\bar{Q}$, we have, uniformly in $\beta$ and for all $n$,

$$T(\beta, \bar{Q}; \widehat{P}) - T(\beta, \bar{Q}; P) \geq T(\beta, \widehat{Q}; \widehat{P}) - T(\beta, \bar{Q}; P) \geq T(\beta, \widehat{Q}; \widehat{P}) - T(\beta, \widehat{Q}; P).$$

Hence

$$|T(\beta; \widehat{P}) - T(\beta; P)| \leq \max\left\{ |T(\beta, \bar{Q}; \widehat{P}) - T(\beta, \bar{Q}; P)|, \ |T(\beta, \widehat{Q}; \widehat{P}) - T(\beta, \widehat{Q}; P)| \right\} = o_P(1),$$

uniformly in $\beta$ by Lemma A1. Q.E.D.

Lemma A3: If Assumption 1 is satisfied then $T(\beta; P)$ is continuous in $\beta$.

Proof: By Lemma 10, the problem

$$\inf_{Q \in \mathcal{Q}} T(\beta, Q; P)$$

can be rewritten as

$$\min_{\substack{(\alpha_{1k}, \ldots, \alpha_{Jk}) \in C^J, \ \forall k \\ (\pi_{1k}, \ldots, \pi_{Jk}) \in \mathcal{S}, \ \forall k}} \sum_{j,k} \omega_{jk}(P) \left( P_{jk} - \sum_{m=1}^J \pi_{mk} L(Y^j \mid X_k, \alpha_{mk}, \beta) \right)^2,$$

where $J$ and $K$ are finite, and $\mathcal{S}$ denotes the unit simplex in $\mathbb{R}^J$. Here, $(\alpha_{1k}, \ldots, \alpha_{Jk})$ and $(\pi_{1k}, \ldots, \pi_{Jk})$ characterize discrete distributions with no more than $J$ points of support. Because


the objective function is continuous in $(\beta, \alpha_{11}, \ldots, \alpha_{JK}, \pi_{11}, \ldots, \pi_{JK})$, and because $C^{JK} \times \mathcal{S}^K$ is compact, we can apply the theorem of the maximum (e.g., Stokey and Lucas 1989, Theorem 3.6), and obtain the desired conclusion. Q.E.D.

Lemma A4: If Assumption 1 is satisfied then

$$\sup_{\beta \in B} |T(\beta; \widehat{P}) - T(\beta; P)| = O_P(n^{-1}).$$

Proof: Let $\bar{Q} \in \arg\min_{Q \in \mathcal{Q}} T(\beta, Q; P)$. By Lemma 10, we have that $P_{jk} = L_{jk}(\beta, \bar{Q}_k)$ and $T(\beta; P) = 0$ $\forall \beta \in B$. Then, we have

$$\sup_{\beta \in B} |T(\beta; \widehat{P}) - T(\beta; P)| = \sup_{\beta \in B} T(\beta; \widehat{P}) \leq \sup_{\beta \in B} T(\beta, \bar{Q}; \widehat{P}) = \sum_{j,k} \omega_{jk}(\widehat{P}) (\widehat{P}_{jk} - P_{jk})^2 = O_P(n^{-1}),$$

where the last equality follows from $\widehat{P}_{jk} - P_{jk} = O_P(n^{-1/2})$, $\omega_{jk}(\widehat{P}) = \omega_{jk}(P) + o_P(1)$ by the continuous mapping theorem, and $J$ and $K$ being finite. Q.E.D.

Proof of Theorem 12: The consistency result under Assumption 1 and $\epsilon_n \propto \log n / n$ follows from Theorem 3.1 in Chernozhukov, Hong, and Tamer (2007) with $a_n = n$. Indeed, Condition C.1 in Chernozhukov, Hong, and Tamer (2007) follows by Assumption 1 ($B$ compact), Lemma A3 ($T(\beta; P)$ continuous), Lemma A2 (uniform convergence of $T(\beta; \widehat{P})$ to $T(\beta; P)$ in $\beta \in B$), and Lemma A4 (uniform convergence of $T(\beta; \widehat{P})$ to $T(\beta; P)$ in $\beta \in B$ at a rate $n$).

The consistency result under Assumptions 1 and 2 and $\epsilon_n = 0$ follows from Theorem 3.2 in Chernozhukov, Hong, and Tamer (2007) with $a_n = n$. It is not difficult to show that Assumption 3.2 implies condition C.3 in Chernozhukov, Hong, and Tamer (2007), which, along with the other conditions verified above, implies the consistency result.

The second r